
Top-k dominating queries on incomplete large dataset

A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some of their dimensions, so it is difficult to obtain useful information from them with traditional data mining...


Bibliographic Details
Main authors:	Wu, Jimmy Ming-Tai, Wei, Min, Wu, Mu-En, Tayeb, Shahab
Format:	Online Article Text
Language:	English
Published:	Springer US 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/ https://www.ncbi.nlm.nih.gov/pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x

author	Wu, Jimmy Ming-Tai Wei, Min Wu, Mu-En Tayeb, Shahab
author_facet	Wu, Jimmy Ming-Tai Wei, Min Wu, Mu-En Tayeb, Shahab
author_sort	Wu, Jimmy Ming-Tai
collection	PubMed
description	A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some of their dimensions, so it is difficult to obtain useful information from them with traditional data mining methods designed for complete data. The BitMap Index Guided Algorithm (BIG) is a good choice for solving this problem. However, finding top-k dominating objects is even harder on incomplete big data. When the dataset is very large, the demands on the feasibility and performance of the algorithm become very high. In this paper, we propose the Efficient Hadoop BitMap Index Guided Algorithm (EHBIG), which applies MapReduce to the whole process together with a pruning strategy. The algorithm answers TKD queries on incomplete datasets through a bitmap index and uses the MapReduce architecture to make TKD queries feasible on large datasets. The pruning strategy greatly reduces runtime and memory usage. We also propose an improved version of EHBIG, denoted IEHBIG, which optimizes the whole algorithm flow. Experimental results show that the proposed algorithm performs well on TKD queries over an incomplete large dataset and achieves strong performance on a Hadoop computing cluster.
format	Online Article Text
id	pubmed-8369331
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-8369331 2021-08-17 Top-k dominating queries on incomplete large dataset Wu, Jimmy Ming-Tai Wei, Min Wu, Mu-En Tayeb, Shahab J Supercomput Article Springer US 2021-08-17 2022 /pmc/articles/PMC8369331/ /pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle	Article Wu, Jimmy Ming-Tai Wei, Min Wu, Mu-En Tayeb, Shahab Top-k dominating queries on incomplete large dataset
title	Top-k dominating queries on incomplete large dataset
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/ https://www.ncbi.nlm.nih.gov/pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x
work_keys_str_mv	AT wujimmymingtai topkdominatingqueriesonincompletelargedataset AT weimin topkdominatingqueriesonincompletelargedataset AT wumuen topkdominatingqueriesonincompletelargedataset AT tayebshahab topkdominatingqueriesonincompletelargedataset
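
The TKD query described in the abstract can be illustrated with a small, naive baseline. This is only a sketch under stated assumptions, not the BIG, EHBIG, or IEHBIG algorithms from the article: it uses no bitmap index, no pruning, and no MapReduce, and simply compares every pair of objects. It assumes the dominance definition commonly used for incomplete data (compare two objects only on the dimensions observed in both), assumes that smaller values are better, and represents a missing value as `None`. The names `dominates` and `top_k_dominating` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# An object is a sequence of values; None marks a missing value (assumption).
Obj = Sequence[Optional[float]]


def dominates(a: Obj, b: Obj) -> bool:
    """Return True if a dominates b on the dimensions observed in both.

    Assumption: smaller values are better. a dominates b when a is no worse
    than b on every commonly observed dimension and strictly better on at
    least one of them.
    """
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either object
        if x > y:
            return False  # a is worse on a shared dimension
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(data: Sequence[Obj], k: int) -> List[Tuple[int, int]]:
    """Return (index, score) for the k objects that dominate the most others.

    Naive O(n^2) pairwise comparison; ties are broken by lower index.
    """
    scores = [
        sum(dominates(a, b) for j, b in enumerate(data) if j != i)
        for i, a in enumerate(data)
    ]
    ranked = sorted(range(len(data)), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]


# Hypothetical sample: four objects, three dimensions, some values missing.
data = [
    (1, 2, None),
    (2, None, 3),
    (None, 3, 4),
    (3, 4, 5),
]
print(top_k_dominating(data, 2))  # [(0, 3), (1, 2)]
```

Because every pair is compared, the cost grows quadratically with the number of objects, which is the kind of cost that the indexing, pruning, and MapReduce techniques named in the abstract are meant to reduce.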
