Cargando…

Top-k dominating queries on incomplete large dataset

Top-k dominating (TKD) query is one of the methods to find the interesting objects by returning the k objects that dominate other objects in a given dataset. Incomplete datasets have missing values in uncertain dimensions, so it is difficult to obtain useful information with traditional data mining...

Descripción completa

Detalles Bibliográficos
Autores principales: Wu, Jimmy Ming-Tai, Wei, Min, Wu, Mu-En, Tayeb, Shahab
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Springer US 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/
https://www.ncbi.nlm.nih.gov/pubmed/34421217
http://dx.doi.org/10.1007/s11227-021-04005-x
Descripción
Sumario:Top-k dominating (TKD) query is one of the methods to find the interesting objects by returning the k objects that dominate other objects in a given dataset. Incomplete datasets have missing values in uncertain dimensions, so it is difficult to obtain useful information with traditional data mining methods on complete data. BitMap Index Guided Algorithm (BIG) is a good choice for solving this problem. However, it is even harder to find top-k dominance objects on incomplete big data. When the dataset is too large, the requirements for the feasibility and performance of the algorithm will become very high. In this paper, we proposed an algorithm to apply MapReduce on the whole process with a pruning strategy, called Efficient Hadoop BitMap Index Guided Algorithm (EHBIG). This algorithm can realize TKD query on incomplete datasets through BitMap Index and use MapReduce architecture to make TKD query possible on large datasets. By using the pruning strategy, the runtime and memory usage are greatly reduced. What’s more, we also proposed an improved version of EHBIG (denoted as IEHBIG) which optimizes the whole algorithm flow. Our in-depth work in this article culminates with some experimental results that clearly show that our proposed algorithm can perform well on TKD query in an incomplete large dataset and shows great performance in a Hadoop computing cluster.