Top-k dominating queries on incomplete large dataset
A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some dimensions, so it is difficult to obtain useful information with traditional data mining...
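To make the summary concrete, here is a minimal brute-force sketch of a TKD query over incomplete data. It is not the paper's BIG, EHBIG, or IEHBIG algorithm and uses neither a bitmap index nor MapReduce. It assumes a common dominance definition for incomplete data: objects are compared only on dimensions observed in both, and smaller values are better. The names `dominates` and `top_k_dominating` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value (assumed encoding)


def dominates(a: Row, b: Row) -> bool:
    """Return True if a dominates b on the dimensions observed in both.

    Assumption: smaller is better. a must be no worse than b on every
    commonly observed dimension and strictly better on at least one.
    """
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either object
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(data: Sequence[Row], k: int) -> List[Tuple[int, int]]:
    """Return (index, score) for the k objects that dominate the most others.

    Brute force: O(n^2 * d) pairwise comparisons, ties broken by index.
    """
    scores = [
        (i, sum(dominates(a, b) for j, b in enumerate(data) if j != i))
        for i, a in enumerate(data)
    ]
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores[:k]


if __name__ == "__main__":
    data = [
        (1, None, 3),
        (2, 2, None),
        (None, 1, 4),
        (3, 3, 5),
    ]
    print(top_k_dominating(data, 2))  # [(0, 3), (2, 2)]
```

The quadratic pairwise scan is the cost that index-based and distributed approaches such as those described in the record aim to avoid on large datasets.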
Main Authors: | Wu, Jimmy Ming-Tai; Wei, Min; Wu, Mu-En; Tayeb, Shahab |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer US 2021 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/ https://www.ncbi.nlm.nih.gov/pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x |
_version_ | 1783739271091322880 |
---|---|
author | Wu, Jimmy Ming-Tai; Wei, Min; Wu, Mu-En; Tayeb, Shahab |
collection | PubMed |
description | A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some dimensions, so traditional data mining methods designed for complete data struggle to extract useful information from them. The BitMap Index Guided (BIG) algorithm is a good choice for solving this problem. However, finding top-k dominating objects is even harder on incomplete big data: when the dataset is very large, the demands on the feasibility and performance of the algorithm become very high. In this paper, we propose the Efficient Hadoop BitMap Index Guided Algorithm (EHBIG), which applies MapReduce to the whole process together with a pruning strategy. EHBIG answers TKD queries on incomplete datasets through a bitmap index and uses the MapReduce architecture to make TKD queries feasible on large datasets. The pruning strategy greatly reduces runtime and memory usage. We also propose an improved version of EHBIG (denoted IEHBIG) that optimizes the whole algorithm flow. Experimental results show that the proposed algorithm performs well on TKD queries over incomplete large datasets and on a Hadoop computing cluster. |
format | Online Article Text |
id | pubmed-8369331 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Springer US |
record_format | MEDLINE/PubMed |
spelling | pubmed-8369331 2021-08-17 Top-k dominating queries on incomplete large dataset. Wu, Jimmy Ming-Tai; Wei, Min; Wu, Mu-En; Tayeb, Shahab. J Supercomput. Article. Springer US 2021-08-17 2022 /pmc/articles/PMC8369331/ /pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
title | Top-k dominating queries on incomplete large dataset |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/ https://www.ncbi.nlm.nih.gov/pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x |