
Top-k dominating queries on incomplete large dataset

A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some of their dimensions, so it is difficult to obtain useful information with traditional data mining...

Bibliographic Details
Main Authors: Wu, Jimmy Ming-Tai, Wei, Min, Wu, Mu-En, Tayeb, Shahab
Format: Online Article Text
Language: English
Published: Springer US 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8369331/
https://www.ncbi.nlm.nih.gov/pubmed/34421217
http://dx.doi.org/10.1007/s11227-021-04005-x
author Wu, Jimmy Ming-Tai
Wei, Min
Wu, Mu-En
Tayeb, Shahab
collection PubMed
description A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some of their dimensions, so it is difficult to obtain useful information with traditional data mining methods designed for complete data. The BitMap Index Guided (BIG) algorithm is a good choice for solving this problem. However, finding top-k dominating objects is even harder on incomplete big data. When the dataset is very large, the demands on the algorithm's feasibility and performance become very high. In this paper, we propose the Efficient Hadoop BitMap Index Guided (EHBIG) algorithm, which applies MapReduce to the whole process together with a pruning strategy. It answers TKD queries on incomplete datasets through a bitmap index and uses the MapReduce architecture to make TKD queries feasible on large datasets. The pruning strategy greatly reduces runtime and memory usage. We also propose an improved version of EHBIG, denoted IEHBIG, which optimizes the whole algorithm flow. The experimental results show that the proposed algorithm performs well on TKD queries over an incomplete large dataset and on a Hadoop computing cluster.
format Online
Article
Text
id pubmed-8369331
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer US
record_format MEDLINE/PubMed
spelling pubmed-8369331 2021-08-17 J Supercomput Article Springer US 2021-08-17 2022 /pmc/articles/PMC8369331/ /pubmed/34421217 http://dx.doi.org/10.1007/s11227-021-04005-x Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Top-k dominating queries on incomplete large dataset
topic Article