Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata
BACKGROUND: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in...
Main authors: | Hu, Wei; Zaveri, Amrapali; Qiu, Honglei; Dumontier, Michel |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5604298/ https://www.ncbi.nlm.nih.gov/pubmed/28923003 http://dx.doi.org/10.1186/s12859-017-1832-4 |
_version_ | 1783264842021339136 |
---|---|
author | Hu, Wei Zaveri, Amrapali Qiu, Honglei Dumontier, Michel |
author_facet | Hu, Wei Zaveri, Amrapali Qiu, Honglei Dumontier, Michel |
author_sort | Hu, Wei |
collection | PubMed |
description | BACKGROUND: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide the submitter regarding the metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. METHODS: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. RESULTS: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. As a result, the algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within the 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were not related to their cluster. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63). CONCLUSION: Our algorithm identified keys that were similar to each other and grouped them together. The intuition underpinning cleaning by clustering is that dividing keys into different clusters resolves the scalability issues for data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types. |
format | Online Article Text |
id | pubmed-5604298 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5604298 2017-09-21 Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata Hu, Wei Zaveri, Amrapali Qiu, Honglei Dumontier, Michel BMC Bioinformatics Methodology Article BACKGROUND: The ability to efficiently search and filter datasets depends on access to high-quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide the submitter regarding the metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues including redundancy, heterogeneity, inconsistency, and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data. METHODS: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together. RESULTS: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. As a result, the algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within the 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were not related to their cluster. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-Score (0.63). CONCLUSION: Our algorithm identified keys that were similar to each other and grouped them together. The intuition underpinning cleaning by clustering is that dividing keys into different clusters resolves the scalability issues for data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types. BioMed Central 2017-09-18 /pmc/articles/PMC5604298/ /pubmed/28923003 http://dx.doi.org/10.1186/s12859-017-1832-4 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Hu, Wei Zaveri, Amrapali Qiu, Honglei Dumontier, Michel Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title_full | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title_fullStr | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title_full_unstemmed | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title_short | Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
title_sort | cleaning by clustering: methodology for addressing data quality issues in biomedical metadata |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5604298/ https://www.ncbi.nlm.nih.gov/pubmed/28923003 http://dx.doi.org/10.1186/s12859-017-1832-4 |
work_keys_str_mv | AT huwei cleaningbyclusteringmethodologyforaddressingdataqualityissuesinbiomedicalmetadata AT zaveriamrapali cleaningbyclusteringmethodologyforaddressingdataqualityissuesinbiomedicalmetadata AT qiuhonglei cleaningbyclusteringmethodologyforaddressingdataqualityissuesinbiomedicalmetadata AT dumontiermichel cleaningbyclusteringmethodologyforaddressingdataqualityissuesinbiomedicalmetadata |
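
The abstract in this record describes comparing metadata keys with similarity measures and grouping similar keys by agglomerative clustering. The sketch below illustrates that general idea only; it is not the authors' algorithm. It assumes a single character-level name similarity from Python's `difflib` as a stand-in for the paper's three measures (name, core concept, value), and a naive average-linkage merge loop that is not scalable. The function names `name_similarity` and `cluster_keys` and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1].

    Assumption: a simple stand-in for a key-name similarity measure.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_keys(keys, threshold=0.6):
    """Naive average-linkage agglomerative clustering of key names.

    Starts with one cluster per key and repeatedly merges the most
    similar pair of clusters until no pair reaches `threshold`.
    Every merge rescans all pairs, so this suits small inputs only.
    """
    clusters = [[k] for k in keys]

    def linkage(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        sims = [name_similarity(a, b) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j by construction).
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


if __name__ == "__main__":
    # Keys that differ only in case end up together; "tissue" stays alone.
    print(cluster_keys(["Age", "age", "tissue"], threshold=0.9))
    # [['Age', 'age'], ['tissue']]
```

Inspecting each resulting cluster by hand is then a much smaller task than scanning all keys at once, which is the scalability intuition the abstract gives for cleaning by clustering.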