ALE: automated label extraction from GEO metadata
BACKGROUND: NCBI’s Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual descri...
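Main Authors: | Giles, Cory B.; Brown, Chase A.; Ripperger, Michael; Dennis, Zane; Roopnarinesingh, Xiavan; Porter, Hunter; Perz, Aleksandra; Wren, Jonathan D. |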
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5751806/ https://www.ncbi.nlm.nih.gov/pubmed/29297276 http://dx.doi.org/10.1186/s12859-017-1888-1 |
_version_ | 1783290022960562176 |
---|---|
author | Giles, Cory B. Brown, Chase A. Ripperger, Michael Dennis, Zane Roopnarinesingh, Xiavan Porter, Hunter Perz, Aleksandra Wren, Jonathan D. |
author_facet | Giles, Cory B. Brown, Chase A. Ripperger, Michael Dennis, Zane Roopnarinesingh, Xiavan Porter, Hunter Perz, Aleksandra Wren, Jonathan D. |
author_sort | Giles, Cory B. |
collection | PubMed |
description | BACKGROUND: NCBI’s Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because it ensures standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects with the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. RESULTS: Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples and compare this to the text-based method. CONCLUSION: Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples. |
format | Online Article Text |
id | pubmed-5751806 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-57518062018-01-05 ALE: automated label extraction from GEO metadata Giles, Cory B. Brown, Chase A. Ripperger, Michael Dennis, Zane Roopnarinesingh, Xiavan Porter, Hunter Perz, Aleksandra Wren, Jonathan D. BMC Bioinformatics Research BACKGROUND: NCBI’s Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because it ensures standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the data available do not contain explicit text informing the researcher about the age and gender of the subjects with the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. RESULTS: Our analysis shows only 26% of metadata text contains information about gender and 21% about age. In order to ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels, based upon gene expression of the samples and compare this to the text-based method. 
CONCLUSION: Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach as well as machine learning. We show the two methods together improve accuracy of label assignment to GEO samples. BioMed Central 2017-12-28 /pmc/articles/PMC5751806/ /pubmed/29297276 http://dx.doi.org/10.1186/s12859-017-1888-1 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Giles, Cory B. Brown, Chase A. Ripperger, Michael Dennis, Zane Roopnarinesingh, Xiavan Porter, Hunter Perz, Aleksandra Wren, Jonathan D. ALE: automated label extraction from GEO metadata |
title | ALE: automated label extraction from GEO metadata |
title_full | ALE: automated label extraction from GEO metadata |
title_fullStr | ALE: automated label extraction from GEO metadata |
title_full_unstemmed | ALE: automated label extraction from GEO metadata |
title_short | ALE: automated label extraction from GEO metadata |
title_sort | ale: automated label extraction from geo metadata |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5751806/ https://www.ncbi.nlm.nih.gov/pubmed/29297276 http://dx.doi.org/10.1186/s12859-017-1888-1 |
work_keys_str_mv | AT gilescoryb aleautomatedlabelextractionfromgeometadata AT brownchasea aleautomatedlabelextractionfromgeometadata AT rippergermichael aleautomatedlabelextractionfromgeometadata AT denniszane aleautomatedlabelextractionfromgeometadata AT roopnarinesinghxiavan aleautomatedlabelextractionfromgeometadata AT porterhunter aleautomatedlabelextractionfromgeometadata AT perzaleksandra aleautomatedlabelextractionfromgeometadata AT wrenjonathand aleautomatedlabelextractionfromgeometadata |
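
The abstract describes extracting age and gender labels from free-text sample metadata with a heuristic approach. As a rough illustration of what such a text heuristic might look like, the sketch below applies regular expressions to `key: value` strings of the kind found in sample descriptions. The patterns, the function names (`extract_gender`, `extract_age_years`), and the choice to treat an age without a unit as years are all assumptions made for this example; they are not the rules used by ALE.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not the rules used by ALE.
_GENDER_RE = re.compile(
    r"\b(?:gender|sex)\s*[:=]\s*(male|female|m|f)\b", re.IGNORECASE
)
_AGE_RE = re.compile(
    r"\bage\s*[:=]\s*(\d+(?:\.\d+)?)\s*(years?|yrs?|y|months?|mo|weeks?|wk)?\b",
    re.IGNORECASE,
)

# Conversion factors keyed by the first letter of the matched unit.
_UNIT_TO_YEARS = {"y": 1.0, "m": 1.0 / 12.0, "w": 1.0 / 52.0}


def extract_gender(text: str) -> Optional[str]:
    """Return 'male' or 'female' if the text states it explicitly, else None."""
    match = _GENDER_RE.search(text)
    if match is None:
        return None
    return "male" if match.group(1).lower() in ("male", "m") else "female"


def extract_age_years(text: str) -> Optional[float]:
    """Return the stated age in years, or None if no age field is found.

    Assumption: an age given without a unit is interpreted as years.
    """
    match = _AGE_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "years").lower()
    return value * _UNIT_TO_YEARS[unit[0]]


if __name__ == "__main__":
    sample = "tissue: liver; gender: female; age: 45 years"
    print(extract_gender(sample), extract_age_years(sample))
```

Samples for which both functions return `None` correspond to the cases the abstract mentions, where the metadata text carries no explicit age or gender information and an expression-based predictor would be needed instead.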