MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive

MOTIVATION: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the...

Bibliographic Details
Main Authors: Bernstein, Matthew N; Doan, AnHai; Dewey, Colin N
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870770/
https://www.ncbi.nlm.nih.gov/pubmed/28535296
http://dx.doi.org/10.1093/bioinformatics/btx334
_version_ 1783309546605772800
author Bernstein, Matthew N
Doan, AnHai
Dewey, Colin N
collection PubMed
description MOTIVATION: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA. RESULTS: We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. AVAILABILITY AND IMPLEMENTATION: The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5870770
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5870770 2018-03-29 MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive Bernstein, Matthew N Doan, AnHai Dewey, Colin N Bioinformatics Original Papers MOTIVATION: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA. RESULTS: We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. AVAILABILITY AND IMPLEMENTATION: The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-09-15 2017-05-23 /pmc/articles/PMC5870770/ /pubmed/28535296 http://dx.doi.org/10.1093/bioinformatics/btx334 Text en © The Author 2017.
Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive
topic Original Papers