
Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the l...


Bibliographic Details
Main authors: Ohta, Tazro; Nakazato, Takeru; Bono, Hidemasa
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5459929/
https://www.ncbi.nlm.nih.gov/pubmed/28449062
http://dx.doi.org/10.1093/gigascience/gix029
author Ohta, Tazro
Nakazato, Takeru
Bono, Hidemasa
collection PubMed
description It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the number of results to a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.
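The thresholding idea in the description can be illustrated with a minimal sketch. It assumes each experiment has already been reduced to a small summary record; the record type `ExperimentQuality`, its fields, the function `select_for_reanalysis`, and the accession labels are hypothetical names chosen for illustration and are not taken from the authors' published data formats or from FastQC itself.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ExperimentQuality:
    """Hypothetical per-experiment quality summary (illustrative fields only)."""
    accession: str            # placeholder experiment label, not a real accession
    mean_base_quality: float  # assumed summary score on the Phred scale
    total_reads: int          # assumed number of sequenced reads


def select_for_reanalysis(
    records: Iterable[ExperimentQuality],
    min_quality: float,
    min_reads: int = 0,
) -> List[ExperimentQuality]:
    """Keep records that meet both thresholds, highest quality first."""
    kept = [
        r for r in records
        if r.mean_base_quality >= min_quality and r.total_reads >= min_reads
    ]
    return sorted(kept, key=lambda r: r.mean_base_quality, reverse=True)


if __name__ == "__main__":
    # Made-up values used only to exercise the filter.
    sample = [
        ExperimentQuality("EXP_A", 35.2, 2_000_000),
        ExperimentQuality("EXP_B", 22.1, 5_000_000),
        ExperimentQuality("EXP_C", 31.0, 400_000),
    ]
    for record in select_for_reanalysis(sample, min_quality=30.0, min_reads=1_000_000):
        print(record.accession)
```

Raising `min_quality` or `min_reads` shrinks the result set, which is the kind of user-controlled narrowing the description calls for; the appropriate cutoffs depend on the reanalysis being planned.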
format Online
Article
Text
id pubmed-5459929
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5459929 2017-07-31 Gigascience Research Oxford University Press 2017-04-25 /pmc/articles/PMC5459929/ /pubmed/28449062 http://dx.doi.org/10.1093/gigascience/gix029 Text en © The Authors 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive
topic Research