
SAMQA: error classification and validation of high-throughput sequenced read data

BACKGROUND: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this pape...


Bibliographic Details
Main Authors: Robinson, Thomas; Killcoyne, Sarah; Bressler, Ryan; Boyle, John
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309/
https://www.ncbi.nlm.nih.gov/pubmed/21851633
http://dx.doi.org/10.1186/1471-2164-12-419
author Robinson, Thomas
Killcoyne, Sarah
Bressler, Ryan
Boyle, John
collection PubMed
description BACKGROUND: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
format Online
Article
Text
id pubmed-3170309
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3170309 2011-09-10 SAMQA: error classification and validation of high-throughput sequenced read data Robinson, Thomas; Killcoyne, Sarah; Bressler, Ryan; Boyle, John BMC Genomics Software BioMed Central 2011-08-18 /pmc/articles/PMC3170309/ /pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419 Text en Copyright ©2011 Robinson et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title SAMQA: error classification and validation of high-throughput sequenced read data
topic Software