SAMQA: error classification and validation of high-throughput sequenced read data
BACKGROUND: The advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this pape...
Main Authors: | Robinson, Thomas; Killcoyne, Sarah; Bressler, Ryan; Boyle, John |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309/ https://www.ncbi.nlm.nih.gov/pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419 |
_version_ | 1782211609019547648 |
---|---|
author | Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John |
author_facet | Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John |
author_sort | Robinson, Thomas |
collection | PubMed |
description | BACKGROUND: The advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type. |
format | Online Article Text |
id | pubmed-3170309 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-31703092011-09-10 SAMQA: error classification and validation of high-throughput sequenced read data Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John BMC Genomics Software BACKGROUND: The advances in high-throughput sequencing technologies and growth in data sizes has highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds gigabytes of samples regardless of coverage or sample type. BioMed Central 2011-08-18 /pmc/articles/PMC3170309/ /pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419 Text en Copyright ©2011 Robinson et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John SAMQA: error classification and validation of high-throughput sequenced read data |
title | SAMQA: error classification and validation of high-throughput sequenced read data |
title_full | SAMQA: error classification and validation of high-throughput sequenced read data |
title_fullStr | SAMQA: error classification and validation of high-throughput sequenced read data |
title_full_unstemmed | SAMQA: error classification and validation of high-throughput sequenced read data |
title_short | SAMQA: error classification and validation of high-throughput sequenced read data |
title_sort | samqa: error classification and validation of high-throughput sequenced read data |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309/ https://www.ncbi.nlm.nih.gov/pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419 |
work_keys_str_mv | AT robinsonthomas samqaerrorclassificationandvalidationofhighthroughputsequencedreaddata AT killcoynesarah samqaerrorclassificationandvalidationofhighthroughputsequencedreaddata AT bresslerryan samqaerrorclassificationandvalidationofhighthroughputsequencedreaddata AT boylejohn samqaerrorclassificationandvalidationofhighthroughputsequencedreaddata |