
SAMQA: error classification and validation of high-throughput sequenced read data

BACKGROUND: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data meet the minimum standard required for use in downstream analysis. In this pape...

Bibliographic Details
Main authors:	Robinson, Thomas; Killcoyne, Sarah; Bressler, Ryan; Boyle, John
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309/ https://www.ncbi.nlm.nih.gov/pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419

author	Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John
author_facet	Robinson, Thomas Killcoyne, Sarah Bressler, Ryan Boyle, John
author_sort	Robinson, Thomas
collection	PubMed
description	BACKGROUND: Advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data meet the minimum standard required for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data. RESULTS: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server. CONCLUSIONS: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
format	Online Article Text
id	pubmed-3170309
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3170309 2011-09-10 BMC Genomics Software BioMed Central 2011-08-18 /pmc/articles/PMC3170309/ /pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419 Text en Copyright ©2011 Robinson et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	SAMQA: error classification and validation of high-throughput sequenced read data
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309/ https://www.ncbi.nlm.nih.gov/pubmed/21851633 http://dx.doi.org/10.1186/1471-2164-12-419
