
Reference-Free Validation of Short Read Data

BACKGROUND: High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of so...


Bibliographic Details
Main authors: Schröder, Jan; Bailey, James; Conway, Thomas; Zobel, Justin
Format: Text
Language: English
Published: Public Library of Science 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943903/
https://www.ncbi.nlm.nih.gov/pubmed/20877643
http://dx.doi.org/10.1371/journal.pone.0012681
collection PubMed
description BACKGROUND: High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked. RESULTS: We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to a wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others. CONCLUSIONS: The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.
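The abstract names per-position base biases and overrepresented k-mers as signals that can be measured without a reference. The sketch below is only an illustration of those two kinds of count, not the authors' method: the function names `per_position_base_counts` and `kmer_counts` and the toy reads are hypothetical, and real data would be parsed from a read file rather than a hard-coded list.

```python
from collections import Counter


def per_position_base_counts(reads):
    """Return a list of Counters, one per read position, tallying base calls."""
    counts = []
    for read in reads:
        for i, base in enumerate(read):
            if i == len(counts):
                # First time this position is seen: start a new tally.
                counts.append(Counter())
            counts[i][base] += 1
    return counts


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input (hypothetical); not taken from the article's data sets.
reads = ["ACGTAC", "ACGTTT", "AAAAAA"]
position_counts = per_position_base_counts(reads)
kmers = kmer_counts(reads, 3)

# A strongly skewed tally at one position, or a poly-base k-mer that
# dominates the counts, is the kind of signal the abstract describes.
print(position_counts[0].most_common())
print(kmers.most_common(3))
```

Comparing such tallies against what a uniform or composition-adjusted model would predict is one plausible way to turn raw counts into a bias measure; the article itself should be consulted for the statistics the authors actually use.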
format Text
id pubmed-2943903
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2943903 2010-09-28 Reference-Free Validation of Short Read Data Schröder, Jan Bailey, James Conway, Thomas Zobel, Justin PLoS One Research Article Public Library of Science 2010-09-22 /pmc/articles/PMC2943903/ /pubmed/20877643 http://dx.doi.org/10.1371/journal.pone.0012681 Text en Schröder et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
topic Research Article