Probabilistic base calling of Solexa sequencing data

BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair pro...

Bibliographic Details
Main Authors: Rougemont, Jacques; Amzallag, Arnaud; Iseli, Christian; Farinelli, Laurent; Xenarios, Ioannis; Naef, Felix
Format: Text
Language: English
Published: BioMed Central 2008
Subjects: Methodology Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2575221/
https://www.ncbi.nlm.nih.gov/pubmed/18851737
http://dx.doi.org/10.1186/1471-2105-9-431
author Rougemont, Jacques
Amzallag, Arnaud
Iseli, Christian
Farinelli, Laurent
Xenarios, Ioannis
Naef, Felix
collection PubMed
description BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
format Text
id pubmed-2575221
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2575221 2008-10-29 Probabilistic base calling of Solexa sequencing data. Rougemont, Jacques; Amzallag, Arnaud; Iseli, Christian; Farinelli, Laurent; Xenarios, Ioannis; Naef, Felix. BMC Bioinformatics. Methodology Article. BioMed Central 2008-10-13. /pmc/articles/PMC2575221/ /pubmed/18851737 http://dx.doi.org/10.1186/1471-2105-9-431 Text en. Copyright © 2008 Rougemont et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Probabilistic base calling of Solexa sequencing data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2575221/
https://www.ncbi.nlm.nih.gov/pubmed/18851737
http://dx.doi.org/10.1186/1471-2105-9-431
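
The description field above mentions two ideas: coding ambiguous bases with IUPAC symbols, and selecting sub-tags with a score based on information content. The Python sketch below illustrates both in a simplified form. It is not the authors' algorithm and not the interface of their R package. The per-base probabilities are taken as given (the record does not say how the model-based clustering produces them), and the function names, the `mass` threshold, the `penalty` value, and the maximum-sum window rule are all assumptions made for illustration.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_call(probs, mass=0.9):
    """Return the IUPAC symbol for the smallest set of most probable bases
    whose total probability reaches `mass` (an assumed rule)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = set(), 0.0
    for base, p in ranked:
        chosen.add(base)
        total += p
        if total >= mass:
            break
    return IUPAC[frozenset(chosen)]


def information_bits(probs):
    """Information content in bits: 2 minus the Shannon entropy of the
    four-base distribution (2 for a certain call, 0 for a uniform one)."""
    return 2.0 + sum(p * math.log2(p) for p in probs.values() if p > 0)


def best_subtag(read_probs, penalty=1.0):
    """Return (start, end), half-open, of the contiguous window that maximises
    sum(information_bits - penalty), found with a single Kadane-style scan.
    `penalty` is a hypothetical per-base cost that makes uncertain positions
    count against a window."""
    best, best_score = (0, 0), 0.0
    start, running = 0, 0.0
    for i, probs in enumerate(read_probs):
        if running <= 0.0:
            # A non-positive prefix cannot help: restart the window here.
            start, running = i, 0.0
        running += information_bits(probs) - penalty
        if running > best_score:
            best_score, best = running, (start, i + 1)
    return best


# Usage on a made-up six-cycle read whose quality degrades toward the end.
confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
ambiguous = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
read = [confident, confident, ambiguous, confident, uniform, uniform]

calls = "".join(iupac_call(p) for p in read)
start, end = best_subtag(read)
print(calls, calls[start:end])
```

In this toy read, the position split between A and G is kept as a two-base code instead of being forced to a single base, and the trailing uninformative cycles fall outside the selected window.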