Probabilistic base calling of Solexa sequencing data
BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair pro...
Main Authors: | Rougemont, Jacques; Amzallag, Arnaud; Iseli, Christian; Farinelli, Laurent; Xenarios, Ioannis; Naef, Felix |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2008 |
Subjects: | Methodology Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2575221/ https://www.ncbi.nlm.nih.gov/pubmed/18851737 http://dx.doi.org/10.1186/1471-2105-9-431 |
_version_ | 1782160310045507584 |
---|---|
author | Rougemont, Jacques Amzallag, Arnaud Iseli, Christian Farinelli, Laurent Xenarios, Ioannis Naef, Felix |
author_facet | Rougemont, Jacques Amzallag, Arnaud Iseli, Christian Farinelli, Laurent Xenarios, Ioannis Naef, Felix |
author_sort | Rougemont, Jacques |
collection | PubMed |
description | BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots. |
format | Text |
id | pubmed-2575221 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2575221 2008-10-29 Probabilistic base calling of Solexa sequencing data Rougemont, Jacques Amzallag, Arnaud Iseli, Christian Farinelli, Laurent Xenarios, Ioannis Naef, Felix BMC Bioinformatics Methodology Article BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots. BioMed Central 2008-10-13 /pmc/articles/PMC2575221/ /pubmed/18851737 http://dx.doi.org/10.1186/1471-2105-9-431 Text en Copyright © 2008 Rougemont et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Rougemont, Jacques Amzallag, Arnaud Iseli, Christian Farinelli, Laurent Xenarios, Ioannis Naef, Felix Probabilistic base calling of Solexa sequencing data |
title | Probabilistic base calling of Solexa sequencing data |
title_full | Probabilistic base calling of Solexa sequencing data |
title_fullStr | Probabilistic base calling of Solexa sequencing data |
title_full_unstemmed | Probabilistic base calling of Solexa sequencing data |
title_short | Probabilistic base calling of Solexa sequencing data |
title_sort | probabilistic base calling of solexa sequencing data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2575221/ https://www.ncbi.nlm.nih.gov/pubmed/18851737 http://dx.doi.org/10.1186/1471-2105-9-431 |
work_keys_str_mv | AT rougemontjacques probabilisticbasecallingofsolexasequencingdata AT amzallagarnaud probabilisticbasecallingofsolexasequencingdata AT iselichristian probabilisticbasecallingofsolexasequencingdata AT farinellilaurent probabilisticbasecallingofsolexasequencingdata AT xenariosioannis probabilisticbasecallingofsolexasequencingdata AT naeffelix probabilisticbasecallingofsolexasequencingdata |
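
The description above mentions coding ambiguous bases with IUPAC symbols and selecting sub-tags with a score based on information content. The record does not give the authors' model, thresholds, or scoring formula, so the following Python sketch is only an illustration of the general idea, not the published algorithm or the R package's interface. The function names (`iupac_call`, `information_bits`, `best_subtag`), the `mass` threshold, and the `penalty` parameter are all assumptions introduced here, and the per-base probabilities are taken as given rather than derived from fluorescence intensities.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_call(probs, mass=0.9):
    """Return the IUPAC symbol for the smallest set of most-probable bases
    whose cumulative probability reaches `mass` (a hypothetical threshold)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= mass:
            break
    return IUPAC[frozenset(chosen)]


def information_bits(probs):
    """Information content of one position: 2 bits minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy


def best_subtag(read, penalty=1.0):
    """Return (start, end) of the contiguous sub-tag, end exclusive, that
    maximizes the sum of (information - penalty) over its positions.
    This scoring rule is an assumption made for illustration only."""
    best_score, best_start, best_end = 0.0, 0, 0
    run, start = 0.0, 0
    for i, probs in enumerate(read):
        run += information_bits(probs) - penalty
        if run <= 0:
            # A non-positive running score cannot help a later sub-tag; restart.
            run, start = 0.0, i + 1
        elif run > best_score:
            best_score, best_start, best_end = run, start, i + 1
    return best_start, best_end
```

Under these assumptions, a position with probabilities close to 0.5 for A and 0.5 for G is reported as `R` rather than forced to a single base, and low-information positions at the end of a read fall outside the selected sub-tag.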