Accurate indel prediction using paired-end short reads
BACKGROUND: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an...
Main authors: | Grimm, Dominik; Hagmann, Jörg; Koenig, Daniel; Weigel, Detlef; Borgwardt, Karsten |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3614465/ https://www.ncbi.nlm.nih.gov/pubmed/23442375 http://dx.doi.org/10.1186/1471-2164-14-132 |
_version_ | 1782264845643546624 |
---|---|
author | Grimm, Dominik; Hagmann, Jörg; Koenig, Daniel; Weigel, Detlef; Borgwardt, Karsten |
author_facet | Grimm, Dominik; Hagmann, Jörg; Koenig, Daniel; Weigel, Detlef; Borgwardt, Karsten |
author_sort | Grimm, Dominik |
collection | PubMed |
description | BACKGROUND: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives. RESULTS: Here, we present a machine learning based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project ( http://www.1001genomes.org) in Arabidopsis thaliana. CONCLUSION: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/. |
format | Online Article Text |
id | pubmed-3614465 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-36144652013-04-05 Accurate indel prediction using paired-end short reads Grimm, Dominik Hagmann, Jörg Koenig, Daniel Weigel, Detlef Borgwardt, Karsten BMC Genomics Methodology Article BACKGROUND: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives. RESULTS: Here, we present a machine learning based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project ( http://www.1001genomes.org) in Arabidopsis thaliana. CONCLUSION: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/. BioMed Central 2013-02-27 /pmc/articles/PMC3614465/ /pubmed/23442375 http://dx.doi.org/10.1186/1471-2164-14-132 Text en Copyright © 2013 Grimm et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Grimm, Dominik Hagmann, Jörg Koenig, Daniel Weigel, Detlef Borgwardt, Karsten Accurate indel prediction using paired-end short reads |
title | Accurate indel prediction using paired-end short reads |
title_full | Accurate indel prediction using paired-end short reads |
title_fullStr | Accurate indel prediction using paired-end short reads |
title_full_unstemmed | Accurate indel prediction using paired-end short reads |
title_short | Accurate indel prediction using paired-end short reads |
title_sort | accurate indel prediction using paired-end short reads |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3614465/ https://www.ncbi.nlm.nih.gov/pubmed/23442375 http://dx.doi.org/10.1186/1471-2164-14-132 |
work_keys_str_mv | AT grimmdominik accurateindelpredictionusingpairedendshortreads AT hagmannjorg accurateindelpredictionusingpairedendshortreads AT koenigdaniel accurateindelpredictionusingpairedendshortreads AT weigeldetlef accurateindelpredictionusingpairedendshortreads AT borgwardtkarsten accurateindelpredictionusingpairedendshortreads |