
Accurate indel prediction using paired-end short reads

BACKGROUND: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an...


Bibliographic Details
Main Authors: Grimm, Dominik, Hagmann, Jörg, Koenig, Daniel, Weigel, Detlef, Borgwardt, Karsten
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3614465/
https://www.ncbi.nlm.nih.gov/pubmed/23442375
http://dx.doi.org/10.1186/1471-2164-14-132
author Grimm, Dominik
Hagmann, Jörg
Koenig, Daniel
Weigel, Detlef
Borgwardt, Karsten
collection PubMed
description BACKGROUND: One of the major open challenges in next-generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives. RESULTS: Here, we present a machine-learning-based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org) in Arabidopsis thaliana. CONCLUSION: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.
format Online
Article
Text
id pubmed-3614465
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3614465 2013-04-05 Accurate indel prediction using paired-end short reads Grimm, Dominik Hagmann, Jörg Koenig, Daniel Weigel, Detlef Borgwardt, Karsten BMC Genomics Methodology Article BioMed Central 2013-02-27 /pmc/articles/PMC3614465/ /pubmed/23442375 http://dx.doi.org/10.1186/1471-2164-14-132 Text en Copyright © 2013 Grimm et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Accurate indel prediction using paired-end short reads
topic Methodology Article