Discovery and genotyping of novel sequence insertions in many sequenced individuals

MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational...

Bibliographic Details
Main authors: Kavak, Pınar; Lin, Yen-Yi; Numanagić, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/
https://www.ncbi.nlm.nih.gov/pubmed/28881988
http://dx.doi.org/10.1093/bioinformatics/btx254
_version_ 1783309520069459968
author Kavak, Pınar
Lin, Yen-Yi
Numanagić, Ibrahim
Asghari, Hossein
Güngör, Tunga
Alkan, Can
Hach, Faraz
author_facet Kavak, Pınar
Lin, Yen-Yi
Numanagić, Ibrahim
Asghari, Hossein
Güngör, Tunga
Alkan, Can
Hach, Faraz
author_sort Kavak, Pınar
collection PubMed
description MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for the whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5870608
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5870608 2018-04-05 Discovery and genotyping of novel sequence insertions in many sequenced individuals Kavak, Pınar Lin, Yen-Yi Numanagić, Ibrahim Asghari, Hossein Güngör, Tunga Alkan, Can Hach, Faraz Bioinformatics Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for the whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870608/ /pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
Kavak, Pınar
Lin, Yen-Yi
Numanagić, Ibrahim
Asghari, Hossein
Güngör, Tunga
Alkan, Can
Hach, Faraz
Discovery and genotyping of novel sequence insertions in many sequenced individuals
title Discovery and genotyping of novel sequence insertions in many sequenced individuals
title_full Discovery and genotyping of novel sequence insertions in many sequenced individuals
title_fullStr Discovery and genotyping of novel sequence insertions in many sequenced individuals
title_full_unstemmed Discovery and genotyping of novel sequence insertions in many sequenced individuals
title_short Discovery and genotyping of novel sequence insertions in many sequenced individuals
title_sort discovery and genotyping of novel sequence insertions in many sequenced individuals
topic Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/
https://www.ncbi.nlm.nih.gov/pubmed/28881988
http://dx.doi.org/10.1093/bioinformatics/btx254
work_keys_str_mv AT kavakpınar discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT linyenyi discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT numanagicibrahim discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT asgharihossein discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT gungortunga discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT alkancan discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals
AT hachfaraz discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals