Discovery and genotyping of novel sequence insertions in many sequenced individuals

Bibliographic Details
Main authors:	Kavak, Pınar; Lin, Yen-Yi; Numanagić, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2017
Subjects:	Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/ https://www.ncbi.nlm.nih.gov/pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254

author	Kavak, Pınar; Lin, Yen-Yi; Numanagić, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz
collection	PubMed
description	MOTIVATION: Despite recent advances in algorithm design for characterizing structural variation using high-throughput short-read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. Only a handful of algorithms have been developed specifically for novel sequence insertion discovery that can bypass the need for whole-genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-5870608
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5870608 2018-04-05 Bioinformatics Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870608/ /pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254 Text en © The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Discovery and genotyping of novel sequence insertions in many sequenced individuals
topic	Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/ https://www.ncbi.nlm.nih.gov/pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254
