Discovery and genotyping of novel sequence insertions in many sequenced individuals
MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational...
Main Authors: | Kavak, Pınar; Lin, Yen-Yi; Numanagić, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/ https://www.ncbi.nlm.nih.gov/pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254 |
_version_ | 1783309520069459968 |
---|---|
author | Kavak, Pınar Lin, Yen-Yi Numanagić, Ibrahim Asghari, Hossein Güngör, Tunga Alkan, Can Hach, Faraz |
author_facet | Kavak, Pınar Lin, Yen-Yi Numanagić, Ibrahim Asghari, Hossein Güngör, Tunga Alkan, Can Hach, Faraz |
author_sort | Kavak, Pınar |
collection | PubMed |
description | MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for the whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-5870608 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5870608 2018-04-05 Discovery and genotyping of novel sequence insertions in many sequenced individuals Kavak, Pınar Lin, Yen-Yi Numanagić, Ibrahim Asghari, Hossein Güngör, Tunga Alkan, Can Hach, Faraz Bioinformatics Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for the whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to “collectively” obtain a very high coverage dataset to accurately find insertions common in a given population. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870608/ /pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 Kavak, Pınar Lin, Yen-Yi Numanagić, Ibrahim Asghari, Hossein Güngör, Tunga Alkan, Can Hach, Faraz Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title | Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title_full | Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title_fullStr | Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title_full_unstemmed | Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title_short | Discovery and genotyping of novel sequence insertions in many sequenced individuals |
title_sort | discovery and genotyping of novel sequence insertions in many sequenced individuals |
topic | Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870608/ https://www.ncbi.nlm.nih.gov/pubmed/28881988 http://dx.doi.org/10.1093/bioinformatics/btx254 |
work_keys_str_mv | AT kavakpınar discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT linyenyi discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT numanagicibrahim discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT asgharihossein discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT gungortunga discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT alkancan discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals AT hachfaraz discoveryandgenotypingofnovelsequenceinsertionsinmanysequencedindividuals |