Identification of genomic indels and structural variations using split reads

BACKGROUND: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have...

Bibliographic Details
Main authors: Zhang, Zhengdong D; Du, Jiang; Lam, Hugo; Abyzov, Alex; Urban, Alexander E; Snyder, Michael; Gerstein, Mark
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161018/
https://www.ncbi.nlm.nih.gov/pubmed/21787423
http://dx.doi.org/10.1186/1471-2164-12-375
author Zhang, Zhengdong D
Du, Jiang
Lam, Hugo
Abyzov, Alex
Urban, Alexander E
Snyder, Michael
Gerstein, Mark
collection PubMed
description BACKGROUND: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection.
RESULTS: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.
CONCLUSIONS: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
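The calibration idea in the abstract can be illustrated with a small sketch. It assumes that simulations provide true-positive, false-positive, and false-negative counts for one event class, and that an observed call count is corrected as observed × PPV / sensitivity. That correction formula, the names `Calibration`, `calibrate`, and `corrected_count`, and every number below are illustrative assumptions, not code or values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Sensitivity and PPV measured on simulated reads for one event class."""
    sensitivity: float  # TP / (TP + FN): fraction of real events that are called
    ppv: float          # TP / (TP + FP): fraction of calls that are real


def calibrate(true_pos: int, false_pos: int, false_neg: int) -> Calibration:
    """Derive sensitivity and PPV from counts obtained on simulated data."""
    real_events = true_pos + false_neg
    calls = true_pos + false_pos
    if real_events == 0 or calls == 0:
        raise ValueError("need at least one simulated event and one call")
    return Calibration(true_pos / real_events, true_pos / calls)


def corrected_count(observed_calls: int, cal: Calibration) -> float:
    """Estimate the true number of events behind an observed call count.

    observed * PPV removes the expected false positives; dividing by
    sensitivity adds back the events the caller is expected to miss.
    """
    if cal.sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    return observed_calls * cal.ppv / cal.sensitivity


# Hypothetical simulation outcome for one event class (made-up numbers).
cal = calibrate(true_pos=75, false_pos=25, false_neg=75)
estimate = corrected_count(400, cal)  # 400 hypothetical observed calls
print(cal, estimate)
```

Applying a separate `Calibration` per event class (for example long deletions versus short insertions) and summing the corrected counts is one way such a bias-adjusted total could be assembled.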
format Online
Article
Text
id pubmed-3161018
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3161018 2011-08-25 BMC Genomics Methodology Article BioMed Central 2011-07-25 /pmc/articles/PMC3161018/ /pubmed/21787423 http://dx.doi.org/10.1186/1471-2164-12-375 Text en
Copyright ©2011 Zhang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identification of genomic indels and structural variations using split reads
topic Methodology Article