Identification of genomic indels and structural variations using split reads
BACKGROUND: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have...
Main authors: | Zhang, Zhengdong D; Du, Jiang; Lam, Hugo; Abyzov, Alex; Urban, Alexander E; Snyder, Michael; Gerstein, Mark |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2011 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161018/ https://www.ncbi.nlm.nih.gov/pubmed/21787423 http://dx.doi.org/10.1186/1471-2164-12-375 |
_version_ | 1782210625474134016 |
---|---|
author | Zhang, Zhengdong D Du, Jiang Lam, Hugo Abyzov, Alex Urban, Alexander E Snyder, Michael Gerstein, Mark |
author_sort | Zhang, Zhengdong D |
collection | PubMed |
description | BACKGROUND: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. RESULTS: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. CONCLUSIONS: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful. |
format | Online Article Text |
id | pubmed-3161018 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3161018 2011-08-25 Identification of genomic indels and structural variations using split reads Zhang, Zhengdong D Du, Jiang Lam, Hugo Abyzov, Alex Urban, Alexander E Snyder, Michael Gerstein, Mark BMC Genomics Methodology Article BioMed Central 2011-07-25 /pmc/articles/PMC3161018/ /pubmed/21787423 http://dx.doi.org/10.1186/1471-2164-12-375 Text en Copyright ©2011 Zhang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Zhang, Zhengdong D Du, Jiang Lam, Hugo Abyzov, Alex Urban, Alexander E Snyder, Michael Gerstein, Mark Identification of genomic indels and structural variations using split reads |
title | Identification of genomic indels and structural variations using split reads |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3161018/ https://www.ncbi.nlm.nih.gov/pubmed/21787423 http://dx.doi.org/10.1186/1471-2164-12-375 |
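
The calibration idea in the abstract (using simulated data to obtain sensitivity and positive predictive value per event class, then correcting observed counts) can be illustrated with a minimal sketch. This is not the SRiC implementation: the function names, the event-class labels, and the `region_fraction` extrapolation from one chromosome to the whole genome are assumptions made for illustration, and any numbers passed in are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Per-class performance measured on simulated reads (hypothetical container)."""
    sensitivity: float  # TP / (TP + FN)
    ppv: float          # TP / (TP + FP)


def calibration_from_counts(tp: int, fp: int, fn: int) -> Calibration:
    """Derive sensitivity and PPV from simulation outcomes for one event class."""
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("need at least one true event and one call")
    return Calibration(sensitivity=tp / (tp + fn), ppv=tp / (tp + fp))


def corrected_count(observed_calls: float, cal: Calibration) -> float:
    """Estimate the number of real events behind a number of observed calls.

    observed * PPV approximates the true positives among the calls;
    dividing by sensitivity accounts for the real events that were missed.
    """
    if cal.sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    return observed_calls * cal.ppv / cal.sensitivity


def estimate_total(observed_by_class: dict, cal_by_class: dict,
                   region_fraction: float) -> float:
    """Sum corrected counts over event classes and scale up to the full genome.

    region_fraction is the assumed share of the genome covered by the
    surveyed region (an illustrative simplification, not from the source).
    """
    if not 0 < region_fraction <= 1:
        raise ValueError("region_fraction must be in (0, 1]")
    in_region = sum(corrected_count(n, cal_by_class[name])
                    for name, n in observed_by_class.items())
    return in_region / region_fraction
```

Correcting each event class separately matters because, as the abstract notes, callers are biased differently across classes (for example, deletions versus insertions), so a single global correction factor would carry those biases into the total.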