Read clouds uncover variation in complex regions of the human genome

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Co...

Bibliographic Details
Main Authors: Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E.; West, Robert; Sidow, Arend; Batzoglou, Serafim
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4579342/
https://www.ncbi.nlm.nih.gov/pubmed/26286554
http://dx.doi.org/10.1101/gr.191189.115
author Bishara, Alex
Liu, Yuling
Weng, Ziming
Kashef-Haghighi, Dorna
Newburger, Daniel E.
West, Robert
Sidow, Arend
Batzoglou, Serafim
collection PubMed
description Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.
format Online
Article
Text
id pubmed-4579342
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-4579342 2015-10-01 Genome Res Method
Cold Spring Harbor Laboratory Press 2015-10 /pmc/articles/PMC4579342/ /pubmed/26286554 http://dx.doi.org/10.1101/gr.191189.115 Text en © 2015 Bishara et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
title Read clouds uncover variation in complex regions of the human genome
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4579342/
https://www.ncbi.nlm.nih.gov/pubmed/26286554
http://dx.doi.org/10.1101/gr.191189.115