
Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads

The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight’s half brother Bravo. Since t...


Bibliographic Details
Main Authors:	Rebolledo-Mendez, Jovan; Hestand, Matthew S.; Coleman, Stephen J.; Zeng, Zheng; Orlando, Ludovic; MacLeod, James N.; Kalbfleisch, Ted
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479572/ https://www.ncbi.nlm.nih.gov/pubmed/26107638 http://dx.doi.org/10.1371/journal.pone.0126852

author	Rebolledo-Mendez, Jovan; Hestand, Matthew S.; Coleman, Stephen J.; Zeng, Zheng; Orlando, Ludovic; MacLeod, James N.; Kalbfleisch, Ted
collection	PubMed
description	The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight’s half brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference has accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions, has become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects’ and Twilight’s genomes or are due to errors in the reference. EquCab2 is regarded as “The Twilight Assembly.” The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset composed of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo’s BAC end sequences. Our analysis identifies 720,843 homozygous discrepancies between new, high-throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low-coverage alignments.
format	Online Article Text
id	pubmed-4479572
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4479572 2015-06-29 Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads Rebolledo-Mendez, Jovan Hestand, Matthew S. Coleman, Stephen J. Zeng, Zheng Orlando, Ludovic MacLeod, James N. Kalbfleisch, Ted PLoS One Research Article Public Library of Science 2015-06-24 /pmc/articles/PMC4479572/ /pubmed/26107638 http://dx.doi.org/10.1371/journal.pone.0126852 Text en © 2015 Rebolledo-Mendez et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads
topic	Research Article
