
Identification of Low-Confidence Regions in the Pig Reference Genome (Sscrofa10.2)

Many applications of high throughput sequencing rely on the availability of an accurate reference genome. Variant calling often produces large data sets that cannot be realistically validated and which may contain large numbers of false-positives. Errors in the reference assembly increase the number...


Bibliographic Details
Main Authors: Warr, Amanda; Robert, Christelle; Hume, David; Archibald, Alan L.; Deeb, Nader; Watson, Mick
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2015
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4662242/
https://www.ncbi.nlm.nih.gov/pubmed/26640477
http://dx.doi.org/10.3389/fgene.2015.00338
author Warr, Amanda
Robert, Christelle
Hume, David
Archibald, Alan L.
Deeb, Nader
Watson, Mick
author_sort Warr, Amanda
collection PubMed
description Many applications of high throughput sequencing rely on the availability of an accurate reference genome. Variant calling often produces large data sets that cannot be realistically validated and which may contain large numbers of false-positives. Errors in the reference assembly increase the number of false-positives. While resources are available to aid in the filtering of variants from human data, for other species these do not yet exist and strict filtering techniques must be employed which are more likely to exclude true-positives. This work assesses the accuracy of the pig reference genome (Sscrofa10.2) using whole genome sequencing reads from the Duroc sow whose genome the assembly was based on. Indicators of structural variation including high regional coverage, unexpected insert sizes, improper pairing and homozygous variants were used to identify low quality (LQ) regions of the assembly. Low coverage (LC) regions were also identified and analyzed separately. The LQ regions covered 13.85% of the genome, the LC regions covered 26.6% of the genome and combined (LQLC) they covered 33.07% of the genome. Over half of dbSNP variants were located in the LQLC regions. Of copy number variable regions identified in a previous study, 86.3% were located in the LQLC regions. The regions were also enriched for gene predictions from RNA-seq data with 42.98% falling in the LQLC regions. Excluding variants in the LQ, LC, or LQLC from future analyses will help reduce the number of false-positive variant calls. Researchers using WGS data should be aware that the current pig reference genome does not give an accurate representation of the copy number of alleles in the original Duroc sow’s genome.
format Online
Article
Text
id pubmed-4662242
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4662242 2015-12-04 Front Genet Genetics Frontiers Media S.A.
2015-11-27 /pmc/articles/PMC4662242/ /pubmed/26640477 http://dx.doi.org/10.3389/fgene.2015.00338 Text en Copyright © 2015 Warr, Robert, Hume, Archibald, Deeb and Watson. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Identification of Low-Confidence Regions in the Pig Reference Genome (Sscrofa10.2)
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4662242/
https://www.ncbi.nlm.nih.gov/pubmed/26640477
http://dx.doi.org/10.3389/fgene.2015.00338