Recurrent miscalling of missense variation from short-read genome sequence data
BACKGROUND: Short-read resequencing of genomes produces abundant information of the genetic variation of individuals. Due to their numerous nature, these variants are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate i...
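Main authors: | Field, Matthew A.; Burgio, Gaetan; Chuah, Aaron; Al Shekaili, Jalila; Hassan, Batool; Al Sukaiti, Nashat; Foote, Simon J.; Cook, Matthew C.; Andrews, T. Daniel |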
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6631443/ https://www.ncbi.nlm.nih.gov/pubmed/31307400 http://dx.doi.org/10.1186/s12864-019-5863-2 |
_version_ | 1783435518008098816 |
---|---|
author | Field, Matthew A. Burgio, Gaetan Chuah, Aaron Al Shekaili, Jalila Hassan, Batool Al Sukaiti, Nashat Foote, Simon J. Cook, Matthew C. Andrews, T. Daniel |
author_facet | Field, Matthew A. Burgio, Gaetan Chuah, Aaron Al Shekaili, Jalila Hassan, Batool Al Sukaiti, Nashat Foote, Simon J. Cook, Matthew C. Andrews, T. Daniel |
author_sort | Field, Matthew A. |
collection | PubMed |
description | BACKGROUND: Short-read resequencing of genomes produces abundant information of the genetic variation of individuals. Due to their numerous nature, these variants are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these also be carried through into in reference databases of genomic variation. RESULTS: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and thereby are often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2–300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3–5000 recurrent false positive variants per mouse – the number of which increasing with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation. CONCLUSION: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls arising are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome – which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5863-2) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6631443 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6631443 2019-07-24 Recurrent miscalling of missense variation from short-read genome sequence data Field, Matthew A. Burgio, Gaetan Chuah, Aaron Al Shekaili, Jalila Hassan, Batool Al Sukaiti, Nashat Foote, Simon J. Cook, Matthew C. Andrews, T. Daniel BMC Genomics Research BACKGROUND: Short-read resequencing of genomes produces abundant information of the genetic variation of individuals. Due to their numerous nature, these variants are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these also be carried through into in reference databases of genomic variation. RESULTS: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and thereby are often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2–300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3–5000 recurrent false positive variants per mouse – the number of which increasing with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation. CONCLUSION: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls arising are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome – which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5863-2) contains supplementary material, which is available to authorized users. BioMed Central 2019-07-16 /pmc/articles/PMC6631443/ /pubmed/31307400 http://dx.doi.org/10.1186/s12864-019-5863-2 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Field, Matthew A. Burgio, Gaetan Chuah, Aaron Al Shekaili, Jalila Hassan, Batool Al Sukaiti, Nashat Foote, Simon J. Cook, Matthew C. Andrews, T. Daniel Recurrent miscalling of missense variation from short-read genome sequence data |
title | Recurrent miscalling of missense variation from short-read genome sequence data |
title_full | Recurrent miscalling of missense variation from short-read genome sequence data |
title_fullStr | Recurrent miscalling of missense variation from short-read genome sequence data |
title_full_unstemmed | Recurrent miscalling of missense variation from short-read genome sequence data |
title_short | Recurrent miscalling of missense variation from short-read genome sequence data |
title_sort | recurrent miscalling of missense variation from short-read genome sequence data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6631443/ https://www.ncbi.nlm.nih.gov/pubmed/31307400 http://dx.doi.org/10.1186/s12864-019-5863-2 |
work_keys_str_mv | AT fieldmatthewa recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT burgiogaetan recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT chuahaaron recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT alshekailijalila recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT hassanbatool recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT alsukaitinashat recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT footesimonj recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT cookmatthewc recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata AT andrewstdaniel recurrentmiscallingofmissensevariationfromshortreadgenomesequencedata |