Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets
Recent advances in DNA sequencing have expanded our understanding of the molecular basis of genetic disorders and increased the utilization of clinical genomic tests. Given the paucity of evidence to accurately classify each variant and the difficulty of experimentally evaluating its clinical signif...
Main authors: | Evans, Perry; Wu, Chao; Lindy, Amanda; McKnight, Dianalee A.; Lebo, Matthew; Sarmady, Mahdi; Abou Tayoun, Ahmad N. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2019 |
Subjects: | Method |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6633260/ https://www.ncbi.nlm.nih.gov/pubmed/31235655 http://dx.doi.org/10.1101/gr.240994.118 |
_version_ | 1783435717516460032 |
---|---|
author | Evans, Perry; Wu, Chao; Lindy, Amanda; McKnight, Dianalee A.; Lebo, Matthew; Sarmady, Mahdi; Abou Tayoun, Ahmad N. |
author_facet | Evans, Perry; Wu, Chao; Lindy, Amanda; McKnight, Dianalee A.; Lebo, Matthew; Sarmady, Mahdi; Abou Tayoun, Ahmad N. |
author_sort | Evans, Perry |
collection | PubMed |
description | Recent advances in DNA sequencing have expanded our understanding of the molecular basis of genetic disorders and increased the utilization of clinical genomic tests. Given the paucity of evidence to accurately classify each variant and the difficulty of experimentally evaluating its clinical significance, a large number of variants generated by clinical tests are reported as variants of unknown clinical significance. Population-scale variant databases can improve clinical interpretation. Specifically, pathogenicity prediction for novel missense variants can use features describing regional variant constraint. Constrained genomic regions are those that have an unusually low variant count in the general population. Computational methods have been introduced to capture these regions and incorporate them into pathogenicity classifiers, but these methods have yet to be compared on an independent clinical variant data set. Here, we introduce one variant data set derived from clinical sequencing panels and use it to compare the ability of different genomic constraint metrics to determine missense variant pathogenicity. This data set is compiled from 17,071 patients surveyed with clinical genomic sequencing for cardiomyopathy, epilepsy, or RASopathies. We further use this data set to demonstrate the necessity of disease-specific classifiers and to train PathoPredictor, a disease-specific ensemble classifier of pathogenicity based on regional constraint and variant-level features. PathoPredictor achieves an average precision >90% for variants from all 99 tested disease genes while approaching 100% accuracy for some genes. The accumulation of larger clinical variant training data sets can significantly enhance their performance in a disease- and gene-specific manner. |
format | Online Article Text |
id | pubmed-6633260 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6633260 2020-01-01 Genome Res Method Cold Spring Harbor Laboratory Press 2019-07 /pmc/articles/PMC6633260/ /pubmed/31235655 http://dx.doi.org/10.1101/gr.240994.118 Text en © 2019 Evans et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
title | Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6633260/ https://www.ncbi.nlm.nih.gov/pubmed/31235655 http://dx.doi.org/10.1101/gr.240994.118 |