Creation and evaluation of full-text literature-derived, feature-weighted disease models of genetically determined developmental disorders

There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the wide-spread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whe...

Bibliographic Details
Main authors: Yates, T.M, Lain, A, Campbell, J, FitzPatrick, D R, Simpson, T I
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216525/
https://www.ncbi.nlm.nih.gov/pubmed/35670729
http://dx.doi.org/10.1093/database/baac038
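
The full description in this record reports that Human Phenotype Ontology terms were extracted with 76–84% precision and 65–73% recall. As a minimal sketch of how such figures could be computed for a single paper, the example below compares a set of automatically extracted term identifiers against a manually annotated set. The function name `precision_recall` and the identifier sets are hypothetical and purely illustrative; they are not taken from the article, and the authors' actual evaluation procedure may differ.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted terms against a gold set.

    Both arguments are iterables of term identifiers. This is an
    illustrative helper, not the evaluation code used in the article.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers for one paper (assumed data for illustration only).
gold_terms = {"HP:0001250", "HP:0001263", "HP:0000252", "HP:0001249"}
extracted_terms = {"HP:0001250", "HP:0001263", "HP:0000252", "HP:0000486"}

precision, recall = precision_recall(extracted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four extracted terms appear in the manual annotation, and three of the four annotated terms were recovered, so both values are 0.75 for this toy input.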
_version_ 1784731442407800832
author Yates, T.M
Lain, A
Campbell, J
FitzPatrick, D R
Simpson, T I
collection PubMed
description There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the wide-spread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whether the association of a filtered variant at a specific locus is a plausible explanation of the phenotype in the proband is crucial and commonly requires extensive manual literature review by both clinical scientists and clinicians. Access to a database of weighted clinical features extracted from rigorously curated literature would increase the efficiency of this process and facilitate the development of robust phenotypic similarity metrics. However, given the large and rapidly increasing volume of published information, conventional biocuration approaches are becoming impractical. Here, we present a scalable, automated method for the extraction of categorical phenotypic descriptors from the full-text literature. Papers identified through literature review were downloaded and parsed using the Cadmus custom retrieval package. Human Phenotype Ontology terms were extracted using MetaMap, with 76–84% precision and 65–73% recall. Mean terms per paper increased from 9 in title + abstract, to 68 using full text. We demonstrate that these literature-derived disease models plausibly reflect true disease expressivity more accurately than widely used manually curated models, through comparison with prospectively gathered data from the Deciphering Developmental Disorders study. The area under the curve for receiver operating characteristic (ROC) curves increased by 5–10% through the use of literature-derived models. This work shows that scalable automated literature curation increases performance and adds weight to the need for this strategy to be integrated into informatic variant analysis pipelines. Database URL: https://doi.org/10.1093/database/baac038
format Online
Article
Text
id pubmed-9216525
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9216525 2022-06-23 Database (Oxford) Original Article Oxford University Press 2022-06-07 /pmc/articles/PMC9216525/ /pubmed/35670729 http://dx.doi.org/10.1093/database/baac038 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Creation and evaluation of full-text literature-derived, feature-weighted disease models of genetically determined developmental disorders
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216525/
https://www.ncbi.nlm.nih.gov/pubmed/35670729
http://dx.doi.org/10.1093/database/baac038