Annotation of the Giardia proteome through structure-based homology and machine learning

BACKGROUND: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of s...

Bibliographic Details
Main Authors:	Ansell, Brendan R E; Pope, Bernard J; Georgeson, Peter; Emery-Corbin, Samantha J; Jex, Aaron R
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6312909/ https://www.ncbi.nlm.nih.gov/pubmed/30520990 http://dx.doi.org/10.1093/gigascience/giy150

author	Ansell, Brendan R E; Pope, Bernard J; Georgeson, Peter; Emery-Corbin, Samantha J; Jex, Aaron R
collection	PubMed
description	BACKGROUND: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures. AIMS: Our aim was to predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination. METHODS: We used the I-TASSER suite to predict structural models for ∼5,000 proteins encoded in G. duodenalis and identify their closest empirically-determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier. RESULTS: We identified 1,095 high-confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high-confidence-like proteins yielded substantial new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of limited anti-giardial drugs. CONCLUSION: Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.
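The labelling and evaluation steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `confidence_label` and `auroc`, the placeholder Pfam-style identifiers, and the identity values are all invented for illustration, and the random forest itself is not reproduced. The sketch only shows how a model might be labelled high-confidence when query and reference share a Pfam domain, and how a single metric such as amino acid identity could be scored by the area under the ROC curve.

```python
from typing import Iterable, Sequence


def confidence_label(query_pfam: Iterable[str], reference_pfam: Iterable[str]) -> int:
    """Return 1 (high-confidence) if query and reference share a Pfam domain, else 0."""
    return int(bool(set(query_pfam) & set(reference_pfam)))


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; tied pairs count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up records (NOT data from the study):
# (query Pfam IDs, reference Pfam IDs, query-reference amino acid identity)
toy_models = [
    ({"PF00069"}, {"PF00069"}, 0.42),
    ({"PF00085"}, {"PF00085", "PF13098"}, 0.31),
    (set(), {"PF00071"}, 0.12),
    ({"PF00400"}, {"PF00012"}, 0.35),
    (set(), set(), 0.08),
]

labels = [confidence_label(q, r) for q, r, _ in toy_models]
identity = [ident for _, _, ident in toy_models]
toy_auroc = auroc(identity, labels)
print(labels, round(toy_auroc, 3))
```

Under the same assumptions, other per-model metrics could be passed to `auroc` in place of `identity` to compare individual predictors before combining them in a classifier.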
format	Online Article Text
id	pubmed-6312909
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6312909 2019-01-10 Annotation of the Giardia proteome through structure-based homology and machine learning. Gigascience. Research. Oxford University Press 2018-12-06 /pmc/articles/PMC6312909/ /pubmed/30520990 http://dx.doi.org/10.1093/gigascience/giy150 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Annotation of the Giardia proteome through structure-based homology and machine learning
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6312909/ https://www.ncbi.nlm.nih.gov/pubmed/30520990 http://dx.doi.org/10.1093/gigascience/giy150
