
Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and the poor homology of novel extremophile genomes pose significant challenges for the attribution of functions to the c...


Bibliographic Details
Main authors: Grötzinger, Stefan W., Alam, Intikhab, Ba Alawi, Wail, Bajic, Vladimir B., Stingl, Ulrich, Eppinger, Jörg
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2014
Subjects: Microbiology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3985023/
https://www.ncbi.nlm.nih.gov/pubmed/24778629
http://dx.doi.org/10.3389/fmicb.2014.00134
_version_ 1782311522019573760
author Grötzinger, Stefan W.
Alam, Intikhab
Ba Alawi, Wail
Bajic, Vladimir B.
Stingl, Ulrich
Eppinger, Jörg
author_facet Grötzinger, Stefan W.
Alam, Intikhab
Ba Alawi, Wail
Bajic, Vladimir B.
Stingl, Ulrich
Eppinger, Jörg
author_sort Grötzinger, Stefan W.
collection PubMed
description Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and the poor homology of novel extremophile genomes pose significant challenges for attributing functions to the identified coding sequences. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO). Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO) terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected for their scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO terms and 49 consensus patterns. A subset of INDIGO sequences, consisting of 58 SAGs from six different taxa of bacteria and archaea, was selected from six different brine pool environments. These SAGs code for 74,516 genes, which were independently scanned for the GO terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website.
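The reliability rule stated in the abstract (a hit counts as reliable only if at least two relevant descriptors, GO terms and/or PROSITE consensus patterns, are present) can be illustrated with a minimal Python sketch. This is not the authors' script from the INDIGO website. The `Gene` record layout, the function names, and all identifiers in the example data are hypothetical placeholders, and the sketch assumes that each gene already carries its annotated GO terms and PROSITE IDs.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Gene:
    """One annotated coding sequence (hypothetical record layout)."""
    gene_id: str
    go_terms: Set[str] = field(default_factory=set)     # InterPro-derived GO terms
    prosite_ids: Set[str] = field(default_factory=set)  # annotated PROSITE IDs


def relevant_descriptors(
    gene: Gene, target_go: Set[str], target_prosite: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Apply the two filters independently.

    Returns (profile hits, pattern hits): the gene's GO terms and
    PROSITE IDs that belong to the target lists.
    """
    return gene.go_terms & target_go, gene.prosite_ids & target_prosite


def is_reliable(
    gene: Gene,
    target_go: Set[str],
    target_prosite: Set[str],
    min_descriptors: int = 2,
) -> bool:
    """A gene is reliable if it carries at least `min_descriptors`
    relevant descriptors, counting GO terms and PROSITE IDs together."""
    profile_hits, pattern_hits = relevant_descriptors(gene, target_go, target_prosite)
    return len(profile_hits) + len(pattern_hits) >= min_descriptors


def reliable_gene_ids(
    genes: List[Gene], target_go: Set[str], target_prosite: Set[str]
) -> List[str]:
    """Return the IDs of all genes that pass the reliability rule."""
    return [g.gene_id for g in genes if is_reliable(g, target_go, target_prosite)]


# Placeholder identifiers for illustration only; none are taken from the paper.
TARGET_GO = {"GO:EXAMPLE1", "GO:EXAMPLE2"}
TARGET_PROSITE = {"PS:EXAMPLE1"}

GENES = [
    Gene("gene_a", {"GO:EXAMPLE1"}, set()),                 # one descriptor only
    Gene("gene_b", {"GO:EXAMPLE1"}, {"PS:EXAMPLE1"}),       # profile + pattern
    Gene("gene_c", {"GO:EXAMPLE1", "GO:EXAMPLE2"}, set()),  # two profile hits
    Gene("gene_d", {"GO:OTHER"}, {"PS:OTHER"}),             # nothing relevant
]

RELIABLE = reliable_gene_ids(GENES, TARGET_GO, TARGET_PROSITE)
```

In this sketch a gene with two GO terms and no pattern hit still passes, which follows the "and/or" wording of the abstract. Whether the published scripts weight profile and pattern hits differently is not stated in this record.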
format Online
Article
Text
id pubmed-3985023
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-3985023 2014-04-28 Front Microbiol Microbiology Frontiers Media S.A. 2014-04-07 /pmc/articles/PMC3985023/ /pubmed/24778629 http://dx.doi.org/10.3389/fmicb.2014.00134 Text en Copyright © 2014 Grötzinger, Alam, Ba Alawi, Bajic, Stingl and Eppinger. http://creativecommons.org/licenses/by/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)
topic Microbiology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3985023/
https://www.ncbi.nlm.nih.gov/pubmed/24778629
http://dx.doi.org/10.3389/fmicb.2014.00134