HIPPI: highly accurate protein family classification with ensembles of HMMs

BACKGROUND: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of...

Bibliographic Details
Main authors: Nguyen, Nam-phuong; Nute, Michael; Mirarab, Siavash; Warnow, Tandy
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5123343/
https://www.ncbi.nlm.nih.gov/pubmed/28185571
http://dx.doi.org/10.1186/s12864-016-3097-0
author Nguyen, Nam-phuong
Nute, Michael
Mirarab, Siavash
Warnow, Tandy
author_facet Nguyen, Nam-phuong
Nute, Michael
Mirarab, Siavash
Warnow, Tandy
author_sort Nguyen, Nam-phuong
collection PubMed
description BACKGROUND: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics. RESULTS: We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy. CONCLUSION: HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile Hidden Markov models can better represent multiple sequence alignments than a single profile Hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile Hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-3097-0) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5123343
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5123343 2016-12-06 HIPPI: highly accurate protein family classification with ensembles of HMMs Nguyen, Nam-phuong Nute, Michael Mirarab, Siavash Warnow, Tandy BMC Genomics Research BioMed Central 2016-11-11 /pmc/articles/PMC5123343/ /pubmed/28185571 http://dx.doi.org/10.1186/s12864-016-3097-0 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Nguyen, Nam-phuong
Nute, Michael
Mirarab, Siavash
Warnow, Tandy
HIPPI: highly accurate protein family classification with ensembles of HMMs
title HIPPI: highly accurate protein family classification with ensembles of HMMs
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5123343/
https://www.ncbi.nlm.nih.gov/pubmed/28185571
http://dx.doi.org/10.1186/s12864-016-3097-0