Benchmarking the next generation of homology inference tools

Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs...

Bibliographic Details
Main Authors: Saripella, Ganapathi Varma; Sonnhammer, Erik L. L.; Forslund, Kristoffer
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013910/
https://www.ncbi.nlm.nih.gov/pubmed/27256311
http://dx.doi.org/10.1093/bioinformatics/btw305
_version_ 1782452237329498112
author Saripella, Ganapathi Varma
Sonnhammer, Erik L. L.
Forslund, Kristoffer
author_facet Saripella, Ganapathi Varma
Sonnhammer, Erik L. L.
Forslund, Kristoffer
author_sort Saripella, Ganapathi Varma
collection PubMed
description Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark). 
Contact: forslund@embl.de Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5013910
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-50139102016-09-12 Benchmarking the next generation of homology inference tools Saripella, Ganapathi Varma Sonnhammer, Erik L. L. Forslund, Kristoffer Bioinformatics Original Papers Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. 
Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark). Contact: forslund@embl.de Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2016-09-01 2016-06-01 /pmc/articles/PMC5013910/ /pubmed/27256311 http://dx.doi.org/10.1093/bioinformatics/btw305 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Saripella, Ganapathi Varma
Sonnhammer, Erik L. L.
Forslund, Kristoffer
Benchmarking the next generation of homology inference tools
title Benchmarking the next generation of homology inference tools
title_full Benchmarking the next generation of homology inference tools
title_fullStr Benchmarking the next generation of homology inference tools
title_full_unstemmed Benchmarking the next generation of homology inference tools
title_short Benchmarking the next generation of homology inference tools
title_sort benchmarking the next generation of homology inference tools
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013910/
https://www.ncbi.nlm.nih.gov/pubmed/27256311
http://dx.doi.org/10.1093/bioinformatics/btw305
work_keys_str_mv AT saripellaganapathivarma benchmarkingthenextgenerationofhomologyinferencetools
AT sonnhammererikll benchmarkingthenextgenerationofhomologyinferencetools
AT forslundkristoffer benchmarkingthenextgenerationofhomologyinferencetools