Benchmarking the next generation of homology inference tools
Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs...
Main authors: | Saripella, Ganapathi Varma; Sonnhammer, Erik L. L.; Forslund, Kristoffer |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013910/ https://www.ncbi.nlm.nih.gov/pubmed/27256311 http://dx.doi.org/10.1093/bioinformatics/btw305 |
_version_ | 1782452237329498112 |
---|---|
author | Saripella, Ganapathi Varma Sonnhammer, Erik L. L. Forslund, Kristoffer |
author_facet | Saripella, Ganapathi Varma Sonnhammer, Erik L. L. Forslund, Kristoffer |
author_sort | Saripella, Ganapathi Varma |
collection | PubMed |
description | Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark). Contact: forslund@embl.de Supplementary information: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-5013910 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5013910 2016-09-12 Benchmarking the next generation of homology inference tools Saripella, Ganapathi Varma Sonnhammer, Erik L. L. Forslund, Kristoffer Bioinformatics Original Papers Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark). Contact: forslund@embl.de Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2016-09-01 2016-06-01 /pmc/articles/PMC5013910/ /pubmed/27256311 http://dx.doi.org/10.1093/bioinformatics/btw305 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Saripella, Ganapathi Varma Sonnhammer, Erik L. L. Forslund, Kristoffer Benchmarking the next generation of homology inference tools |
title | Benchmarking the next generation of homology inference tools |
title_full | Benchmarking the next generation of homology inference tools |
title_fullStr | Benchmarking the next generation of homology inference tools |
title_full_unstemmed | Benchmarking the next generation of homology inference tools |
title_short | Benchmarking the next generation of homology inference tools |
title_sort | benchmarking the next generation of homology inference tools |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013910/ https://www.ncbi.nlm.nih.gov/pubmed/27256311 http://dx.doi.org/10.1093/bioinformatics/btw305 |
work_keys_str_mv | AT saripellaganapathivarma benchmarkingthenextgenerationofhomologyinferencetools AT sonnhammererikll benchmarkingthenextgenerationofhomologyinferencetools AT forslundkristoffer benchmarkingthenextgenerationofhomologyinferencetools |