OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches
MOTIVATION: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However...
Main authors: | Rossier, Victor; Warwick Vesztrocy, Alex; Robinson-Rechavi, Marc; Dessimoz, Christophe |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8479680/ https://www.ncbi.nlm.nih.gov/pubmed/33787851 http://dx.doi.org/10.1093/bioinformatics/btab219 |
_version_ | 1784576310264201216 |
---|---|
author | Rossier, Victor; Warwick Vesztrocy, Alex; Robinson-Rechavi, Marc; Dessimoz, Christophe |
author_sort | Rossier, Victor |
collection | PubMed |
description | MOTIVATION: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive. RESULTS: Here, we first show that in multiple animal and plant datasets, 18–62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. AVAILABILITY AND IMPLEMENTATION: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8479680 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8479680 2021-09-30 OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches Rossier, Victor; Warwick Vesztrocy, Alex; Robinson-Rechavi, Marc; Dessimoz, Christophe Bioinformatics Original Papers Oxford University Press 2021-03-31 /pmc/articles/PMC8479680/ /pubmed/33787851 http://dx.doi.org/10.1093/bioinformatics/btab219 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8479680/ https://www.ncbi.nlm.nih.gov/pubmed/33787851 http://dx.doi.org/10.1093/bioinformatics/btab219 |
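
The hemoglobin example in the abstract can be illustrated with a toy sketch. This is not OMAmer's algorithm or API: the sequences, the `globin`/`alpha`/`beta` labels, the k-mer length, the 0.3 threshold, and both functions are hypothetical, chosen only to show how a closest-sequence rule can force an over-specific subfamily while a rule that requires subfamily-specific k-mer support can stay at the family level.

```python
from typing import Dict, List, Set

K = 3  # toy k-mer length (assumption; chosen to suit the short sequences below)

# Hypothetical reference: one family with two subfamilies of made-up sequences.
FAMILY = "globin"
REFERENCE: Dict[str, List[str]] = {
    "alpha": ["AAACWGKVG", "AAADWGKVG"],
    "beta": ["TTTCWGKVG", "TTTDWGKVG"],
}


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_sequence(query: str) -> str:
    """Return the subfamily of the reference with the highest k-mer Jaccard similarity."""
    q = kmers(query)

    def jaccard(ref: str) -> float:
        r = kmers(ref)
        return len(q & r) / len(q | r)

    scored = [(jaccard(ref), sub) for sub, refs in REFERENCE.items() for ref in refs]
    return max(scored)[1]


def tree_aware(query: str, threshold: float = 0.3) -> str:
    """Descend to a subfamily only if enough query k-mers are specific to it."""
    q = kmers(query)
    sub_kmers = {
        sub: set().union(*(kmers(r) for r in refs)) for sub, refs in REFERENCE.items()
    }
    best_sub, best_score = FAMILY, 0.0
    for sub, ks in sub_kmers.items():
        others = set().union(*(v for s, v in sub_kmers.items() if s != sub))
        specific = ks - others  # k-mers seen in this subfamily only
        score = len(q & specific) / len(q)
        if score > best_score:
            best_sub, best_score = sub, score
    # Without enough subfamily-specific support, stay at the family level.
    return best_sub if best_score >= threshold else FAMILY


outgroup = "PPACWGKVG"    # shares the conserved core, few alpha-specific k-mers
alpha_like = "AAACWGKVA"  # carries several alpha-specific k-mers
print(closest_sequence(outgroup), tree_aware(outgroup))
print(closest_sequence(alpha_like), tree_aware(alpha_like))
```

In this toy setup the closest-sequence rule always returns some subfamily, because it has no way to stop at the family level, whereas the threshold on subfamily-specific k-mers gives the second rule that option.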