
Efficient inference of homologs in large eukaryotic pan-proteomes


Bibliographic Details
Main Authors: Sheikhizadeh Anari, Siavash; de Ridder, Dick; Schranz, M. Eric; Smit, Sandra
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6158922/
https://www.ncbi.nlm.nih.gov/pubmed/30257640
http://dx.doi.org/10.1186/s12859-018-2362-4
author Sheikhizadeh Anari, Siavash
de Ridder, Dick
Schranz, M. Eric
Smit, Sandra
collection PubMed
description BACKGROUND: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data.
RESULTS: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa.
CONCLUSIONS: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2362-4) contains supplementary material, which is available to authorized users.
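The k-mer based reduction of pairwise alignments mentioned in the abstract can be illustrated with a minimal sketch. This is not the PanTools implementation: the function names `kmers` and `candidate_pairs`, the parameters `k` and `min_shared`, and the toy sequences are all hypothetical, chosen only to show the general idea of aligning only those protein pairs that share enough k-mers.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(proteins, k=3, min_shared=2):
    """Return pairs of protein ids that share at least `min_shared` distinct k-mers.

    Illustrative prefilter only (hypothetical parameters): pairs that are
    not returned would be skipped instead of being aligned.
    """
    # Inverted index: k-mer -> ids of the proteins that contain it.
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(pid)

    # Count the distinct k-mers shared by each pair of proteins.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


# Toy sequences (made up for illustration, not taken from the article).
proteins = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQL",
    "p3": "GGSWWPLLNN",
}
print(candidate_pairs(proteins))  # → {('p1', 'p2')}
```

In this toy case an all-against-all comparison would align three pairs, while the prefilter leaves one. Raising `min_shared` makes the filter stricter and lowering it makes it more permissive, which is one simple way a recall/precision trade-off of the kind the abstract describes could arise.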
format Online
Article
Text
id pubmed-6158922
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6158922 2018-10-01 BMC Bioinformatics Methodology Article BioMed Central 2018-09-26 /pmc/articles/PMC6158922/ /pubmed/30257640 http://dx.doi.org/10.1186/s12859-018-2362-4 Text en © The Author(s) 2018. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Efficient inference of homologs in large eukaryotic pan-proteomes
topic Methodology Article