Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs

Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to impro...

Bibliographic Details
Main authors:	Mahmood, Khalid; Webb, Geoffrey I.; Song, Jiangning; Whisstock, James C.; Konagurthu, Arun S.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2012
Subjects:	Methods Online
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315314/ https://www.ncbi.nlm.nih.gov/pubmed/22210858 http://dx.doi.org/10.1093/nar/gkr1261

author	Mahmood, Khalid; Webb, Geoffrey I.; Song, Jiangning; Whisstock, James C.; Konagurthu, Arun S.
collection	PubMed
description	Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families, with a prevalence of complex evolutionary events such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
format	Online Article Text
id	pubmed-3315314
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3315314 2012-03-30 Nucleic Acids Res Methods Online Oxford University Press 2012-03 2011-12-29 /pmc/articles/PMC3315314/ /pubmed/22210858 http://dx.doi.org/10.1093/nar/gkr1261 Text en © The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs
topic	Methods Online
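
The description above mentions a sort-join method based on k-mer counts for comparing all pairs of sequences. As a rough illustration of that general idea only, the following Python sketch lists (k-mer, sequence id) records, sorts them so identical k-mers become adjacent, and then joins the ids within each run to count shared k-mers per pair. The function names, the default k=3, and the choice to count distinct k-mers are assumptions made for this example; they are not taken from the afree implementation, which may differ substantially.

```python
from collections import defaultdict
from itertools import combinations, groupby
from operator import itemgetter


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence (hypothetical helper)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def shared_kmer_counts(sequences, k=3):
    """Count distinct k-mers shared by each pair of sequences.

    `sequences` maps a sequence id to its residue string.
    Illustrative sketch only; not the afree algorithm itself.
    """
    # One (k-mer, sequence id) record per distinct k-mer per sequence.
    records = [(kmer, sid)
               for sid, seq in sequences.items()
               for kmer in set(kmers(seq, k))]
    # Sort step: identical k-mers become adjacent, ids sorted within a run.
    records.sort()
    # Join step: every pair of ids in a run of equal k-mers shares that k-mer.
    counts = defaultdict(int)
    for _, run in groupby(records, key=itemgetter(0)):
        ids = [sid for _, sid in run]
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
    return dict(counts)
```

Pairs whose shared k-mer count exceeds some threshold could then be treated as candidate homologs; the threshold and any normalisation by sequence length are left open here because the record does not specify them.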
