Cargando…

Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm

BACKGROUND: Clustering protein sequences according to inferred homology is a fundamental step in the analysis of many large data sets. Since the publication of the Markov Clustering (MCL) algorithm in 2002, it has been the centerpiece of several popular applications. Each of these approaches generat...

Descripción completa

Detalles Bibliográficos
Autores principales: Gibbons, Theodore R., Mount, Stephen M., Cooper, Endymion D., Delwiche, Charles F.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2015
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496851/
https://www.ncbi.nlm.nih.gov/pubmed/26160651
http://dx.doi.org/10.1186/s12859-015-0625-x
_version_ 1782380470660497408
author Gibbons, Theodore R.
Mount, Stephen M.
Cooper, Endymion D.
Delwiche, Charles F.
author_facet Gibbons, Theodore R.
Mount, Stephen M.
Cooper, Endymion D.
Delwiche, Charles F.
author_sort Gibbons, Theodore R.
collection PubMed
description BACKGROUND: Clustering protein sequences according to inferred homology is a fundamental step in the analysis of many large data sets. Since the publication of the Markov Clustering (MCL) algorithm in 2002, it has been the centerpiece of several popular applications. Each of these approaches generates an undirected graph that represents sequences as nodes connected to each other by edges weighted with a BLAST-based metric. MCL is then used to infer clusters of homologous proteins by analyzing these graphs. The various approaches differ only by how they weight the edges, yet there has been very little direct examination of the relative performance of alternative edge-weighting metrics. This study compares the performance of four BLAST-based edge-weighting metrics: the bit score, bit score ratio (BSR), bit score over anchored length (BAL), and negative common log of the expectation value (NLE). Performance is tested using the Extended CEGMA KOGs (ECK) database, which we introduce here. RESULTS: All metrics performed similarly when analyzing full-length sequences, but dramatic differences emerged as progressively larger fractions of the test sequences were split into fragments. The BSR and BAL successfully rescued subsets of clusters by strengthening certain types of alignments between fragmented sequences, but also shifted the largest correct scores down near the range of scores generated from spurious alignments. This penalty outweighed the benefits in most test cases, and was greatly exacerbated by increasing the MCL inflation parameter, making these metrics less robust than the bit score or the more popular NLE. Notably, the bit score performed as well or better than the other three metrics in all scenarios. CONCLUSIONS: The results provide a strong case for use of the bit score, which appears to offer equivalent or superior performance to the more popular NLE. The insight that MCL-based clustering methods can be improved using a more tractable edge-weighting metric will greatly simplify future implementations. We demonstrate this with our own minimalist Python implementation: Porthos, which uses only standard libraries and can process a graph with 25 m + edges connecting the 60 k + KOG sequences in half a minute using less than half a gigabyte of memory. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0625-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4496851
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-44968512015-07-10 Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm Gibbons, Theodore R. Mount, Stephen M. Cooper, Endymion D. Delwiche, Charles F. BMC Bioinformatics Research Article BACKGROUND: Clustering protein sequences according to inferred homology is a fundamental step in the analysis of many large data sets. Since the publication of the Markov Clustering (MCL) algorithm in 2002, it has been the centerpiece of several popular applications. Each of these approaches generates an undirected graph that represents sequences as nodes connected to each other by edges weighted with a BLAST-based metric. MCL is then used to infer clusters of homologous proteins by analyzing these graphs. The various approaches differ only by how they weight the edges, yet there has been very little direct examination of the relative performance of alternative edge-weighting metrics. This study compares the performance of four BLAST-based edge-weighting metrics: the bit score, bit score ratio (BSR), bit score over anchored length (BAL), and negative common log of the expectation value (NLE). Performance is tested using the Extended CEGMA KOGs (ECK) database, which we introduce here. RESULTS: All metrics performed similarly when analyzing full-length sequences, but dramatic differences emerged as progressively larger fractions of the test sequences were split into fragments. The BSR and BAL successfully rescued subsets of clusters by strengthening certain types of alignments between fragmented sequences, but also shifted the largest correct scores down near the range of scores generated from spurious alignments. This penalty outweighed the benefits in most test cases, and was greatly exacerbated by increasing the MCL inflation parameter, making these metrics less robust than the bit score or the more popular NLE. Notably, the bit score performed as well or better than the other three metrics in all scenarios. CONCLUSIONS: The results provide a strong case for use of the bit score, which appears to offer equivalent or superior performance to the more popular NLE. The insight that MCL-based clustering methods can be improved using a more tractable edge-weighting metric will greatly simplify future implementations. We demonstrate this with our own minimalist Python implementation: Porthos, which uses only standard libraries and can process a graph with 25 m + edges connecting the 60 k + KOG sequences in half a minute using less than half a gigabyte of memory. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0625-x) contains supplementary material, which is available to authorized users. BioMed Central 2015-07-10 /pmc/articles/PMC4496851/ /pubmed/26160651 http://dx.doi.org/10.1186/s12859-015-0625-x Text en © Gibbons et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Gibbons, Theodore R.
Mount, Stephen M.
Cooper, Endymion D.
Delwiche, Charles F.
Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title_full Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title_fullStr Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title_full_unstemmed Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title_short Evaluation of BLAST-based edge-weighting metrics used for homology inference with the Markov Clustering algorithm
title_sort evaluation of blast-based edge-weighting metrics used for homology inference with the markov clustering algorithm
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496851/
https://www.ncbi.nlm.nih.gov/pubmed/26160651
http://dx.doi.org/10.1186/s12859-015-0625-x
work_keys_str_mv AT gibbonstheodorer evaluationofblastbasededgeweightingmetricsusedforhomologyinferencewiththemarkovclusteringalgorithm
AT mountstephenm evaluationofblastbasededgeweightingmetricsusedforhomologyinferencewiththemarkovclusteringalgorithm
AT cooperendymiond evaluationofblastbasededgeweightingmetricsusedforhomologyinferencewiththemarkovclusteringalgorithm
AT delwichecharlesf evaluationofblastbasededgeweightingmetricsusedforhomologyinferencewiththemarkovclusteringalgorithm