Cargando…

Algorithms for matching partially labelled sequence graphs

BACKGROUND: In order to find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that the sequences that are joined together belong to those that interact in their species of origin. W...

Descripción completa

Detalles Bibliográficos
Autor principal:	Taylor, William R.
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2017
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5613400/ https://www.ncbi.nlm.nih.gov/pubmed/29021818 http://dx.doi.org/10.1186/s13015-017-0115-y

_version_	1783266246986301440
author	Taylor, William R.
author_facet	Taylor, William R.
author_sort	Taylor, William R.
collection	PubMed
description	BACKGROUND: In order to find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that the sequences that are joined together belong to those that interact in their species of origin. When each protein is unique then the species name is sufficient to guide this match, however, when there are multiple related sequences (paralogs) in each species then the pairing is more difficult. In bacteria a good guide can be gained from genome co-location as interacting proteins tend to be in a common operon but in eukaryotes this simple principle is not sufficient. RESULTS: The methods developed in this paper take sets of paralogs for different proteins found in the same species and make a pairing based on their evolutionary distance relative to a set of other proteins that are unique and so have a known relationship (singletons). The former constitute a set of unlabelled nodes in a graph while the latter are labelled. Two variants were tested, one based on a phylogenetic tree of the sequences (the topology-based method) and a simpler, faster variant based only on the inter-sequence distances (the distance-based method). Over a set of test proteins, both gave good results, with the topology method performing slightly better. CONCLUSIONS: The methods develop here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever growing numbers of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-017-0115-y) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5613400
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-56134002017-10-11 Algorithms for matching partially labelled sequence graphs Taylor, William R. Algorithms Mol Biol Research BACKGROUND: In order to find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that the sequences that are joined together belong to those that interact in their species of origin. When each protein is unique then the species name is sufficient to guide this match, however, when there are multiple related sequences (paralogs) in each species then the pairing is more difficult. In bacteria a good guide can be gained from genome co-location as interacting proteins tend to be in a common operon but in eukaryotes this simple principle is not sufficient. RESULTS: The methods developed in this paper take sets of paralogs for different proteins found in the same species and make a pairing based on their evolutionary distance relative to a set of other proteins that are unique and so have a known relationship (singletons). The former constitute a set of unlabelled nodes in a graph while the latter are labelled. Two variants were tested, one based on a phylogenetic tree of the sequences (the topology-based method) and a simpler, faster variant based only on the inter-sequence distances (the distance-based method). Over a set of test proteins, both gave good results, with the topology method performing slightly better. CONCLUSIONS: The methods develop here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever growing numbers of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-017-0115-y) contains supplementary material, which is available to authorized users. BioMed Central 2017-09-25 /pmc/articles/PMC5613400/ /pubmed/29021818 http://dx.doi.org/10.1186/s13015-017-0115-y Text en © The Author(s) 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Taylor, William R. Algorithms for matching partially labelled sequence graphs
title	Algorithms for matching partially labelled sequence graphs
title_full	Algorithms for matching partially labelled sequence graphs
title_fullStr	Algorithms for matching partially labelled sequence graphs
title_full_unstemmed	Algorithms for matching partially labelled sequence graphs
title_short	Algorithms for matching partially labelled sequence graphs
title_sort	algorithms for matching partially labelled sequence graphs
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5613400/ https://www.ncbi.nlm.nih.gov/pubmed/29021818 http://dx.doi.org/10.1186/s13015-017-0115-y
work_keys_str_mv	AT taylorwilliamr algorithmsformatchingpartiallylabelledsequencegraphs

Algorithms for matching partially labelled sequence graphs

Ejemplares similares