Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences

Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains...

Bibliographic Details
Main Authors:	Doğan, Tunca, Karaçalı, Bilge
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3771926/ https://www.ncbi.nlm.nih.gov/pubmed/24069417 http://dx.doi.org/10.1371/journal.pone.0075458

_version_	1782284246979706880
author	Doğan, Tunca Karaçalı, Bilge
author_facet	Doğan, Tunca Karaçalı, Bilge
author_sort	Doğan, Tunca
collection	PubMed
description	Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
format	Online Article Text
id	pubmed-3771926
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3771926 2013-09-25 Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences Doğan, Tunca Karaçalı, Bilge PLoS One Research Article
Public Library of Science 2013-09-12 /pmc/articles/PMC3771926/ /pubmed/24069417 http://dx.doi.org/10.1371/journal.pone.0075458 Text en © 2013 Doğan, Karaçalı http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Doğan, Tunca Karaçalı, Bilge Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title_full	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title_fullStr	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title_full_unstemmed	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title_short	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
title_sort	automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3771926/ https://www.ncbi.nlm.nih.gov/pubmed/24069417 http://dx.doi.org/10.1371/journal.pone.0075458
work_keys_str_mv	AT dogantunca automaticidentificationofhighlyconservedfamilyregionsandrelationshipsingenomewidedatasetsincludingremoteproteinsequences AT karacalıbilge automaticidentificationofhighlyconservedfamilyregionsandrelationshipsingenomewidedatasetsincludingremoteproteinsequences
