Improving Genome-Wide Scans of Positive Selection by Using Protein Isoforms of Similar Length

Large-scale evolutionary studies often require the automated construction of alignments of a large number of homologous gene families. The majority of eukaryotic genes can produce different transcripts due to alternative splicing or transcription initiation, and many such transcripts encode differen...

Bibliographic Details
Main Authors:	Villanueva-Cañas, José Luis; Laurie, Steve; Albà, M. Mar
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3590775/ https://www.ncbi.nlm.nih.gov/pubmed/23377868 http://dx.doi.org/10.1093/gbe/evt017

author	Villanueva-Cañas, José Luis; Laurie, Steve; Albà, M. Mar
collection	PubMed
description	Large-scale evolutionary studies often require the automated construction of alignments of a large number of homologous gene families. The majority of eukaryotic genes can produce different transcripts due to alternative splicing or transcription initiation, and many such transcripts encode different protein isoforms. As analyses tend to be gene centered, one single-protein isoform per gene is selected for the alignment, with the de facto approach being to use the longest protein isoform per gene (Longest), presumably to avoid including partial sequences and to maximize sequence information. Here, we show that this approach is problematic because it increases the number of indels in the alignments due to the inclusion of nonhomologous regions, such as those derived from species-specific exons, increasing the number of misaligned positions. With the aim of ameliorating this problem, we have developed a novel heuristic, Protein ALignment Optimizer (PALO), which, for each gene family, selects the combination of protein isoforms that are most similar in length. We examine several evolutionary parameters inferred from alignments in which the only difference is the method used to select the protein isoform combination: Longest, PALO, the combination that results in the highest sequence conservation, and a randomly selected combination. We observe that Longest tends to overestimate both nonsynonymous and synonymous substitution rates when compared with PALO, which is most likely due to an excess of misaligned positions. The estimation of the fraction of genes that have experienced positive selection by maximum likelihood is very sensitive to the method of isoform selection employed, both when alignments are constructed with MAFFT and with Prank(+F). Longest performs better than a random combination but still estimates up to 3 times more positively selected genes than the combination showing the highest conservation, indicating the presence of many false positives. We show that PALO can eliminate the majority of such false positives and thus that it is a more appropriate approach for large-scale analyses than Longest. A web server has been set up to facilitate the use of PALO given a user-defined set of gene families; it is available at http://evolutionarygenomics.imim.es/palo.
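The isoform-selection idea in the description above can be illustrated with a minimal Python sketch. It is not the authors' PALO implementation: the function name `select_similar_length_isoforms`, the input layout, and the similarity criterion (the smallest difference between the longest and shortest chosen isoform) are all assumptions made for illustration, and the exhaustive search is only practical for small families.

```python
from itertools import product


def select_similar_length_isoforms(family):
    """Pick one isoform per gene so that the chosen lengths are as close as possible.

    `family` maps a gene name to a dict of {isoform_id: protein_length}.
    Assumption: "most similar in length" is scored as max(length) - min(length);
    the published heuristic may use a different criterion.
    """
    if not family or any(not isoforms for isoforms in family.values()):
        raise ValueError("every gene needs at least one isoform")

    genes = sorted(family)
    best_combo, best_spread = None, None
    # Enumerate every combination of one isoform per gene (brute force).
    for combo in product(*(sorted(family[g].items()) for g in genes)):
        lengths = [length for _, length in combo]
        spread = max(lengths) - min(lengths)
        if best_spread is None or spread < best_spread:
            best_combo, best_spread = combo, spread

    chosen = {gene: iso_id for gene, (iso_id, _) in zip(genes, best_combo)}
    return chosen, best_spread


# Hypothetical gene family with made-up isoform lengths (amino acids).
example_family = {
    "human": {"h1": 500, "h2": 420},
    "mouse": {"m1": 415, "m2": 610},
    "dog": {"d1": 430},
}
chosen, spread = select_similar_length_isoforms(example_family)
print(chosen, spread)
```

In this made-up family, taking the longest isoform of each gene would combine lengths 500, 610 and 430, whereas the sketch picks 420, 415 and 430, which leaves fewer unmatched residues to be aligned against gaps.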
format	Online Article Text
id	pubmed-3590775
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3590775 2013-03-07 Improving Genome-Wide Scans of Positive Selection by Using Protein Isoforms of Similar Length Villanueva-Cañas, José Luis; Laurie, Steve; Albà, M. Mar Genome Biol Evol Research Article Oxford University Press 2013 2013-01-31 /pmc/articles/PMC3590775/ /pubmed/23377868 http://dx.doi.org/10.1093/gbe/evt017 Text en © The Author(s) 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. http://creativecommons.org/licenses/by-nc/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Improving Genome-Wide Scans of Positive Selection by Using Protein Isoforms of Similar Length
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3590775/ https://www.ncbi.nlm.nih.gov/pubmed/23377868 http://dx.doi.org/10.1093/gbe/evt017

Similar Items