phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R

Bibliographic Details
Main Authors:	Bennett, Dominic J.; Hettling, Hannes; Silvestro, Daniele; Zizka, Alexander; Bacon, Christine D.; Faurby, Søren; Vos, Rutger A.; Antonelli, Alexandre
Format:	Online Article Text
Language:	English
Published:	MDPI 2018
Subjects:	Technical Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6027284/ https://www.ncbi.nlm.nih.gov/pubmed/29874797 http://dx.doi.org/10.3390/life8020020

author	Bennett, Dominic J.; Hettling, Hannes; Silvestro, Daniele; Zizka, Alexander; Bacon, Christine D.; Faurby, Søren; Vos, Rutger A.; Antonelli, Alexandre
collection	PubMed
description	The exceptional increase in molecular DNA sequence data in open repositories is mirrored by an ever-growing interest among evolutionary biologists to harvest and use those data for phylogenetic inference. Many quality issues, however, are known and the sheer amount and complexity of data available can pose considerable barriers to their usefulness. A key issue in this domain is the high frequency of sequence mislabeling encountered when searching for suitable sequences for phylogenetic analysis. These issues include, among others, the incorrect identification of sequenced species, non-standardized and ambiguous sequence annotation, and the inadvertent addition of paralogous sequences by users. Taken together, these issues likely add considerable noise, error or bias to phylogenetic inference, a risk that is likely to increase with the size of phylogenies or the molecular datasets used to generate them. Here we present a software package, phylotaR, that bypasses the above issues by instead using an alignment search tool to identify orthologous sequences. Our package builds on the framework of its predecessor, PhyLoTa, by providing a modular pipeline for identifying overlapping sequence clusters using up-to-date GenBank data and providing new features, improvements and tools. We demonstrate and test our pipeline’s effectiveness by presenting trees generated from phylotaR clusters for two large taxonomic clades: palms and primates. Given the versatility of this package, we hope that it will become a standard tool for any research aiming to use GenBank data for phylogenetic analysis.
format	Online Article Text
id	pubmed-6027284
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-6027284 2018-07-13 Life (Basel) Technical Note MDPI 2018-06-05 /pmc/articles/PMC6027284/ /pubmed/29874797 http://dx.doi.org/10.3390/life8020020 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6027284/ https://www.ncbi.nlm.nih.gov/pubmed/29874797 http://dx.doi.org/10.3390/life8020020