Seqenv: linking sequences to environments through text mining

Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are of...

Bibliographic Details
Main authors: Sinclair, Lucas, Ijaz, Umer Z., Jensen, Lars Juhl, Coolen, Marco J.L., Gubry-Rangin, Cecile, Chroňáková, Alica, Oulas, Anastasis, Pavloudi, Christina, Schnetzer, Julia, Weimann, Aaron, Ijaz, Ali, Eiler, Alexander, Quince, Christopher, Pafilis, Evangelos
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5178346/
https://www.ncbi.nlm.nih.gov/pubmed/28028456
http://dx.doi.org/10.7717/peerj.2690
author Sinclair, Lucas
Ijaz, Umer Z.
Jensen, Lars Juhl
Coolen, Marco J.L.
Gubry-Rangin, Cecile
Chroňáková, Alica
Oulas, Anastasis
Pavloudi, Christina
Schnetzer, Julia
Weimann, Aaron
Ijaz, Ali
Eiler, Alexander
Quince, Christopher
Pafilis, Evangelos
collection PubMed
description Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, for every hit, extracts the textual metadata field describing the isolation source, if it is available. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environment Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv: one to a survey of ammonia-oxidizing archaea and one to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS data and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
format Online
Article
Text
id pubmed-5178346
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-5178346 2016-12-27 PeerJ Bioinformatics PeerJ Inc. 2016-12-20 /pmc/articles/PMC5178346/ /pubmed/28028456 http://dx.doi.org/10.7717/peerj.2690 Text en ©2016 Sinclair et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Seqenv: linking sequences to environments through text mining
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5178346/
https://www.ncbi.nlm.nih.gov/pubmed/28028456
http://dx.doi.org/10.7717/peerj.2690
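The last two steps summarized in the description field (tagging isolation-source text with environment terms, then summarizing a sample by weighted summation) can be illustrated with a minimal Python sketch. This is not seqenv's actual code or API. The tiny keyword table stands in for the EnvO vocabulary, the normalization scheme is an assumption, and the function names tag_text, sequence_profile and sample_profile are hypothetical. The similarity search is assumed to have been run already, so each sequence arrives with the isolation-source strings of its hits.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword -> concept table standing in for the EnvO vocabulary.
ENV_TERMS = {
    "soil": "soil",
    "sediment": "sediment",
    "lake": "lake",
    "marine": "marine",
    "sea": "marine",
}

def tag_text(text):
    """Return the set of concepts whose keyword occurs as a whole word in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {ENV_TERMS[w] for w in words if w in ENV_TERMS}

def sequence_profile(isolation_sources):
    """Count concepts over all hits of one sequence and normalize to sum to 1."""
    counts = Counter()
    for text in isolation_sources:
        counts.update(tag_text(text))
    total = sum(counts.values())
    return {env: n / total for env, n in counts.items()} if total else {}

def sample_profile(abundances, profiles):
    """Weighted summation: weight each sequence profile by its abundance."""
    weighted = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for env, freq in profiles.get(seq_id, {}).items():
            weighted[env] += abundance * freq
    total = sum(weighted.values())
    return {env: w / total for env, w in weighted.items()} if total else {}

# Made-up isolation sources for two made-up sequences (illustration only).
hits = {
    "otu1": ["forest soil", "agricultural soil", "lake sediment"],
    "otu2": ["Black Sea sediment", "marine sediment"],
}
profiles = {seq_id: sequence_profile(texts) for seq_id, texts in hits.items()}
sample = sample_profile({"otu1": 30, "otu2": 10}, profiles)
print(sample)
```

Normalizing each sequence profile before weighting keeps a sequence with many database hits from dominating the sample summary. Other weighting schemes are equally plausible, and the published tool may use a different one.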