Seqenv: linking sequences to environments through text mining
Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are of...
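Main authors: | Sinclair, Lucas; Ijaz, Umer Z.; Jensen, Lars Juhl; Coolen, Marco J.L.; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos |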
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2016 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5178346/ https://www.ncbi.nlm.nih.gov/pubmed/28028456 http://dx.doi.org/10.7717/peerj.2690 |
_version_ | 1782485162650501120 |
---|---|
author | Sinclair, Lucas; Ijaz, Umer Z.; Jensen, Lars Juhl; Coolen, Marco J.L.; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos |
author_facet | Sinclair, Lucas; Ijaz, Umer Z.; Jensen, Lars Juhl; Coolen, Marco J.L.; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos |
author_sort | Sinclair, Lucas |
collection | PubMed |
description | Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, out of every hit, extracts–if it is available–the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv. |
format | Online Article Text |
id | pubmed-5178346 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-5178346 2016-12-27 Seqenv: linking sequences to environments through text mining Sinclair, Lucas Ijaz, Umer Z. Jensen, Lars Juhl Coolen, Marco J.L. Gubry-Rangin, Cecile Chroňáková, Alica Oulas, Anastasis Pavloudi, Christina Schnetzer, Julia Weimann, Aaron Ijaz, Ali Eiler, Alexander Quince, Christopher Pafilis, Evangelos PeerJ Bioinformatics Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, out of every hit, extracts–if it is available–the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv. PeerJ Inc. 2016-12-20 /pmc/articles/PMC5178346/ /pubmed/28028456 http://dx.doi.org/10.7717/peerj.2690 Text en ©2016 Sinclair et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Bioinformatics Sinclair, Lucas Ijaz, Umer Z. Jensen, Lars Juhl Coolen, Marco J.L. Gubry-Rangin, Cecile Chroňáková, Alica Oulas, Anastasis Pavloudi, Christina Schnetzer, Julia Weimann, Aaron Ijaz, Ali Eiler, Alexander Quince, Christopher Pafilis, Evangelos Seqenv: linking sequences to environments through text mining |
title | Seqenv: linking sequences to environments through text mining |
title_full | Seqenv: linking sequences to environments through text mining |
title_fullStr | Seqenv: linking sequences to environments through text mining |
title_full_unstemmed | Seqenv: linking sequences to environments through text mining |
title_short | Seqenv: linking sequences to environments through text mining |
title_sort | seqenv: linking sequences to environments through text mining |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5178346/ https://www.ncbi.nlm.nih.gov/pubmed/28028456 http://dx.doi.org/10.7717/peerj.2690 |
work_keys_str_mv | AT sinclairlucas seqenvlinkingsequencestoenvironmentsthroughtextmining AT ijazumerz seqenvlinkingsequencestoenvironmentsthroughtextmining AT jensenlarsjuhl seqenvlinkingsequencestoenvironmentsthroughtextmining AT coolenmarcojl seqenvlinkingsequencestoenvironmentsthroughtextmining AT gubryrangincecile seqenvlinkingsequencestoenvironmentsthroughtextmining AT chronakovaalica seqenvlinkingsequencestoenvironmentsthroughtextmining AT oulasanastasis seqenvlinkingsequencestoenvironmentsthroughtextmining AT pavloudichristina seqenvlinkingsequencestoenvironmentsthroughtextmining AT schnetzerjulia seqenvlinkingsequencestoenvironmentsthroughtextmining AT weimannaaron seqenvlinkingsequencestoenvironmentsthroughtextmining AT ijazali seqenvlinkingsequencestoenvironmentsthroughtextmining AT eileralexander seqenvlinkingsequencestoenvironmentsthroughtextmining AT quincechristopher seqenvlinkingsequencestoenvironmentsthroughtextmining AT pafilisevangelos seqenvlinkingsequencestoenvironmentsthroughtextmining |
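
The description field in this record outlines a pipeline: collect isolation-source text from similarity-search hits, match words against a controlled vocabulary, and summarize whole samples by weighted summation. A minimal Python sketch of the last two steps follows, under clearly stated assumptions: the five-word `VOCABULARY` is a hypothetical stand-in for EnvO, the function names are invented for illustration, and plain word lookup stands in for whatever text mining algorithm seqenv actually uses. It does not reproduce seqenv's API or implementation.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-vocabulary standing in for EnvO terms (not the real ontology).
VOCABULARY = {"soil", "sediment", "lake", "marine", "forest"}


def terms_in(text):
    """Return the vocabulary terms found in one isolation-source string."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in VOCABULARY]


def sequence_profile(isolation_sources):
    """Normalized term frequencies over all hits of one sequence."""
    counts = Counter()
    for text in isolation_sources:
        counts.update(terms_in(text))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Abundance-weighted sum of per-sequence profiles for one sample."""
    result = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for term, freq in profiles.get(seq_id, {}).items():
            result[term] += abundance * freq
    return dict(result)


# Invented example data: isolation-source strings per sequence, and read counts.
hits = {
    "otu1": ["forest soil", "agricultural soil"],
    "otu2": ["lake sediment"],
}
profiles = {seq: sequence_profile(srcs) for seq, srcs in hits.items()}
sample = sample_profile({"otu1": 3, "otu2": 1}, profiles)
print(sample)
```

Normalizing each sequence's term counts before weighting keeps a sequence with many database hits from dominating the sample summary; whether seqenv normalizes in this way is an assumption of the sketch, not something the record states.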