
Combining natural language processing and metabarcoding to reveal pathogen-environment associations

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year—with 180,000 resulting deaths—mostly in sub-Saharan Africa. Surprisingly, little is known about the ecologic...


Bibliographic Details
Main authors: Molik, David C., Tomlinson, DeAndre, Davitt, Shane, Morgan, Eric L., Sisk, Matthew, Roche, Benjamin, Meyers, Natalie, Pfrender, Michael E.
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055023/
https://www.ncbi.nlm.nih.gov/pubmed/33826634
http://dx.doi.org/10.1371/journal.pntd.0008755
author Molik, David C.
Tomlinson, DeAndre
Davitt, Shane
Morgan, Eric L.
Sisk, Matthew
Roche, Benjamin
Meyers, Natalie
Pfrender, Michael E.
collection PubMed
description Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year—with 180,000 resulting deaths—mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene-regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics, the number of topics, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.
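The description states that the number of topics was chosen from the model coherence score but does not say which coherence measure was used. The following is a minimal sketch of that selection step, assuming the UMass coherence measure as a stand-in. The function names (`umass_coherence`, `mean_coherence`, `pick_num_topics`), the toy corpus, and the candidate topic word lists are all hypothetical and are not taken from the article.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (assumed measure, not confirmed by the source).

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs with l < m, where
    D counts the documents containing all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # pairs with l < m
        w_l, w_m = top_words[l], top_words[m]
        if doc_freq(w_l) == 0:
            raise ValueError(f"word {w_l!r} does not occur in the corpus")
        score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


def mean_coherence(topics, documents):
    """Average the per-topic coherence over all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


def pick_num_topics(candidates, documents):
    """Return the topic count whose model has the highest mean coherence.

    `candidates` maps a topic count to that model's list of top-word lists.
    """
    return max(candidates, key=lambda k: mean_coherence(candidates[k], documents))


# Hypothetical tokenized corpus, invented for illustration only.
docs = [
    ["soil", "wood", "decay", "fungi"],
    ["soil", "wood", "fungi"],
    ["marine", "plankton", "water"],
    ["marine", "water", "sediment"],
]

# Hypothetical top words from two fitted models (1 topic vs. 2 topics).
candidates = {
    1: [["soil", "water"]],
    2: [["soil", "wood"], ["marine", "water"]],
}

best_k = pick_num_topics(candidates, docs)
print(best_k)
```

In practice the candidate word lists would come from Latent Dirichlet Allocation models fitted at each topic count; here they are typed in by hand so that the scoring step can be read on its own.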
format Online
Article
Text
id pubmed-8055023
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8055023 2021-04-30 PLoS Negl Trop Dis Research Article Public Library of Science 2021-04-07 /pmc/articles/PMC8055023/ /pubmed/33826634 http://dx.doi.org/10.1371/journal.pntd.0008755 Text en © 2021 Molik et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Combining natural language processing and metabarcoding to reveal pathogen-environment associations
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055023/
https://www.ncbi.nlm.nih.gov/pubmed/33826634
http://dx.doi.org/10.1371/journal.pntd.0008755