A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications

Bibliographic Details
Main Authors: Nassar, Maaly; Rogers, Alexander B; Talo', Francesco; Sanchez, Santiago; Shafique, Zunaira; Finn, Robert D; McEntyre, Johanna
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366992/
https://www.ncbi.nlm.nih.gov/pubmed/35950838
http://dx.doi.org/10.1093/gigascience/giac077
_version_ 1784765690580828160
author Nassar, Maaly
Rogers, Alexander B
Talo', Francesco
Sanchez, Santiago
Shafique, Zunaira
Finn, Robert D
McEntyre, Johanna
author_sort Nassar, Maaly
collection PubMed
description Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally/taxonomically), either from a longitudinal study or cross-sectional studies, can provide clues into how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. Nevertheless, these missing metadata may be found in the narrative of publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve in the MGnify and Europe PMC websites, as well as Europe PMC annotations API.
format Online
Article
Text
id pubmed-9366992
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9366992 2022-08-12 A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications Nassar, Maaly Rogers, Alexander B Talo', Francesco Sanchez, Santiago Shafique, Zunaira Finn, Robert D McEntyre, Johanna Gigascience Research Oxford University Press 2022-08-11 /pmc/articles/PMC9366992/ /pubmed/35950838 http://dx.doi.org/10.1093/gigascience/giac077 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications
title_sort machine learning framework for discovery and enrichment of metagenomics metadata from open access publications
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366992/
https://www.ncbi.nlm.nih.gov/pubmed/35950838
http://dx.doi.org/10.1093/gigascience/giac077