A machine learning-enabled open biodata resource inventory from the scientific literature

Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has...

Bibliographic Details
Main Authors:	Imker, Heidi J.; Schackart, Kenneth E.; Istrate, Ana-Maria; Cook, Charles E.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10684096/ https://www.ncbi.nlm.nih.gov/pubmed/38015968 http://dx.doi.org/10.1371/journal.pone.0294812

author	Imker, Heidi J.; Schackart, Kenneth E.; Istrate, Ana-Maria; Cook, Charles E.
collection	PubMed
description	Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled incredible research, sustained support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure: how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, and we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources was then obtained using article metadata, such as funder and geolocation information. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011 to 2021. The code was developed to facilitate reuse and includes automated pipelines. All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).
format	Online Article Text
id	pubmed-10684096
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10684096 2023-11-30 A machine learning-enabled open biodata resource inventory from the scientific literature Imker, Heidi J.; Schackart, Kenneth E.; Istrate, Ana-Maria; Cook, Charles E. PLoS One Research Article Public Library of Science 2023-11-28 /pmc/articles/PMC10684096/ /pubmed/38015968 http://dx.doi.org/10.1371/journal.pone.0294812 Text en © 2023 Imker et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	A machine learning-enabled open biodata resource inventory from the scientific literature
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10684096/ https://www.ncbi.nlm.nih.gov/pubmed/38015968 http://dx.doi.org/10.1371/journal.pone.0294812