
Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer

Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature sear...

Bibliographic Details
Main Authors: Lamurias, Andre; Jesus, Sofia; Neveu, Vanessa; Salek, Reza M.; Couto, Francisco M.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8417071/
https://www.ncbi.nlm.nih.gov/pubmed/34490412
http://dx.doi.org/10.3389/frma.2021.689264
_version_ 1783748311659839488
author Lamurias, Andre
Jesus, Sofia
Neveu, Vanessa
Salek, Reza M.
Couto, Francisco M.
author_facet Lamurias, Andre
Jesus, Sofia
Neveu, Vanessa
Salek, Reza M.
Couto, Francisco M.
author_sort Lamurias, Andre
collection PubMed
description Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance based on different datasets made of titles, abstracts and metadata. Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
format Online
Article
Text
id pubmed-8417071
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8417071 2021-09-05 Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer Lamurias, Andre Jesus, Sofia Neveu, Vanessa Salek, Reza M. Couto, Francisco M. Front Res Metr Anal Research Metrics and Analytics Frontiers Media S.A. 
2021-08-19 /pmc/articles/PMC8417071/ /pubmed/34490412 http://dx.doi.org/10.3389/frma.2021.689264 Text en Copyright © 2021 Lamurias, Jesus, Neveu, Salek and Couto. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Research Metrics and Analytics
Lamurias, Andre
Jesus, Sofia
Neveu, Vanessa
Salek, Reza M.
Couto, Francisco M.
Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title_full Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title_fullStr Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title_full_unstemmed Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title_short Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
title_sort information retrieval using machine learning for biomarker curation in the exposome-explorer
topic Research Metrics and Analytics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8417071/
https://www.ncbi.nlm.nih.gov/pubmed/34490412
http://dx.doi.org/10.3389/frma.2021.689264
work_keys_str_mv AT lamuriasandre informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer
AT jesussofia informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer
AT neveuvanessa informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer
AT salekrezam informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer
AT coutofranciscom informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer
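
Note on the F2-score cited in the description: the F-beta score is the weighted harmonic mean of precision and recall, and beta = 2 weights recall more heavily than precision, which suits a screening task where missing a relevant article costs more than reading an irrelevant one. The sketch below is a minimal illustration of that formula only. It is not the authors' code, the function names are hypothetical, and the counts in the usage example are invented for illustration rather than taken from the article.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Return the F-beta score for a given precision and recall.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    beta > 1 favours recall; beta = 1 gives the usual F1-score.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        # No true positives at all: define the score as 0.
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denominator


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute F-beta from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Hypothetical counts (not from the article): 8 relevant articles found,
    # 8 irrelevant articles flagged, 2 relevant articles missed.
    print(f"F1 = {f_beta_from_counts(8, 8, 2, beta=1.0):.3f}")
    print(f"F2 = {f_beta_from_counts(8, 8, 2, beta=2.0):.3f}")
```

With these invented counts, precision is 0.5 and recall is 0.8, so the F2-score comes out higher than the F1-score because the classifier's stronger side (recall) carries more weight.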