Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature sear...
Main Authors: | Lamurias, Andre; Jesus, Sofia; Neveu, Vanessa; Salek, Reza M.; Couto, Francisco M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2021 |
Subjects: | Research Metrics and Analytics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8417071/ https://www.ncbi.nlm.nih.gov/pubmed/34490412 http://dx.doi.org/10.3389/frma.2021.689264 |
_version_ | 1783748311659839488 |
---|---|
author | Lamurias, Andre; Jesus, Sofia; Neveu, Vanessa; Salek, Reza M.; Couto, Francisco M. |
author_facet | Lamurias, Andre; Jesus, Sofia; Neveu, Vanessa; Salek, Reza M.; Couto, Francisco M. |
author_sort | Lamurias, Andre |
collection | PubMed |
description | Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance based on different datasets made of titles, abstracts and metadata. Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases. |
format | Online Article Text |
id | pubmed-8417071 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-84170712021-09-05 Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer Lamurias, Andre Jesus, Sofia Neveu, Vanessa Salek, Reza M. Couto, Francisco M. Front Res Metr Anal Research Metrics and Analytics Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance based on different datasets made of titles, abstracts and metadata. Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases. Frontiers Media S.A. 
2021-08-19 /pmc/articles/PMC8417071/ /pubmed/34490412 http://dx.doi.org/10.3389/frma.2021.689264 Text en Copyright © 2021 Lamurias, Jesus, Neveu, Salek and Couto. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Research Metrics and Analytics Lamurias, Andre Jesus, Sofia Neveu, Vanessa Salek, Reza M. Couto, Francisco M. Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title | Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title_full | Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title_fullStr | Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title_full_unstemmed | Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title_short | Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer |
title_sort | information retrieval using machine learning for biomarker curation in the exposome-explorer |
topic | Research Metrics and Analytics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8417071/ https://www.ncbi.nlm.nih.gov/pubmed/34490412 http://dx.doi.org/10.3389/frma.2021.689264 |
work_keys_str_mv | AT lamuriasandre informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer AT jesussofia informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer AT neveuvanessa informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer AT salekrezam informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer AT coutofranciscom informationretrievalusingmachinelearningforbiomarkercurationintheexposomeexplorer |
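
The description above reports classifier quality as an F2-score, a variant of the F-measure that weights recall more heavily than precision. That fits a screening task where missing a relevant article costs more than reading an irrelevant one. The following is a minimal sketch of the standard F-beta formula, not the authors' evaluation code; the function name `fbeta_score` and the confusion counts are hypothetical and chosen only for illustration.

```python
def fbeta_score(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from confusion-matrix counts.

    tp: relevant articles correctly flagged as relevant
    fp: irrelevant articles wrongly flagged as relevant
    fn: relevant articles the classifier missed
    beta > 1 weights recall above precision (beta = 2 gives the F2-score).
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    # With no true positives, false positives, or false negatives,
    # the score is undefined; return 0.0 by convention.
    return (1 + b2) * tp / denominator if denominator else 0.0


# Illustrative counts only (assumed, NOT taken from the article).
tp, fp, fn = 78, 120, 22
print(f"F2 = {fbeta_score(tp, fp, fn):.3f}")
print(f"F1 = {fbeta_score(tp, fp, fn, beta=1.0):.3f}")
```

With these assumed counts, recall is higher than precision, so the F2-score comes out above the F1-score; a classifier tuned for screening would typically be selected on F2 for that reason.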