
Discovering research articles containing evolutionary timetrees by machine learning

MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing ti...


Bibliographic Details
Main Authors: Stanojevic, Marija, Andjelkovic, Jovan, Kasprowicz, Adrienne, Huuki, Louise A, Chao, Jennifer, Hedges, S Blair, Kumar, Sudhir, Obradovic, Zoran
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887078/
https://www.ncbi.nlm.nih.gov/pubmed/36648314
http://dx.doi.org/10.1093/bioinformatics/btad035
_version_ 1784880259627220992
author Stanojevic, Marija
Andjelkovic, Jovan
Kasprowicz, Adrienne
Huuki, Louise A
Chao, Jennifer
Hedges, S Blair
Kumar, Sudhir
Obradovic, Zoran
collection PubMed
description MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. RESULTS: We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9887078
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-98870782023-01-31 Discovering research articles containing evolutionary timetrees by machine learning Stanojevic, Marija Andjelkovic, Jovan Kasprowicz, Adrienne Huuki, Louise A Chao, Jennifer Hedges, S Blair Kumar, Sudhir Obradovic, Zoran Bioinformatics Original Paper MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. RESULTS: We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. 
AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2023-01-17 /pmc/articles/PMC9887078/ /pubmed/36648314 http://dx.doi.org/10.1093/bioinformatics/btad035 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Discovering research articles containing evolutionary timetrees by machine learning
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887078/
https://www.ncbi.nlm.nih.gov/pubmed/36648314
http://dx.doi.org/10.1093/bioinformatics/btad035
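
The description field above reports that classification improved (F1 from 0.67 to 0.88) when the model was given article excerpts surrounding figure mentions rather than whole-text articles. This record does not show how TimeTreeFinder builds those excerpts, so the following is only a minimal sketch of the general idea. The token pattern, the window size, and the function name `figure_excerpts` are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Assumed pattern (not from the paper): a whitespace-delimited token that
# opens a figure mention, e.g. "Fig.", "(Fig.", "Figure", "Figs".
FIGURE_TOKEN = re.compile(r"^\(?fig(?:ure)?s?\.?$", re.IGNORECASE)


def figure_excerpts(text: str, window: int = 5) -> list[str]:
    """Return word windows of +/- `window` tokens around each figure mention.

    Windows that overlap or touch are merged so no text is repeated.
    The window size is a hypothetical parameter chosen for illustration.
    """
    tokens = text.split()
    spans: list[tuple[int, int]] = []
    for i, tok in enumerate(tokens):
        if FIGURE_TOKEN.match(tok):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            if spans and start <= spans[-1][1]:
                # Extend the previous window instead of starting a new one.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return [" ".join(tokens[s:e]) for s, e in spans]


if __name__ == "__main__":
    sample = (
        "We sampled forty species. The dated phylogeny is shown in "
        "Fig. 2 with node ages in million years."
    )
    for excerpt in figure_excerpts(sample, window=3):
        print(excerpt)
```

In a pipeline of the kind the abstract describes, excerpts like these would be passed to a fine-tuned text classifier in place of the full article text. That step is omitted here because the record gives no details about it.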