Discovering research articles containing evolutionary timetrees by machine learning
MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing ti...
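Main authors: | Stanojevic, Marija; Andjelkovic, Jovan; Kasprowicz, Adrienne; Huuki, Louise A; Chao, Jennifer; Hedges, S Blair; Kumar, Sudhir; Obradovic, Zoran |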
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887078/ https://www.ncbi.nlm.nih.gov/pubmed/36648314 http://dx.doi.org/10.1093/bioinformatics/btad035 |
_version_ | 1784880259627220992 |
---|---|
author | Stanojevic, Marija Andjelkovic, Jovan Kasprowicz, Adrienne Huuki, Louise A Chao, Jennifer Hedges, S Blair Kumar, Sudhir Obradovic, Zoran |
author_sort | Stanojevic, Marija |
collection | PubMed |
description | MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. RESULTS: We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9887078 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9887078 2023-01-31 Discovering research articles containing evolutionary timetrees by machine learning Stanojevic, Marija Andjelkovic, Jovan Kasprowicz, Adrienne Huuki, Louise A Chao, Jennifer Hedges, S Blair Kumar, Sudhir Obradovic, Zoran Bioinformatics Original Paper MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. RESULTS: We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2023-01-17 /pmc/articles/PMC9887078/ /pubmed/36648314 http://dx.doi.org/10.1093/bioinformatics/btad035 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Discovering research articles containing evolutionary timetrees by machine learning |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887078/ https://www.ncbi.nlm.nih.gov/pubmed/36648314 http://dx.doi.org/10.1093/bioinformatics/btad035 |
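
The description above reports that classifying excerpts surrounding figure mentions, rather than whole article text, raised the F1 score from 0.67 to 0.88. The excerpt-extraction idea might be sketched as follows. This is only an illustration under stated assumptions: the regular expression, the character-based window, and the function name `figure_excerpts` are hypothetical and are not taken from the TimeTreeFinder code, which may select excerpts differently.

```python
import re

# Assumed pattern (not from the paper): matches "Figure 2", "Fig. 3", "Figs 4".
FIGURE_MENTION = re.compile(r"\bfig(?:ure)?s?\.?\s*\d+", re.IGNORECASE)


def figure_excerpts(text: str, window: int = 200) -> list[str]:
    """Return text windows around each figure mention, merging overlaps.

    `window` is the number of characters kept on each side of a mention;
    the value is an illustrative default, not a tuned setting.
    """
    spans: list[tuple[int, int]] = []
    for match in FIGURE_MENTION.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if spans and start <= spans[-1][1]:
            # Overlaps the previous window: extend it instead of duplicating text.
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return [text[s:e] for s, e in spans]
```

The resulting excerpts, rather than the full article, would then be passed to a text classifier. Merging overlapping windows keeps nearby figure mentions in one excerpt and avoids feeding the same sentence to the classifier twice.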