
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 mi...


Bibliographic Details
Main Authors: Westergaard, David, Stærfeldt, Hans-Henrik, Tønsberg, Christian, Jensen, Lars Juhl, Brunak, Søren
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831415/
https://www.ncbi.nlm.nih.gov/pubmed/29447159
http://dx.doi.org/10.1371/journal.pcbi.1005962
_version_ 1783303158938730496
author Westergaard, David
Stærfeldt, Hans-Henrik
Tønsberg, Christian
Jensen, Lars Juhl
Brunak, Søren
author_facet Westergaard, David
Stærfeldt, Hans-Henrik
Tønsberg, Christian
Jensen, Lars Juhl
Brunak, Søren
author_sort Westergaard, David
collection PubMed
description Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
format Online
Article
Text
id pubmed-5831415
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-58314152018-03-15 A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts Westergaard, David Stærfeldt, Hans-Henrik Tønsberg, Christian Jensen, Lars Juhl Brunak, Søren PLoS Comput Biol Research Article Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. Public Library of Science 2018-02-15 /pmc/articles/PMC5831415/ /pubmed/29447159 http://dx.doi.org/10.1371/journal.pcbi.1005962 Text en © 2018 Westergaard et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Westergaard, David
Stærfeldt, Hans-Henrik
Tønsberg, Christian
Jensen, Lars Juhl
Brunak, Søren
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts
title A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5831415/
https://www.ncbi.nlm.nih.gov/pubmed/29447159
http://dx.doi.org/10.1371/journal.pcbi.1005962