
Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature

To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC...


Bibliographic Details
Main Authors:	Beck, Tim; Shorter, Tom; Hu, Yan; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; McQuibban, Nicholas A. R.; Makraduli, Filip; Yeung, Cheng S.; Rowlands, Thomas; Posma, Joram M.
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Digital Health
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8885717/ https://www.ncbi.nlm.nih.gov/pubmed/35243479 http://dx.doi.org/10.3389/fdgth.2022.788124

author	Beck, Tim; Shorter, Tom; Hu, Yan; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; McQuibban, Nicholas A. R.; Makraduli, Filip; Yeung, Cheng S.; Rowlands, Thomas; Posma, Joram M.
collection	PubMed
description	To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.
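The abbreviation output described in the abstract can be illustrated with a minimal Python sketch. This is not the Auto-CORPus implementation or its JSON schema: the function name find_abbreviations, the initial-letter matching heuristic, and the flat short-form-to-long-form mapping are all assumptions made for illustration. It only handles the common "long form (SHORT)" pattern, where the initials of the preceding words spell the short form.

```python
import json
import re

# A parenthesised candidate short form such as "(NLP)": 2 to 10 characters,
# starting with a letter, with no spaces inside the parentheses.
_PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def find_abbreviations(text):
    """Return {short_form: long_form} for 'long form (SHORT)' declarations.

    Hypothetical heuristic: the long form is the run of words immediately
    before the parenthesis whose initials spell the short form
    (case-insensitive). Anything else is ignored.
    """
    found = {}
    for match in _PAREN.finditer(text):
        short = match.group(1)
        letters = [c.lower() for c in short if c.isalnum()]
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(letters):]
        if len(candidate) == len(letters) and all(
            word[0].lower() == letter
            for word, letter in zip(candidate, letters)
        ):
            found[short] = " ".join(candidate)
    return found


if __name__ == "__main__":
    sentence = (
        "To analyse large corpora using machine learning and other "
        "Natural Language Processing (NLP) algorithms, the corpora "
        "need to be standardized."
    )
    # Serialize the mapping as JSON, one possible shape for such an output.
    print(json.dumps(find_abbreviations(sentence), indent=2))
```

Applied to the first sentence of the abstract, the sketch pairs NLP with "Natural Language Processing". A production tool would need to handle many more declaration styles, for example short forms that skip function words or use interior letters.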
format	Online Article Text
id	pubmed-8885717
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8885717 2022-03-02 Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature. Beck, Tim; Shorter, Tom; Hu, Yan; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; McQuibban, Nicholas A. R.; Makraduli, Filip; Yeung, Cheng S.; Rowlands, Thomas; Posma, Joram M. Front Digit Health. Digital Health. Frontiers Media S.A. 2022-02-15 /pmc/articles/PMC8885717/ /pubmed/35243479 http://dx.doi.org/10.3389/fdgth.2022.788124 Text en Copyright © 2022 Beck, Shorter, Hu, Li, Sun, Popovici, McQuibban, Makraduli, Yeung, Rowlands and Posma. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature
topic	Digital Health
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8885717/ https://www.ncbi.nlm.nih.gov/pubmed/35243479 http://dx.doi.org/10.3389/fdgth.2022.788124
