
PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by softw...


Bibliographic Details
Main Authors:	Zhu, Miao; Cole, Jacqueline M.
Format:	Online Article Text
Language:	English
Published:	American Chemical Society 2022
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9049592/ https://www.ncbi.nlm.nih.gov/pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198

author	Zhu, Miao; Cole, Jacqueline M.
collection	PubMed
description	The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
format	Online Article Text
id	pubmed-9049592
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	American Chemical Society
record_format	MEDLINE/PubMed
spelling	pubmed-9049592 2022-04-29 PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format Zhu, Miao Cole, Jacqueline M. J Chem Inf Model American Chemical Society 2022-03-29 2022-04-11 /pmc/articles/PMC9049592/ /pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198 Text en © 2022 American Chemical Society https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).
title	PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9049592/ https://www.ncbi.nlm.nih.gov/pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198

Similar Items