PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format
The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by softw...
Main authors: | Zhu, Miao; Cole, Jacqueline M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Chemical Society, 2022 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9049592/ https://www.ncbi.nlm.nih.gov/pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198 |
_version_ | 1784696172897632256 |
---|---|
author | Zhu, Miao; Cole, Jacqueline M. |
collection | PubMed |
description | The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical named-entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), references, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text. |
format | Online Article Text |
id | pubmed-9049592 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | American Chemical Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-9049592 2022-04-29 PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format Zhu, Miao Cole, Jacqueline M. J Chem Inf Model American Chemical Society 2022-03-29 2022-04-11 /pmc/articles/PMC9049592/ /pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198 Text en © 2022 American Chemical Society https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/). |
title | PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9049592/ https://www.ncbi.nlm.nih.gov/pubmed/35349259 http://dx.doi.org/10.1021/acs.jcim.1c01198 |
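
The abstract states that PDFDataExtractor outputs metadata such as title, authors, journal, year, and DOI in JSON. As a rough illustration only, the sketch below shows what one such record could look like when serialized with Python's standard library. The class name `ArticleMetadata`, the helper `to_json`, and the field names are hypothetical and chosen for this example; they are not taken from the tool's actual schema or API. The values are those given in this catalog record.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ArticleMetadata:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    title: str
    authors: List[str]
    journal: str
    year: int
    doi: str
    keywords: List[str] = field(default_factory=list)  # empty when none are known


def to_json(record: ArticleMetadata) -> str:
    """Serialize the record to an indented JSON string."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


# Values copied from this catalog record.
record = ArticleMetadata(
    title=(
        "PDFDataExtractor: A Tool for Reading Scientific Text and "
        "Interpreting Metadata from the Typeset Literature in the "
        "Portable Document Format"
    ),
    authors=["Zhu, Miao", "Cole, Jacqueline M."],
    journal="J Chem Inf Model",
    year=2022,
    doi="10.1021/acs.jcim.1c01198",
)

print(to_json(record))
```

Keeping authors as a list rather than one concatenated string avoids the ambiguity of run-together names such as "Zhu, Miao Cole, Jacqueline M." when the record is parsed downstream.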