OpCitance: Citation contexts identified from the PubMed Central open access articles

OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing...

Bibliographic Details
Main authors: Hsiao, Tzu-Kun, Torvik, Vetle I.
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10139909/
https://www.ncbi.nlm.nih.gov/pubmed/37117220
http://dx.doi.org/10.1038/s41597-023-02134-x
_version_ 1785033049743818752
author Hsiao, Tzu-Kun
Torvik, Vetle I.
author_facet Hsiao, Tzu-Kun
Torvik, Vetle I.
author_sort Hsiao, Tzu-Kun
collection PubMed
description OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing styles. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). The PubMed IDs (PMIDs) linked to inline citations in the XML files differed from the citations harvested using the NCBI E-Utilities for 70.96% of the articles. An in-house citation matcher, called Patci, was used to supplement and correct 6.84% of the referenced PMIDs. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but OpCitance has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly.
format Online
Article
Text
id pubmed-10139909
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10139909 2023-04-30 OpCitance: Citation contexts identified from the PubMed Central open access articles Hsiao, Tzu-Kun Torvik, Vetle I. Sci Data Data Descriptor OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing styles. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). The PubMed IDs (PMIDs) linked to inline citations in the XML files differed from the citations harvested using the NCBI E-Utilities for 70.96% of the articles. An in-house citation matcher, called Patci, was used to supplement and correct 6.84% of the referenced PMIDs. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but OpCitance has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly. Nature Publishing Group UK 2023-04-28 /pmc/articles/PMC10139909/ /pubmed/37117220 http://dx.doi.org/10.1038/s41597-023-02134-x Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.
spellingShingle Data Descriptor
Hsiao, Tzu-Kun
Torvik, Vetle I.
OpCitance: Citation contexts identified from the PubMed Central open access articles
title OpCitance: Citation contexts identified from the PubMed Central open access articles
title_full OpCitance: Citation contexts identified from the PubMed Central open access articles
title_fullStr OpCitance: Citation contexts identified from the PubMed Central open access articles
title_full_unstemmed OpCitance: Citation contexts identified from the PubMed Central open access articles
title_short OpCitance: Citation contexts identified from the PubMed Central open access articles
title_sort opcitance: citation contexts identified from the pubmed central open access articles
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10139909/
https://www.ncbi.nlm.nih.gov/pubmed/37117220
http://dx.doi.org/10.1038/s41597-023-02134-x
work_keys_str_mv AT hsiaotzukun opcitancecitationcontextsidentifiedfromthepubmedcentralopenaccessarticles
AT torvikvetlei opcitancecitationcontextsidentifiedfromthepubmedcentralopenaccessarticles