OpCitance: Citation contexts identified from the PubMed Central open access articles
OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing...
Main Authors: | Hsiao, Tzu-Kun, Torvik, Vetle I. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | Data Descriptor |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10139909/ https://www.ncbi.nlm.nih.gov/pubmed/37117220 http://dx.doi.org/10.1038/s41597-023-02134-x |
_version_ | 1785033049743818752 |
---|---|
author | Hsiao, Tzu-Kun Torvik, Vetle I. |
author_facet | Hsiao, Tzu-Kun Torvik, Vetle I. |
author_sort | Hsiao, Tzu-Kun |
collection | PubMed |
description | OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing style. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). The PubMed IDs (PMIDs) linked to inline citations in the XML files differed from the citations harvested using the NCBI E-Utilities for 70.96% of the articles. Using an in-house citation matcher called Patci, 6.84% of the referenced PMIDs were supplemented and corrected. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but it has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly. |
format | Online Article Text |
id | pubmed-10139909 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-101399092023-04-30 OpCitance: Citation contexts identified from the PubMed Central open access articles Hsiao, Tzu-Kun Torvik, Vetle I. Sci Data Data Descriptor OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing style. Only 0.5% citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). PubMed IDs (PMIDs) linked to inline citations in the XML files compared to citations harvested using the NCBI E-Utilities differed for 70.96% of the articles. Using an in-house citation matcher, called Patci, 6.84% of the referenced PMIDs were supplemented and corrected. OpCitance includes fewer total number of articles than the Semantic Scholar Open Research Corpus, but OpCitance has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly. Nature Publishing Group UK 2023-04-28 /pmc/articles/PMC10139909/ /pubmed/37117220 http://dx.doi.org/10.1038/s41597-023-02134-x Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Data Descriptor Hsiao, Tzu-Kun Torvik, Vetle I. OpCitance: Citation contexts identified from the PubMed Central open access articles |
title | OpCitance: Citation contexts identified from the PubMed Central open access articles |
title_full | OpCitance: Citation contexts identified from the PubMed Central open access articles |
title_fullStr | OpCitance: Citation contexts identified from the PubMed Central open access articles |
title_full_unstemmed | OpCitance: Citation contexts identified from the PubMed Central open access articles |
title_short | OpCitance: Citation contexts identified from the PubMed Central open access articles |
title_sort | opcitance: citation contexts identified from the pubmed central open access articles |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10139909/ https://www.ncbi.nlm.nih.gov/pubmed/37117220 http://dx.doi.org/10.1038/s41597-023-02134-x |
work_keys_str_mv | AT hsiaotzukun opcitancecitationcontextsidentifiedfromthepubmedcentralopenaccessarticles AT torvikvetlei opcitancecitationcontextsidentifiedfromthepubmedcentralopenaccessarticles |