Finding citations for PubMed: a large-scale comparison between five freely available bibliographic data sources

Bibliographic Details
Main Authors: Liang, Zhentao; Mao, Jin; Lu, Kun; Li, Gang
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8542188/
https://www.ncbi.nlm.nih.gov/pubmed/34720252
http://dx.doi.org/10.1007/s11192-021-04191-8
author Liang, Zhentao
Mao, Jin
Lu, Kun
Li, Gang
author_sort Liang, Zhentao
collection PubMed
description As an important biomedical database, PubMed provides users with free access to abstracts of its documents. However, citations between these documents need to be collected from external data sources. Although previous studies have investigated the coverage of various data sources, the quality of citations is underexplored. In response, this study compares the coverage and citation quality of five freely available data sources on 30 million PubMed documents, including OpenCitations Index of CrossRef open DOI-to-DOI citations (COCI), Dimensions, Microsoft Academic Graph (MAG), National Institutes of Health’s Open Citation Collection (NIH-OCC), and Semantic Scholar Open Research Corpus (S2ORC). Three gold standards and five metrics are introduced to evaluate the correctness and completeness of citations. Our results indicate that Dimensions is the most comprehensive data source that provides references for 62.4% of PubMed documents, outperforming the official NIH-OCC dataset (56.7%). Over 90% of citation links in other data sources can also be found in Dimensions. The coverage of MAG, COCI, and S2ORC is 59.6%, 34.7%, and 23.5%, respectively. Regarding the citation quality, Dimensions and NIH-OCC achieve the best overall results. Almost all data sources have a precision higher than 90%, but their recall is much lower. All databases have better performances on recent publications than earlier ones. Meanwhile, the gaps between different data sources have diminished for the documents published in recent years. This study provides evidence for researchers to choose suitable PubMed citation sources, which is also helpful for evaluating the citation quality of free bibliographic databases.
format Online
Article
Text
id pubmed-8542188
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-8542188 2021-10-25 Scientometrics Article
Springer International Publishing 2021-10-24 2021 /pmc/articles/PMC8542188/ /pubmed/34720252 http://dx.doi.org/10.1007/s11192-021-04191-8 Text en © Akadémiai Kiadó, Budapest, Hungary 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Finding citations for PubMed: a large-scale comparison between five freely available bibliographic data sources
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8542188/
https://www.ncbi.nlm.nih.gov/pubmed/34720252
http://dx.doi.org/10.1007/s11192-021-04191-8