Rule-based deduplication of article records from bibliographic databases
We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these d...
Main authors: | Jiang, Yu; Lin, Can; Meng, Weiyi; Yu, Clement; Cohen, Aaron M.; Smalheiser, Neil R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3893659/ https://www.ncbi.nlm.nih.gov/pubmed/24434031 http://dx.doi.org/10.1093/database/bat086 |
_version_ | 1782299735150821376 |
---|---|
author | Jiang, Yu; Lin, Can; Meng, Weiyi; Yu, Clement; Cohen, Aaron M.; Smalheiser, Neil R. |
author_facet | Jiang, Yu; Lin, Can; Meng, Weiyi; Yu, Clement; Cohen, Aaron M.; Smalheiser, Neil R. |
author_sort | Jiang, Yu |
collection | PubMed |
description | We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments. |
format | Online Article Text |
id | pubmed-3893659 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-3893659 2014-01-16 Rule-based deduplication of article records from bibliographic databases Jiang, Yu Lin, Can Meng, Weiyi Yu, Clement Cohen, Aaron M. Smalheiser, Neil R. Database (Oxford) Original Article We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments. Oxford University Press 2014-01-16 /pmc/articles/PMC3893659/ /pubmed/24434031 http://dx.doi.org/10.1093/database/bat086 Text en © The Author(s) 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Jiang, Yu Lin, Can Meng, Weiyi Yu, Clement Cohen, Aaron M. Smalheiser, Neil R. Rule-based deduplication of article records from bibliographic databases |
title | Rule-based deduplication of article records from bibliographic databases |
title_full | Rule-based deduplication of article records from bibliographic databases |
title_fullStr | Rule-based deduplication of article records from bibliographic databases |
title_full_unstemmed | Rule-based deduplication of article records from bibliographic databases |
title_short | Rule-based deduplication of article records from bibliographic databases |
title_sort | rule-based deduplication of article records from bibliographic databases |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3893659/ https://www.ncbi.nlm.nih.gov/pubmed/24434031 http://dx.doi.org/10.1093/database/bat086 |
work_keys_str_mv | AT jiangyu rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases AT lincan rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases AT mengweiyi rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases AT yuclement rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases AT cohenaaronm rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases AT smalheiserneilr rulebaseddeduplicationofarticlerecordsfrombibliographicdatabases |
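
The description in this record outlines the comparison order: records sharing a publication year are checked sequentially on PubMed ID, digital object identifier, journal name, article title and author list, using approximate text matching. The Python sketch below is one possible reading of that outline and is not the authors' published module. The record field names (`pmid`, `doi`, `journal`, `title`, `authors`, `year`), the similarity thresholds, and the use of `difflib.SequenceMatcher` as the text approximation technique are all assumptions made for illustration. In keeping with the recall-first goal stated in the description, missing fields are never counted as evidence of a match.

```python
from collections import defaultdict
from difflib import SequenceMatcher
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace before comparing."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def similar(a, b, threshold):
    """True only if both strings are non-empty and similar enough."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False  # a missing field is never evidence of a match
    return SequenceMatcher(None, a, b).ratio() >= threshold


def is_duplicate(r1, r2, title_thr=0.9, journal_thr=0.8, author_thr=0.8):
    """Compare two records sequentially; thresholds are illustrative guesses."""
    # Step 1: an identifier present in both records decides the outcome.
    for key in ("pmid", "doi"):
        v1 = str(r1.get(key) or "").strip().lower()
        v2 = str(r2.get(key) or "").strip().lower()
        if v1 and v2:
            return v1 == v2
    # Step 2: otherwise require journal, title and authors to all agree.
    return (
        similar(r1.get("journal"), r2.get("journal"), journal_thr)
        and similar(r1.get("title"), r2.get("title"), title_thr)
        and similar(r1.get("authors"), r2.get("authors"), author_thr)
    )


def deduplicate(records):
    """Keep the first record of each duplicate group, comparing within a year."""
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec.get("year")].append(rec)

    kept = []
    for year, bucket in by_year.items():
        if year is None:
            kept.extend(bucket)  # no year: keep everything rather than guess
            continue
        unique = []
        for rec in bucket:
            if not any(is_duplicate(rec, seen) for seen in unique):
                unique.append(rec)
        kept.extend(unique)
    return kept
```

Calling `deduplicate(records)` on a list of dictionaries returns the retained records. Because journal names are often abbreviated differently across databases, a production version would likely need a journal-name lookup table instead of the plain string ratio assumed here.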