
Tools for loading MEDLINE into a local relational database

BACKGROUND: Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files,...

Bibliographic Details
Main Authors: Oliver, Diane E, Bhalotia, Gaurav, Schwartz, Ariel S, Altman, Russ B, Hearst, Marti A
Format: Text
Language: English
Published: BioMed Central 2004
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC524480/
https://www.ncbi.nlm.nih.gov/pubmed/15471541
http://dx.doi.org/10.1186/1471-2105-5-146
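
The abstract of this record describes parsing MEDLINE XML files and loading their contents into a relational database. The following is a minimal sketch of that general idea, not the authors' software: it uses Python rather than the Java or Perl named in the abstract, and SQLite as a stand-in for IBM DB2 or Oracle 9i. The sample XML is a hand-written, heavily simplified fragment, the second citation (PMID 99999999) is a hypothetical placeholder, and the one-table `citation` schema and the function names are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; real MEDLINE files carry many more elements.
# The second citation (PMID 99999999) is a hypothetical placeholder.
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>15471541</PMID>
    <Article>
      <ArticleTitle>Tools for loading MEDLINE into a local relational database</ArticleTitle>
      <Abstract>
        <AbstractText>Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>99999999</PMID>
    <Article>
      <ArticleTitle>Placeholder citation without an abstract</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def parse_citations(xml_text):
    """Yield (pmid, title, abstract) tuples; abstract is None when absent."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle")
        abstract = citation.findtext("Article/Abstract/AbstractText")
        yield pmid, title, abstract


def load_citations(conn, rows):
    """Create the illustrative one-table schema and insert the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT NOT NULL, abstract TEXT)"
    )
    with conn:  # commit all inserts as a single transaction
        conn.executemany(
            "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
load_citations(conn, parse_citations(SAMPLE_XML))

# Once loaded, the data can be queried with ordinary SQL.
count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
with_abstract = conn.execute(
    "SELECT pmid FROM citation WHERE abstract LIKE ?", ("%text mining%",)
).fetchall()
print(count, with_abstract)
```

Storing the abstract as a nullable text column is one simple choice; the abstract notes that the reported installations differed in whether abstracts were stored as character large objects or truncated, which affected disk-space utilization.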
_version_ 1782121901366181888
author Oliver, Diane E
Bhalotia, Gaurav
Schwartz, Ariel S
Altman, Russ B
Hearst, Marti A
author_facet Oliver, Diane E
Bhalotia, Gaurav
Schwartz, Ariel S
Altman, Russ B
Hearst, Marti A
author_sort Oliver, Diane E
collection PubMed
description BACKGROUND: Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers.
RESULTS: We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated.
CONCLUSIONS: Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE. RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at .
format Text
id pubmed-524480
institution National Center for Biotechnology Information
language English
publishDate 2004
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-524480 2004-10-31 Tools for loading MEDLINE into a local relational database Oliver, Diane E Bhalotia, Gaurav Schwartz, Ariel S Altman, Russ B Hearst, Marti A BMC Bioinformatics Software BioMed Central 2004-10-07 /pmc/articles/PMC524480/ /pubmed/15471541 http://dx.doi.org/10.1186/1471-2105-5-146 Text en Copyright © 2004 Oliver et al; licensee BioMed Central Ltd.
spellingShingle Software
Oliver, Diane E
Bhalotia, Gaurav
Schwartz, Ariel S
Altman, Russ B
Hearst, Marti A
Tools for loading MEDLINE into a local relational database
title Tools for loading MEDLINE into a local relational database
title_full Tools for loading MEDLINE into a local relational database
title_fullStr Tools for loading MEDLINE into a local relational database
title_full_unstemmed Tools for loading MEDLINE into a local relational database
title_short Tools for loading MEDLINE into a local relational database
title_sort tools for loading medline into a local relational database
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC524480/
https://www.ncbi.nlm.nih.gov/pubmed/15471541
http://dx.doi.org/10.1186/1471-2105-5-146
work_keys_str_mv AT oliverdianee toolsforloadingmedlineintoalocalrelationaldatabase
AT bhalotiagaurav toolsforloadingmedlineintoalocalrelationaldatabase
AT schwartzariels toolsforloadingmedlineintoalocalrelationaldatabase
AT altmanrussb toolsforloadingmedlineintoalocalrelationaldatabase
AT hearstmartia toolsforloadingmedlineintoalocalrelationaldatabase