Curation accuracy of model organism databases

Manual extraction of information from the biomedical literature—or biocuration—is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. B...

Bibliographic Details
Main Authors: Keseler, Ingrid M., Skrzypek, Marek, Weerasinghe, Deepika, Chen, Albert Y., Fulcher, Carol, Li, Gene-Wei, Lemmer, Kimberly C., Mladinich, Katherine M., Chow, Edmond D., Sherlock, Gavin, Karp, Peter D.
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207230/
https://www.ncbi.nlm.nih.gov/pubmed/24923819
http://dx.doi.org/10.1093/database/bau058
_version_ 1782340938772774912
author Keseler, Ingrid M.
Skrzypek, Marek
Weerasinghe, Deepika
Chen, Albert Y.
Fulcher, Carol
Li, Gene-Wei
Lemmer, Kimberly C.
Mladinich, Katherine M.
Chow, Edmond D.
Sherlock, Gavin
Karp, Peter D.
author_facet Keseler, Ingrid M.
Skrzypek, Marek
Weerasinghe, Deepika
Chen, Albert Y.
Fulcher, Carol
Li, Gene-Wei
Lemmer, Kimberly C.
Mladinich, Katherine M.
Chow, Edmond D.
Sherlock, Gavin
Karp, Peter D.
author_sort Keseler, Ingrid M.
collection PubMed
description Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then validating that the data found in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by PhD-level scientists is highly accurate. Database URL: http://ecocyc.org/, http://www.candidagenome.org//
format Online
Article
Text
id pubmed-4207230
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4207230 2014-10-28 Curation accuracy of model organism databases Keseler, Ingrid M. Skrzypek, Marek Weerasinghe, Deepika Chen, Albert Y. Fulcher, Carol Li, Gene-Wei Lemmer, Kimberly C. Mladinich, Katherine M. Chow, Edmond D. Sherlock, Gavin Karp, Peter D. Database (Oxford) Original Article Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then validating that the data found in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by PhD-level scientists is highly accurate. Database URL: http://ecocyc.org/, http://www.candidagenome.org// Oxford University Press 2014-06-12 /pmc/articles/PMC4207230/ /pubmed/24923819 http://dx.doi.org/10.1093/database/bau058 Text en © The Author(s) 2014. Published by Oxford University Press.
http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Keseler, Ingrid M.
Skrzypek, Marek
Weerasinghe, Deepika
Chen, Albert Y.
Fulcher, Carol
Li, Gene-Wei
Lemmer, Kimberly C.
Mladinich, Katherine M.
Chow, Edmond D.
Sherlock, Gavin
Karp, Peter D.
Curation accuracy of model organism databases
title Curation accuracy of model organism databases
title_full Curation accuracy of model organism databases
title_fullStr Curation accuracy of model organism databases
title_full_unstemmed Curation accuracy of model organism databases
title_short Curation accuracy of model organism databases
title_sort curation accuracy of model organism databases
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4207230/
https://www.ncbi.nlm.nih.gov/pubmed/24923819
http://dx.doi.org/10.1093/database/bau058
work_keys_str_mv AT keseleringridm curationaccuracyofmodelorganismdatabases
AT skrzypekmarek curationaccuracyofmodelorganismdatabases
AT weerasinghedeepika curationaccuracyofmodelorganismdatabases
AT chenalberty curationaccuracyofmodelorganismdatabases
AT fulchercarol curationaccuracyofmodelorganismdatabases
AT ligenewei curationaccuracyofmodelorganismdatabases
AT lemmerkimberlyc curationaccuracyofmodelorganismdatabases
AT mladinichkatherinem curationaccuracyofmodelorganismdatabases
AT chowedmondd curationaccuracyofmodelorganismdatabases
AT sherlockgavin curationaccuracyofmodelorganismdatabases
AT karppeterd curationaccuracyofmodelorganismdatabases