
An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the expone...


Bibliographic Details
Main Authors: Bell, Michael J., Gillespie, Colin S., Swan, Daniel, Lord, Phillip
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects: Original Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436799/
https://www.ncbi.nlm.nih.gov/pubmed/22962482
http://dx.doi.org/10.1093/bioinformatics/bts372
_version_ 1782242700262637568
author Bell, Michael J.
Gillespie, Colin S.
Swan, Daniel
Lord, Phillip
collection PubMed
description Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations. Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Availability: Source code is available at the authors website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation. Contact: phillip.lord@newcastle.ac.uk
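As a rough illustration of the kind of analysis the abstract describes, the sketch below counts how often each word occurs in a block of annotation text and estimates a power-law exponent for those counts. It is not the authors' code, which the record says is available from their website. The function names, the tokenisation rule, the `xmin` cutoff and the choice of the approximate discrete maximum-likelihood estimator are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count occurrences of each lower-cased alphabetic token.

    The tokenisation rule (runs of a-z only) is an assumption, not
    something specified in the record.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def estimate_alpha(counts, xmin=1):
    """Approximate discrete MLE for alpha in p(x) ~ x^(-alpha), x >= xmin.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))). This approximation
    is less accurate for small xmin, so treat the result as indicative.
    """
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    xs = [c for c in counts if c >= xmin]
    if not xs:
        raise ValueError("no counts at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in xs)
    return 1.0 + len(xs) / log_sum


# Hypothetical annotation text, used only to show the call pattern.
sample = "binds atp binds dna protein kinase binds atp"
alpha = estimate_alpha(word_counts(sample).values())
```

Comparing the estimated exponent across database releases, or between manually curated and automatically generated annotation sets, is the sort of comparison the abstract outlines. A serious analysis would also need to choose `xmin` systematically and test the goodness of fit.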
format Online
Article
Text
id pubmed-3436799
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3436799 2012-12-12 An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB Bell, Michael J. Gillespie, Colin S. Swan, Daniel Lord, Phillip Bioinformatics Original Papers Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436799/ /pubmed/22962482 http://dx.doi.org/10.1093/bioinformatics/bts372 Text en © The Author(s) (2012). Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436799/
https://www.ncbi.nlm.nih.gov/pubmed/22962482
http://dx.doi.org/10.1093/bioinformatics/bts372