
The language of gene ontology: a Zipf’s law analysis

BACKGROUND: Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf’s law,...
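Zipf's law says that the frequency of a term falls off roughly as a power of its frequency rank, f(r) ∝ r^(-α), where α is the exponent the abstract refers to. The following is a minimal sketch of one simple way to estimate α from a list of annotation terms, using an ordinary least-squares line fitted to log(frequency) against log(rank). The record does not state which fitting procedure the authors used, so this estimator, the function name `zipf_exponent`, and the GO identifiers in the sample corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def zipf_exponent(terms):
    """Estimate the Zipf exponent of a sequence of terms.

    Fits a straight line to log(frequency) versus log(rank) by ordinary
    least squares and returns the negated slope. This is an illustrative
    estimator, not necessarily the one used in the article.
    """
    # Frequencies sorted from most to least common; rank 1 is the most common.
    counts = sorted(Counter(terms).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct terms to fit a line")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    return -slope


# Hypothetical corpus: five made-up GO identifiers whose counts are 120 / rank,
# i.e. an idealised Zipf distribution with exponent 1.
sample_counts = {
    "GO:0000001": 120,
    "GO:0000002": 60,
    "GO:0000003": 40,
    "GO:0000004": 30,
    "GO:0000005": 24,
}
corpus = [term for term, n in sample_counts.items() for _ in range(n)]
alpha = zipf_exponent(corpus)
```

A log-log least-squares fit is easy to read but is known to be sensitive to the sparse low-frequency tail, so it should be treated as a first look rather than a definitive measurement. Running the same function on subsets of a real annotation corpus (for example, one sub-ontology at a time, or one group of evidence codes at a time) would give exponents that can be compared in the way the abstract describes.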


Bibliographic Details
Main Authors: Kalankesh, Leila Ranandeh, Stevens, Robert, Brass, Andy
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3473240/
https://www.ncbi.nlm.nih.gov/pubmed/22676436
http://dx.doi.org/10.1186/1471-2105-13-127
author Kalankesh, Leila Ranandeh
Stevens, Robert
Brass, Andy
collection PubMed
description BACKGROUND: Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf’s law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. RESULTS: Annotations from the Gene Ontology Annotation project were found to follow Zipf’s law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. CONCLUSIONS: Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
format Online
Article
Text
id pubmed-3473240
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3473240 2012-10-23 The language of gene ontology: a Zipf’s law analysis. Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy. BMC Bioinformatics. Methodology Article. BioMed Central 2012-06-07 /pmc/articles/PMC3473240/ /pubmed/22676436 http://dx.doi.org/10.1186/1471-2105-13-127 Text en Copyright ©2012 Kalankesh et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The language of gene ontology: a Zipf’s law analysis
topic Methodology Article