
Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production

Abstract. Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex re...


Bibliographic Details
Main Authors:	Cui, Hong; Macklin, James A.; Sachs, Joel; Reznicek, Anton; Starr, Julian; Ford, Bruce; Penev, Lyubomir; Chen, Hsin-Liang
Format:	Online Article Text
Language:	English
Published:	Pensoft Publishers 2018
Subjects:	Forum Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6235995/ https://www.ncbi.nlm.nih.gov/pubmed/30473620 http://dx.doi.org/10.3897/BDJ.6.e29616

_version_	1783370949444239360
author	Cui, Hong; Macklin, James A.; Sachs, Joel; Reznicek, Anton; Starr, Julian; Ford, Bruce; Penev, Lyubomir; Chen, Hsin-Liang
author_facet	Cui, Hong; Macklin, James A.; Sachs, Joel; Reznicek, Anton; Starr, Julian; Ford, Bruce; Penev, Lyubomir; Chen, Hsin-Liang
author_sort	Cui, Hong
collection	PubMed
description	Abstract. Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project’s semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project on this idea was funded by NSF ABI in July 2017. We seek readers' input or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
format	Online Article Text
id	pubmed-6235995
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Pensoft Publishers
record_format	MEDLINE/PubMed
spelling	pubmed-6235995 2018-11-23 Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production Cui, Hong Macklin, James A. Sachs, Joel Reznicek, Anton Starr, Julian Ford, Bruce Penev, Lyubomir Chen, Hsin-Liang Biodivers Data J Forum Paper Abstract. Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project’s semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project on this idea was funded by NSF ABI in July 2017. We seek readers' input or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production. Pensoft Publishers 2018-11-07 /pmc/articles/PMC6235995/ /pubmed/30473620 http://dx.doi.org/10.3897/BDJ.6.e29616 Text en https://creativecommons.org/share-your-work/public-domain/cc0/ This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
spellingShingle	Forum Paper Cui, Hong Macklin, James A. Sachs, Joel Reznicek, Anton Starr, Julian Ford, Bruce Penev, Lyubomir Chen, Hsin-Liang Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title	Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title_full	Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title_fullStr	Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title_full_unstemmed	Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title_short	Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production
title_sort	incentivising use of structured language in biological descriptions: author-driven phenotype data and ontology production
topic	Forum Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6235995/ https://www.ncbi.nlm.nih.gov/pubmed/30473620 http://dx.doi.org/10.3897/BDJ.6.e29616
