SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data

There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, whic...

Bibliographic Details
Main Authors:	Pang, Chao; Sollie, Annet; Sijtsma, Anna; Hendriksen, Dennis; Charbon, Bart; de Haan, Mark; de Boer, Tommy; Kelpin, Fleur; Jetten, Jonathan; van der Velde, Joeri K.; Smidt, Nynke; Sijmons, Rolf; Hillege, Hans; Swertz, Morris A.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2015
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4574036/ https://www.ncbi.nlm.nih.gov/pubmed/26385205 http://dx.doi.org/10.1093/database/bav089

author	Pang, Chao; Sollie, Annet; Sijtsma, Anna; Hendriksen, Dennis; Charbon, Bart; de Haan, Mark; de Boer, Tommy; Kelpin, Fleur; Jetten, Jonathan; van der Velde, Joeri K.; Smidt, Nynke; Sijmons, Rolf; Hillege, Hans; Swertz, Morris A.
author_sort	Pang, Chao
collection	PubMed
description	There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Diseases) and HPO (Human Phenotype Ontology). This data curation is usually a time-consuming process performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon-delimited format) to a target coding system (uploaded as an Excel spreadsheet or in OWL (Web Ontology Language) or OBO (Open Biomedical Ontologies) format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram-based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA’s applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a precision/recall of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects. 
Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA
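
The n-gram-based shortlisting mentioned in the description can be illustrated with a minimal sketch. This is not SORTA's actual implementation: it assumes character bigrams compared with a Dice coefficient, and the function names (`bigrams`, `similarity`, `shortlist`) and the example codes (`EX1`, `EX2`, `EX3`) are hypothetical. It also leaves out the Lucene retrieval step and the learning from expert-chosen matches.

```python
from typing import Dict, List, Set, Tuple


def bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams over all lowercased words in text."""
    grams: Set[str] = set()
    for word in text.lower().split():
        padded = f"^{word}$"  # mark word start and end so boundaries count
        grams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram sets, scaled to the range 0..100."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return 100.0 * 2 * len(ga & gb) / (len(ga) + len(gb))


def shortlist(value: str, codes: Dict[str, str], top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank target codes by label similarity to a free-text value."""
    scored = [(code, similarity(value, label)) for code, label in codes.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical target coding system: code -> label (assumption, not real MET codes).
codes = {"EX1": "running", "EX2": "swimming", "EX3": "cycling"}

# A misspelled free-text value; a human expert would confirm the top candidate.
for code, score in shortlist("runing", codes):
    print(code, round(score, 1))
```

Because the score is computed on overlapping character fragments rather than whole words, a misspelled value such as "runing" still ranks the intended label first, which is the kind of tolerance free-text recoding needs before a human confirms the match.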
format	Online Article Text
id	pubmed-4574036
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4574036 2015-09-21 SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data. Pang, Chao; Sollie, Annet; Sijtsma, Anna; Hendriksen, Dennis; Charbon, Bart; de Haan, Mark; de Boer, Tommy; Kelpin, Fleur; Jetten, Jonathan; van der Velde, Joeri K.; Smidt, Nynke; Sijmons, Rolf; Hillege, Hans; Swertz, Morris A. Database (Oxford). Original Article. Oxford University Press 2015-09-17. /pmc/articles/PMC4574036/ /pubmed/26385205 http://dx.doi.org/10.1093/database/bav089 Text en © The Author(s) 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4574036/ https://www.ncbi.nlm.nih.gov/pubmed/26385205 http://dx.doi.org/10.1093/database/bav089

Similar Items