De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

BACKGROUND: In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers...


Bibliographic Details
Main authors:	Dalianis, Hercules; Velupillai, Sumithra
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895734/ https://www.ncbi.nlm.nih.gov/pubmed/20618985 http://dx.doi.org/10.1186/2041-1480-1-6

author	Dalianis, Hercules; Velupillai, Sumithra
collection	PubMed
description	BACKGROUND: In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident. RESULTS: We present work on the creation of two refined variants of a manually annotated Gold standard for de-identification, one created automatically, and one created through discussions among the annotators. The data is a subset from the Stockholm EPR Corpus, a data set available within our research group. These are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000 to 6,000 annotation instances, we obtained very promising results for both Gold Standards: F-score around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, 49 instances counted as false positives were verified to be true positives that the system found but the annotators had missed. CONCLUSIONS: Our intention is to make this Gold standard, The Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
format	Text
id	pubmed-2895734
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2895734 2010-07-06 J Biomed Semantics Research BioMed Central 2010-04-12 /pmc/articles/PMC2895734/ /pubmed/20618985 http://dx.doi.org/10.1186/2041-1480-1-6 Text en Copyright ©2010 Dalianis and Velupillai; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2895734/ https://www.ncbi.nlm.nih.gov/pubmed/20618985 http://dx.doi.org/10.1186/2041-1480-1-6
