Automatic Correction of Real-Word Errors in Spanish Clinical Texts

Real-word errors are misspellings that are themselves valid dictionary words, so they can only be detected from context. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being...
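The frequency-counting approach that the summary calls traditional can be illustrated with a minimal sketch. This is not the method of the article, which uses a Seq2seq model; it is a toy bigram scorer with add-one smoothing. The corpus, the confusion set, and all function names are hypothetical and exist only for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def best_candidate(prev, candidates, unigrams, bigrams):
    """Pick the candidate that is most probable after the previous word."""
    return max(candidates, key=lambda w: bigram_prob(prev, w, unigrams, bigrams))


# Hypothetical toy corpus; a real system would count over a large corpus.
corpus = [
    "el paciente presenta dolor en el pecho",
    "el paciente refiere dolor en el pecho",
    "se aplica una pomada en el pecho",
]
unigrams, bigrams = train_bigrams(corpus)

# "hecho" is a valid Spanish word, so a dictionary lookup would not flag it;
# the bigram counts prefer "pecho" after "el" in this toy corpus.
choice = best_candidate("el", ["hecho", "pecho"], unigrams, bigrams)
```

In this sketch a word is treated as a likely real-word error when another member of its confusion set scores higher in the same context. How the confusion set is built is left open here.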

Bibliographic Details
Main authors: Bravo-Candel, Daniel; López-Hernández, Jésica; García-Díaz, José Antonio; Molina-Molina, Fernando; García-Sánchez, Francisco
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8122440/
https://www.ncbi.nlm.nih.gov/pubmed/33919018
http://dx.doi.org/10.3390/s21092893
author Bravo-Candel, Daniel
López-Hernández, Jésica
García-Díaz, José Antonio
Molina-Molina, Fernando
García-Sánchez, Francisco
author_sort Bravo-Candel, Daniel
collection PubMed
description Real-word errors are misspellings that are themselves valid dictionary words, so they can only be detected from context. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus and then computing the probability that a word is a real-word error. State-of-the-art approaches, by contrast, use deep learning models that learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation model mapped erroneous sentences to their corrected versions. For this purpose, different types of errors were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus because of privacy issues when dealing with patient information. In addition, GloVe and Word2Vec pretrained word embeddings were used in order to compare their performance. Although the medicine corpus was much smaller, Seq2seq models trained on it performed better than those trained on the Wikicorpus. Nevertheless, a larger amount of clinical text is required to improve the results.
format Online
Article
Text
id pubmed-8122440
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8122440 2021-05-16 Automatic Correction of Real-Word Errors in Spanish Clinical Texts Bravo-Candel, Daniel; López-Hernández, Jésica; García-Díaz, José Antonio; Molina-Molina, Fernando; García-Sánchez, Francisco Sensors (Basel) Article MDPI 2021-04-21 /pmc/articles/PMC8122440/ /pubmed/33919018 http://dx.doi.org/10.3390/s21092893 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Automatic Correction of Real-Word Errors in Spanish Clinical Texts
topic Article