Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

BACKGROUND: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers...

Bibliographic Details
Main Authors: Jimeno Yepes, Antonio; Prieur-Gaston, Élise; Névéol, Aurélie
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3651320/
https://www.ncbi.nlm.nih.gov/pubmed/23631733
http://dx.doi.org/10.1186/1471-2105-14-146
_version_ 1782269204309737472
author Jimeno Yepes, Antonio
Prieur-Gaston, Élise
Névéol, Aurélie
collection PubMed
description BACKGROUND: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. RESULTS: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. CONCLUSIONS: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.
format Online
Article
Text
id pubmed-3651320
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3651320 2013-05-14 BMC Bioinformatics Research Article
BioMed Central 2013-04-30 /pmc/articles/PMC3651320/ /pubmed/23631733 http://dx.doi.org/10.1186/1471-2105-14-146 Text en Copyright © 2013 Jimeno Yepes et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
topic Research Article