Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification

Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a...

Bibliographic Details
Main authors: Inurrieta, Uxoa; Aduriz, Itziar; Díaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7451662/
https://www.ncbi.nlm.nih.gov/pubmed/32853283
http://dx.doi.org/10.1371/journal.pone.0237767
_version_ 1783575024522756096
author Inurrieta, Uxoa
Aduriz, Itziar
Díaz de Ilarraza, Arantza
Labaka, Gorka
Sarasola, Kepa
author_sort Inurrieta, Uxoa
collection PubMed
description Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work.
format Online
Article
Text
id pubmed-7451662
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-74516622020-09-02 Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification Inurrieta, Uxoa Aduriz, Itziar Díaz de Ilarraza, Arantza Labaka, Gorka Sarasola, Kepa PLoS One Research Article Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work. Public Library of Science 2020-08-27 /pmc/articles/PMC7451662/ /pubmed/32853283 http://dx.doi.org/10.1371/journal.pone.0237767 Text en © 2020 Inurrieta et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Inurrieta, Uxoa
Aduriz, Itziar
Díaz de Ilarraza, Arantza
Labaka, Gorka
Sarasola, Kepa
Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification
title Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7451662/
https://www.ncbi.nlm.nih.gov/pubmed/32853283
http://dx.doi.org/10.1371/journal.pone.0237767