Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification
Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a...
Main authors: | Inurrieta, Uxoa; Aduriz, Itziar; Díaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7451662/ https://www.ncbi.nlm.nih.gov/pubmed/32853283 http://dx.doi.org/10.1371/journal.pone.0237767 |
_version_ | 1783575024522756096 |
---|---|
author | Inurrieta, Uxoa; Aduriz, Itziar; Díaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa |
author_facet | Inurrieta, Uxoa; Aduriz, Itziar; Díaz de Ilarraza, Arantza; Labaka, Gorka; Sarasola, Kepa |
author_sort | Inurrieta, Uxoa |
collection | PubMed |
description | Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work. |
format | Online Article Text |
id | pubmed-7451662 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-74516622020-09-02 Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification Inurrieta, Uxoa Aduriz, Itziar Díaz de Ilarraza, Arantza Labaka, Gorka Sarasola, Kepa PLoS One Research Article Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work. Public Library of Science 2020-08-27 /pmc/articles/PMC7451662/ /pubmed/32853283 http://dx.doi.org/10.1371/journal.pone.0237767 Text en © 2020 Inurrieta et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Inurrieta, Uxoa Aduriz, Itziar Díaz de Ilarraza, Arantza Labaka, Gorka Sarasola, Kepa Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification |
title | Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7451662/ https://www.ncbi.nlm.nih.gov/pubmed/32853283 http://dx.doi.org/10.1371/journal.pone.0237767 |