
Probing language identity encoded in pre-trained multilingual models: a typological view

Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”...

Full description

Bibliographic Details
Main Authors: Zheng, Jianyu, Liu, Ying
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/
https://www.ncbi.nlm.nih.gov/pubmed/35494801
http://dx.doi.org/10.7717/peerj-cs.899
_version_ 1784695088063971328
author Zheng, Jianyu
Liu, Ying
author_facet Zheng, Jianyu
Liu, Ying
author_sort Zheng, Jianyu
collection PubMed
description Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”. We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that the order of ability to preserve language identity, for both the whole model and each of its hidden layers, is: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, but the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, and they provide insights for researchers working on cross-lingual information processing.
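
To illustrate the kind of layer-wise probing the abstract describes, the sketch below pools token vectors from one mBERT hidden layer and fits a simple classifier to predict a sentence's language. This is a minimal sketch only, assuming the HuggingFace transformers and scikit-learn libraries; the checkpoint, layer index, pooling choice, and toy sentences are illustrative assumptions and do not reproduce the authors' experimental setup.

    # Minimal layer-wise probing sketch (assumed setup, not the authors' code):
    # mean-pool one mBERT hidden layer and train a logistic-regression probe
    # to predict the language of a sentence.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                      output_hidden_states=True)
    model.eval()

    def layer_embedding(sentence, layer=8):
        # hidden_states[0] is the embedding layer; 1..12 are Transformer layers.
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()       # mean-pooled sentence vector

    # Toy labelled data (hypothetical); the study uses typologically annotated corpora.
    sentences = ["The cat sleeps.", "Le chat dort.", "Die Katze schläft."]
    labels = ["en", "fr", "de"]

    probe = LogisticRegression(max_iter=1000)
    probe.fit([layer_embedding(s) for s in sentences], labels)
    print(probe.predict([layer_embedding("Der Hund bellt.")]))

Repeating the fit for each layer index, and swapping the language labels for feature-specific labels (for example, WALS-style typological categories), gives the layer-by-layer and feature-by-feature comparison that the abstract reports for mBERT, XLM, and XLM-R.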
format Online
Article
Text
id pubmed-9044357
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9044357 2022-04-28 Probing language identity encoded in pre-trained multilingual models: a typological view Zheng, Jianyu Liu, Ying PeerJ Comput Sci Artificial Intelligence Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”. We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that the order of ability to preserve language identity, for both the whole model and each of its hidden layers, is: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, but the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, and they provide insights for researchers working on cross-lingual information processing. PeerJ Inc. 2022-03-15 /pmc/articles/PMC9044357/ /pubmed/35494801 http://dx.doi.org/10.7717/peerj-cs.899 Text en ©2022 Zheng and Liu https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle Artificial Intelligence
Zheng, Jianyu
Liu, Ying
Probing language identity encoded in pre-trained multilingual models: a typological view
title Probing language identity encoded in pre-trained multilingual models: a typological view
title_full Probing language identity encoded in pre-trained multilingual models: a typological view
title_fullStr Probing language identity encoded in pre-trained multilingual models: a typological view
title_full_unstemmed Probing language identity encoded in pre-trained multilingual models: a typological view
title_short Probing language identity encoded in pre-trained multilingual models: a typological view
title_sort probing language identity encoded in pre-trained multilingual models: a typological view
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/
https://www.ncbi.nlm.nih.gov/pubmed/35494801
http://dx.doi.org/10.7717/peerj-cs.899
work_keys_str_mv AT zhengjianyu probinglanguageidentityencodedinpretrainedmultilingualmodelsatypologicalview
AT liuying probinglanguageidentityencodedinpretrainedmultilingualmodelsatypologicalview