Probing language identity encoded in pre-trained multilingual models: a typological view
Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that the models preserve at encoding time, i.e., “language identity”...
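The record's abstract describes probing the hidden layers of pre-trained multilingual models (mBERT, XLM, XLM-R) for language identity and typological features. As a rough illustration of how such a layer-wise probe can be set up, here is a minimal sketch using HuggingFace Transformers and scikit-learn; the toy sentences, mean pooling, logistic-regression probe, and language-ID labels are illustrative assumptions, not the authors' actual data or protocol.

```python
# Minimal sketch (not the paper's protocol): probe each hidden layer of a
# multilingual encoder for language identity with a logistic-regression probe.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-multilingual-cased"  # or "xlm-roberta-base", etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Toy corpus of (sentence, language label) pairs; a real probe would use a
# large, balanced multilingual sample.
corpus = [
    ("The cat sleeps on the mat.", "en"),
    ("Le chat dort sur le tapis.", "fr"),
    ("Die Katze schläft auf der Matte.", "de"),
    ("El gato duerme sobre la alfombra.", "es"),
] * 25  # repeated only so the toy probe has enough points to split

def layer_embeddings(sentences):
    """Mean-pooled sentence vectors for the embedding layer + every Transformer layer."""
    per_layer = None
    with torch.no_grad():
        for text in sentences:
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**enc).hidden_states  # tuple of (1, seq_len, dim) tensors
            vecs = [h.mean(dim=1).squeeze(0).numpy() for h in hidden]
            if per_layer is None:
                per_layer = [[] for _ in vecs]
            for i, v in enumerate(vecs):
                per_layer[i].append(v)
    return [np.stack(layer) for layer in per_layer]

sentences, labels = zip(*corpus)
labels = list(labels)
for layer_idx, X in enumerate(layer_embeddings(list(sentences))):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.3, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer_idx:2d}  language-ID probe accuracy: {probe.score(X_te, y_te):.2f}")
```

Replacing the language-ID labels with typological feature values (for example, WALS-style word-order categories assigned per language) would give the feature-level variant of the same layer-wise probe.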
Main Authors: | Zheng, Jianyu; Liu, Ying |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2022 |
Subjects: | Artificial Intelligence |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/ https://www.ncbi.nlm.nih.gov/pubmed/35494801 http://dx.doi.org/10.7717/peerj-cs.899 |
author | Zheng, Jianyu; Liu, Ying |
---|---|
collection | PubMed |
description | Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that the models preserve at encoding time, i.e., “language identity”. We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that, for both the whole model and each of its hidden layers, the order of ability to preserve language identity is mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, whereas the ability of mBERT fluctuates sharply. Our findings summarize the ability of each pre-trained multilingual model, and of each of its hidden layers, to store language identity and typological features, and they provide insights for later research on cross-lingual information processing. |
format | Online Article Text |
id | pubmed-9044357 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9044357 2022-04-28 Probing language identity encoded in pre-trained multilingual models: a typological view. Zheng, Jianyu; Liu, Ying. PeerJ Comput Sci (Artificial Intelligence). PeerJ Inc., 2022-03-15. /pmc/articles/PMC9044357/ /pubmed/35494801 http://dx.doi.org/10.7717/peerj-cs.899 Text en ©2022 Zheng and Liu. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
title | Probing language identity encoded in pre-trained multilingual models: a typological view |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/ https://www.ncbi.nlm.nih.gov/pubmed/35494801 http://dx.doi.org/10.7717/peerj-cs.899 |