
Probing language identity encoded in pre-trained multilingual models: a typological view

Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”...

Full description

Bibliographic Details
Main Authors: Zheng, Jianyu, Liu, Ying
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/
https://www.ncbi.nlm.nih.gov/pubmed/35494801
http://dx.doi.org/10.7717/peerj-cs.899
_version_ 1784695088063971328
author Zheng, Jianyu
Liu, Ying
author_facet Zheng, Jianyu
Liu, Ying
author_sort Zheng, Jianyu
collection PubMed
description Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”. We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that the order of ability to preserve language identity, for both the whole model and each of its hidden layers, is: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, but the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, and they provide insights for researchers working on cross-lingual information processing.
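
To illustrate the kind of layer-wise probing the abstract describes, the sketch below pools token vectors from one mBERT hidden layer and fits a simple classifier to predict a sentence's language. This is a minimal sketch only, assuming the HuggingFace transformers and scikit-learn libraries; the checkpoint, layer index, pooling choice, and toy sentences are illustrative assumptions and do not reproduce the authors' experimental setup.

    # Minimal layer-wise probing sketch (assumed setup, not the authors' code):
    # mean-pool one mBERT hidden layer and train a logistic-regression probe
    # to predict the language of a sentence.
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                      output_hidden_states=True)
    model.eval()

    def layer_embedding(sentence, layer=8):
        # hidden_states[0] is the embedding layer; 1..12 are Transformer layers.
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()       # mean-pooled sentence vector

    # Toy labelled data (hypothetical); the study uses typologically annotated corpora.
    sentences = ["The cat sleeps.", "Le chat dort.", "Die Katze schläft."]
    labels = ["en", "fr", "de"]

    probe = LogisticRegression(max_iter=1000)
    probe.fit([layer_embedding(s) for s in sentences], labels)
    print(probe.predict([layer_embedding("Der Hund bellt.")]))

Repeating the fit for each layer index, and swapping the language labels for feature-specific labels (for example, WALS-style typological categories), gives the layer-by-layer and feature-by-feature comparison that the abstract reports for mBERT, XLM, and XLM-R.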
format Online
Article
Text
id pubmed-9044357
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9044357 2022-04-28 Probing language identity encoded in pre-trained multilingual models: a typological view Zheng, Jianyu Liu, Ying PeerJ Comput Sci Artificial Intelligence Pre-trained multilingual models have been extensively used in cross-lingual information processing tasks. Existing work focuses on improving the transfer performance of pre-trained multilingual models but ignores the linguistic properties that models preserve at encoding time—“language identity”. We investigated the capability of state-of-the-art pre-trained multilingual models (mBERT, XLM, XLM-R) to preserve language identity through language typology. We explored model differences and variations in terms of languages, typological features, and internal hidden layers. We found that the order of ability to preserve language identity, for both the whole model and each of its hidden layers, is: mBERT > XLM-R > XLM. Furthermore, all three models capture morphological, lexical, word order and syntactic features well, but perform poorly on nominal and verbal features. Finally, our results show that the ability of XLM-R and XLM remains stable across layers, but the ability of mBERT fluctuates severely. Our findings summarize the ability of each pre-trained multilingual model and its hidden layers to store language identity and typological features, and they provide insights for researchers working on cross-lingual information processing. PeerJ Inc. 2022-03-15 /pmc/articles/PMC9044357/ /pubmed/35494801 http://dx.doi.org/10.7717/peerj-cs.899 Text en ©2022 Zheng and Liu https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle Artificial Intelligence
Zheng, Jianyu
Liu, Ying
Probing language identity encoded in pre-trained multilingual models: a typological view
title Probing language identity encoded in pre-trained multilingual models: a typological view
title_full Probing language identity encoded in pre-trained multilingual models: a typological view
title_fullStr Probing language identity encoded in pre-trained multilingual models: a typological view
title_full_unstemmed Probing language identity encoded in pre-trained multilingual models: a typological view
title_short Probing language identity encoded in pre-trained multilingual models: a typological view
title_sort probing language identity encoded in pre-trained multilingual models: a typological view
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044357/
https://www.ncbi.nlm.nih.gov/pubmed/35494801
http://dx.doi.org/10.7717/peerj-cs.899
work_keys_str_mv AT zhengjianyu probinglanguageidentityencodedinpretrainedmultilingualmodelsatypologicalview
AT liuying probinglanguageidentityencodedinpretrainedmultilingualmodelsatypologicalview