
Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets

With the massive use of social media today, mixing between languages in social media text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-mixing. The prevalence of code-mixing exposes various concerns and challenges in natural language processing (NLP), including la...

Bibliographic Details
Main authors: Hidayatullah, Ahmad Fathan; Apong, Rosyzie Anna; Lai, Daphne T.C.; Qazi, Atika
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2023
Subjects: Computational Linguistics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10319257/
https://www.ncbi.nlm.nih.gov/pubmed/37409088
http://dx.doi.org/10.7717/peerj-cs.1312
author Hidayatullah, Ahmad Fathan
Apong, Rosyzie Anna
Lai, Daphne T.C.
Qazi, Atika
collection PubMed
description With social media now in widespread use, mixing languages within social media text is prevalent. In linguistics, this phenomenon is known as code-mixing. The prevalence of code-mixing raises various concerns and challenges for natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and of how the annotation standards were constructed. We also discuss some challenges encountered during corpus creation. Then, we investigate several strategies for developing code-mixed language identification models, such as fine-tuned BERT, BLSTM-based models, and CRF. Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques. This is the result of BERT’s ability to understand each word’s context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
format Online
Article
Text
id pubmed-10319257
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-10319257 2023-07-05 Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets Hidayatullah, Ahmad Fathan Apong, Rosyzie Anna Lai, Daphne T.C. Qazi, Atika PeerJ Comput Sci Computational Linguistics PeerJ Inc. 2023-06-22 /pmc/articles/PMC10319257/ /pubmed/37409088 http://dx.doi.org/10.7717/peerj-cs.1312 Text en ©2023 Hidayatullah et al.
https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets
topic Computational Linguistics