Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets
With the massive use of social media today, mixing between languages in social media text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-mixing. The prevalence of code-mixing exposes various concerns and challenges in natural language processing (NLP), including la...
Main authors: | Hidayatullah, Ahmad Fathan; Apong, Rosyzie Anna; Lai, Daphne T.C.; Qazi, Atika |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2023 |
Subjects: | Computational Linguistics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10319257/ https://www.ncbi.nlm.nih.gov/pubmed/37409088 http://dx.doi.org/10.7717/peerj-cs.1312 |
_version_ | 1785068209550917632 |
---|---|
author | Hidayatullah, Ahmad Fathan Apong, Rosyzie Anna Lai, Daphne T.C. Qazi, Atika |
author_facet | Hidayatullah, Ahmad Fathan Apong, Rosyzie Anna Lai, Daphne T.C. Qazi, Atika |
author_sort | Hidayatullah, Ahmad Fathan |
collection | PubMed |
description | With the massive use of social media today, mixing between languages in social media text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-mixing. The prevalence of code-mixing exposes various concerns and challenges in natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection and annotation standards construction procedures. Some challenges encountered during corpus creation are also discussed in this paper. Then, we investigate several strategies for developing code-mixed language identification models, such as fine-tuning BERT, BLSTM-based, and CRF. Our results show that fine-tuned IndoBERTweet models can identify languages better than the other techniques. This is the result of BERT’s ability to understand each word’s context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts. |
format | Online Article Text |
id | pubmed-10319257 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-103192572023-07-05 Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets Hidayatullah, Ahmad Fathan Apong, Rosyzie Anna Lai, Daphne T.C. Qazi, Atika PeerJ Comput Sci Computational Linguistics With the massive use of social media today, mixing between languages in social media text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-mixing. The prevalence of code-mixing exposes various concerns and challenges in natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection and annotation standards construction procedures. Some challenges encountered during corpus creation are also discussed in this paper. Then, we investigate several strategies for developing code-mixed language identification models, such as fine-tuning BERT, BLSTM-based, and CRF. Our results show that fine-tuned IndoBERTweet models can identify languages better than the other techniques. This is the result of BERT’s ability to understand each word’s context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts. PeerJ Inc. 2023-06-22 /pmc/articles/PMC10319257/ /pubmed/37409088 http://dx.doi.org/10.7717/peerj-cs.1312 Text en ©2023 Hidayatullah et al. 
https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Computational Linguistics Hidayatullah, Ahmad Fathan Apong, Rosyzie Anna Lai, Daphne T.C. Qazi, Atika Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets |
title | Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets |
title_sort | corpus creation and language identification for code-mixed indonesian-javanese-english tweets |
topic | Computational Linguistics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10319257/ https://www.ncbi.nlm.nih.gov/pubmed/37409088 http://dx.doi.org/10.7717/peerj-cs.1312 |
work_keys_str_mv | AT hidayatullahahmadfathan corpuscreationandlanguageidentificationforcodemixedindonesianjavaneseenglishtweets AT apongrosyzieanna corpuscreationandlanguageidentificationforcodemixedindonesianjavaneseenglishtweets AT laidaphnetc corpuscreationandlanguageidentificationforcodemixedindonesianjavaneseenglishtweets AT qaziatika corpuscreationandlanguageidentificationforcodemixedindonesianjavaneseenglishtweets |