Evaluation of IndoBERT as a Context-Based Embedding for Emotion Classification in Indonesian Text
Alfa Natasya Limbong, Dr. Sigit Priyanta, S.Si., M.Kom.
2025 | Thesis | Master of Artificial Intelligence
In the digital era, social media platforms such as Twitter have become important spaces for people to express emotions. However, emotion analysis of Indonesian text remains challenging because static embeddings such as Word2Vec and FastText are less capable of fully capturing emotional context. This study proposes using IndoBERT as an emotion classification model based on contextual embeddings for Indonesian-language Twitter text. The dataset is the Twitter Emotion Dataset, consisting of 4,401 tweets across five emotion categories, processed through several preprocessing stages (lowercasing, emoticon conversion, slang normalization, abbreviation replacement, and stopword removal) together with backtranslation augmentation. The model was evaluated using accuracy, precision, recall, and F1-score, and compared with IndoBERT Base and with static embeddings. The results show that IndoBERT Large consistently outperforms IndoBERT Base, with the best performance achieved in the scenario of preprocessing without stopwords combined with hyperparameter tuning, yielding 79% accuracy, 81% precision, 79% recall, and an 80% F1-score. This is higher than the baseline IndoBERT Large without tuning (77% accuracy, 78% F1-score) and far above the static FastText embeddings (F1-score of 65.36-69.23%). These findings indicate that contextual embeddings from IndoBERT, particularly the Large variant, capture emotional nuance in Indonesian text more effectively than static embedding approaches, and they highlight the importance of preprocessing strategy and hyperparameter tuning in improving emotion classification accuracy.
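The preprocessing stages listed in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's pipeline: the lookup tables, the `preprocess` function, and the `remove_stopwords` flag are all hypothetical, since the abstract does not name the emoticon, slang, abbreviation, or stopword resources that were used.

```python
import re

# Hypothetical, tiny lookup tables for illustration only; the abstract does
# not specify the actual resources used in the thesis.
EMOTICONS = {":)": "senang", ":(": "sedih"}
SLANG = {"gak": "tidak", "bgt": "banget"}
ABBREVIATIONS = {"yg": "yang", "dgn": "dengan"}
STOPWORDS = {"yang", "dan", "di", "ini"}


def preprocess(text: str, remove_stopwords: bool = True) -> str:
    """Apply the stages named in the abstract, in the order they are listed."""
    # 1. Lowercasing
    text = text.lower()
    # 2. Emoticon conversion: replace each emoticon with a descriptive word
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    # Simple word tokenization (an assumption; any tokenizer could be used)
    tokens = re.findall(r"\w+", text)
    # 3. Slang normalization
    tokens = [SLANG.get(t, t) for t in tokens]
    # 4. Abbreviation replacement
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # 5. Optional stopword removal, so both scenarios can be compared
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("Film yg ini gak seru :("))
    print(preprocess("Film yg ini gak seru :(", remove_stopwords=False))
```

Keeping stopword removal behind a flag makes it straightforward to run the same data through scenarios with and without that step, which is the kind of comparison the abstract reports.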
Keywords: IndoBERT, Emotion Classification, Natural Language Processing (NLP), Contextual Embedding, Backtranslation, Preprocessing, Twitter.
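The four evaluation metrics named in the abstract (accuracy, precision, recall, and F1-score) can be computed for a multi-class task as in the sketch below. The abstract does not state how per-class scores were averaged, so macro-averaging is an assumption here, and the function name and emotion labels are purely illustrative.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Illustrative labels only; the abstract does not name the five categories.
    y_true = ["joy", "joy", "anger", "sadness"]
    y_pred = ["joy", "anger", "anger", "sadness"]
    print(classification_metrics(y_true, y_pred))
```

With macro-averaging, every class contributes equally regardless of how many tweets it contains; a weighted average would instead favor the more frequent classes, which can give noticeably different figures on an imbalanced dataset.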