Optimizing Source-Target Language Pair Selection in Transfer Learning for Machine Translation of Indonesian Regional Languages as Low-Resource Languages Using a Linguistic Similarity Metric Approach
Coveeta Kosambi; Yunita Sari, S.Kom., M.Sc., Ph.D.; Rifki Afina Putri, S.T., M.S., Ph.D.
2025 | Undergraduate Thesis | Computer Science
The decline in the use of regional languages in Indonesia, such as Ngaju, Madura, and Banjar, has prompted the need for innovative approaches to language preservation through technology. One potential approach is the use of machine translation systems designed specifically for low-resource languages (LRLs). The main challenge in translating regional languages is the limited amount of parallel data available to train a model effectively.
This study proposes a strategy for selecting source–target language pairs in a transfer learning scenario for translating Indonesian regional languages that are classified as low-resource. The approach uses linguistic similarity metrics, such as Jaccard Similarity, Levenshtein Distance, FastText Similarity, and combinations thereof, to determine the most linguistically relevant source language before transfer learning is carried out. To test the effectiveness of this strategy, the study implemented a Transformer-based translation system built on the pre-trained mBART50 model. A baseline trained directly on pairs of Indonesian and each of several regional languages, without transfer learning, was compared with two transfer scenarios: Naive Transfer, where the source language is the one with the largest amount of data, and transfer where the source language is selected by linguistic similarity metrics.
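As a rough illustration of similarity-based source selection, the sketch below scores candidate source languages against a target language using Jaccard similarity over vocabularies and a normalized Levenshtein similarity over aligned word pairs, then picks the highest-scoring candidate. It is not the thesis's implementation: the function names, the equal weighting of the two metrics, the assumption that word lists are aligned by meaning, and the example inputs are all hypothetical. FastText similarity is omitted because it depends on pre-trained embeddings.

```python
from typing import Dict, List


def jaccard_similarity(a: List[str], b: List[str]) -> float:
    """Vocabulary overlap: |A & B| / |A | B| over the sets of words."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: List[str], b: List[str]) -> float:
    """Mean of 1 - distance / max_length over word pairs aligned by position.

    Assumption: a[i] and b[i] are translations of the same concept.
    """
    pairs = list(zip(a, b))
    if not pairs:
        return 0.0
    total = 0.0
    for x, y in pairs:
        longest = max(len(x), len(y)) or 1
        total += 1.0 - levenshtein_distance(x, y) / longest
    return total / len(pairs)


def combined_score(a: List[str], b: List[str], weight: float = 0.5) -> float:
    """Weighted mix of the two metrics; the 0.5 weight is an arbitrary assumption."""
    return (weight * jaccard_similarity(a, b)
            + (1.0 - weight) * levenshtein_similarity(a, b))


def select_source_language(target_words: List[str],
                           candidates: Dict[str, List[str]],
                           weight: float = 0.5) -> str:
    """Return the candidate language name with the highest combined score."""
    return max(candidates,
               key=lambda name: combined_score(target_words, candidates[name], weight))
```

A caller would pass one word list for the target language and a dictionary mapping each candidate source language to its own word list, for example `select_source_language(target_words, {"candidate_a": words_a, "candidate_b": words_b})`. The selected language would then serve as the source in the transfer learning step.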
The experiments showed that source language selection based on linguistic similarity provides consistent but limited performance improvements over the baseline without transfer, with BLEU gains of +0.25 to +0.48, ROUGE-L gains of up to +0.38, and chrF gains of up to +0.69 in some scenarios. The best results vary depending on the target language, but strategies based on FastText and metric combinations tend to yield more stable and superior results than Jaccard or Levenshtein alone. The study therefore concludes that considering semantic similarity between languages when selecting transfer learning pairs can improve the quality of low-resource language translation, although the improvement is relatively small on the mBART50 architecture.
Keywords: Neural Machine Translation, Natural Language Processing, Transfer Learning, Transformer, Pre-trained mBART, Low-Resource Language, Regional Languages, BLEU Score