LOCATION AND TOPIC EXTRACTION FROM INDONESIAN-LANGUAGE ONLINE NEWS

AZIZ MUSTIKA AJI, Anny Kartika Sari, S.Si., M.Sc., Ph.D

2018 | Undergraduate Thesis | Bachelor's in Computer Science

As internet technology has developed, the internet has changed the way people read news. News that was once available only in print and electronic media is now also available online. Searching for news by location and topic has become very important since news began to spread rapidly over the internet. This research builds an information extraction system that extracts, in particular, the location and topic of an online news text. Location extraction is performed using named entity recognition (NER) with the Hidden Markov Model (HMM) method. The topic extraction method used is the term frequency and proportional document frequency (TF*PDF) algorithm. The resulting term weights are then used to infer which topics are trending. The outcome of this research is a system that extracts the location and topic of online news within a given time period. Location extraction on the training data was evaluated with k-fold cross-validation, yielding an accuracy of 0.914. Precision, recall, and F1-score, however, cannot yet be considered good, at 0.530, 0.439, and 0.468 respectively. The result of topic extraction is a list of unit vectors of the terms identified as having the highest weights.
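
To illustrate how an HMM can tag location tokens, the following is a minimal sketch of Viterbi decoding, the standard way to find the most likely label sequence under an HMM. It is not the thesis's implementation: the two-label tag set (`O`, `LOC`), the function name `viterbi`, the `unk_p` fallback for unseen words, and all probabilities are hypothetical values chosen by hand. In a real system the probabilities would presumably be estimated from annotated training data.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most likely state sequence for tokens under an HMM.

    unk_p is an assumed fallback emission probability for unseen words.
    """
    if not tokens:
        return []

    def emit(state, token):
        return math.log(emit_p[state].get(token, unk_p))

    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in states}]
    for token in tokens[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, back = max(
                (prev[p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + emit(s, token), back)
        best.append(column)

    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters, set by hand for illustration only.
STATES = ["O", "LOC"]
START_P = {"O": 0.9, "LOC": 0.1}
TRANS_P = {"O": {"O": 0.8, "LOC": 0.2}, "LOC": {"O": 0.6, "LOC": 0.4}}
EMIT_P = {
    "O": {"banjir": 0.3, "melanda": 0.3, "di": 0.3},
    "LOC": {"jakarta": 0.5, "selatan": 0.3},
}

tokens = ["banjir", "melanda", "jakarta", "selatan"]
tags = viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P)
locations = [tok for tok, tag in zip(tokens, tags) if tag == "LOC"]
print(list(zip(tokens, tags)), locations)
```

Working in log space avoids numerical underflow on long sentences. The tokens tagged `LOC` are the candidate location mentions.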

With the rapid development of internet technologies, the ways people find information have shifted significantly. In particular, the internet has changed how people access and read news. Print and electronic media used to be the only means of communicating information, but in today's digital era information can also be accessed through online resources. Since the internet became a fast-growing medium for spreading information, the ability to search for and retrieve information based on what happened and where it happened has become very important. In this research, an information extraction system was developed to filter online information efficiently and quickly based on the location of an occurrence and the trending topic in a certain range of time. The classification model for location extraction was designed and developed using the Hidden Markov Model (HMM). The model was used to label text and was expected to identify the location of an occurrence accurately. Term weighting with the Term Frequency and Proportional Document Frequency (TF*PDF) algorithm was used to extract the most-discussed topic from online news articles in a certain period of time. The term weights were then analyzed to reach a conclusion about the trending topic in that range of time. The result is an extraction system that filters online information based on the location of an occurrence and the trending topic in a certain period of time. Location extraction on the training data was evaluated using k-fold cross-validation, which gave an accuracy of 0.914. However, the precision, recall, and F1-score cannot be considered good, reaching only 0.530, 0.439, and 0.468 respectively. Topic extraction produced a list of unit vectors of the terms identified as having the highest weights.
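
As a rough illustration of the term weighting named in the abstract, the sketch below computes TF*PDF weights as the algorithm is commonly defined: within each news channel (source), a term's frequency is normalized by the Euclidean norm of all term frequencies in that channel, multiplied by the exponential of the fraction of that channel's documents containing the term, and the products are summed over channels. The function name `tf_pdf`, the channel names, and the sample tokens are hypothetical, and the thesis may differ in preprocessing and in how channels are defined.

```python
import math
from collections import Counter


def tf_pdf(channels):
    """Rank terms by TF*PDF weight.

    channels: dict mapping a channel name to a list of tokenized documents.
    Returns a list of (term, weight) pairs, highest weight first.
    """
    weights = Counter()
    for docs in channels.values():
        if not docs:
            continue
        term_freq = Counter()  # occurrences of each term in the channel
        doc_freq = Counter()   # number of documents containing each term
        for doc_tokens in docs:
            term_freq.update(doc_tokens)
            doc_freq.update(set(doc_tokens))
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue
        n_docs = len(docs)
        for term, freq in term_freq.items():
            # Normalized term frequency times exp(proportional document frequency).
            weights[term] += (freq / norm) * math.exp(doc_freq[term] / n_docs)
    return weights.most_common()


# Hypothetical sample data: two news sites with two short articles each.
sample = {
    "site_a": [["banjir", "jakarta", "banjir"], ["banjir", "bandung"]],
    "site_b": [["gempa", "lombok"], ["banjir", "jakarta"]],
}
for term, weight in tf_pdf(sample):
    print(f"{term}\t{weight:.3f}")
```

Dividing by the norm turns each channel's frequency counts into a unit vector, so a channel that publishes more text does not dominate the ranking. The exponential rewards terms that appear in many documents of the same channel.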

Keywords: location extraction, topic extraction, NER, HMM, TF*PDF

  1. S1-2018-316797-abstract.pdf  
  2. S1-2018-316797-bibliography.pdf  
  3. S1-2018-316797-tableofcontent.pdf  
  4. S1-2018-316797-title.pdf