IMPLEMENTING GPT-5 ARTIFICIAL INTELLIGENCE TO EVALUATE SECOND-YEAR MEDICAL STUDENTS’ ESSAY ASSIGNMENTS: A STUDY OF RELIABILITY AND FEEDBACK QUALITY
Jody Erlangga, dr. Rachmadya Nur Hidayah, MSc, PhD; Prof. dr. Gandes Retno Rahayu, MMedEd, PhD
2026 | Thesis | Master’s Program in Medical Education (S2 Ilmu Pendidikan Kedokteran)
Background: The use of artificial intelligence in medical education is expanding and creates opportunities to support assessment and the provision of feedback. In formative essay assignments, timely, consistent, and high-quality feedback is essential. However, lecturer scoring and feedback are often time-consuming. Evidence on the reliability of GPT-5 scoring and the perceived quality of GPT-5 feedback in medical education remains limited.
Objective: To evaluate the reliability of GPT-5-based scoring against lecturer scoring and to compare the perceived quality of written feedback from GPT-5 and from the lecturer on second-year medical students’ formative essays.
Methods: This quantitative study involved 61 second-year preclinical medical students in a research methodology block. Students completed four formative essay questions. GPT-5 and the lecturer independently scored the essays and provided written feedback using the same rubric. Scoring reliability was analysed using the intraclass correlation coefficient, while feedback quality was rated by students and experts using a narrative feedback quality rubric. Paired comparisons were analysed using the Wilcoxon signed-rank test.
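The abstract reports ICC[3,1] but not how it was computed. A minimal sketch, assuming the Shrout and Fleiss two-way mixed-effects, consistency, single-rater definition with the F-based 95% confidence interval of McGraw and Wong (1996); the function name `icc_3_1` and the toy scores are illustrative, not the study's data or software. Because the consistency form removes the rater (column) effect, a constant offset between raters does not lower this ICC, so good ICC(3,1) agreement can coexist with one rater scoring systematically higher.

```python
import numpy as np
from scipy import stats


def icc_3_1(ratings, alpha=0.05):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    ratings: shape (n_subjects, k_raters); hypothetically one row per essay,
    column 0 = lecturer, column 1 = GPT-5. Returns (icc, ci_lower, ci_upper).
    """
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * np.sum((y.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((y.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((y - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Consistency definition: the rater effect (ss_cols) is excluded.
    icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

    # F-based confidence interval (McGraw & Wong, 1996).
    df1, df2 = n - 1, (n - 1) * (k - 1)
    f_obs = ms_rows / ms_err
    f_low = f_obs / stats.f.ppf(1 - alpha / 2, df1, df2)
    f_up = f_obs * stats.f.ppf(1 - alpha / 2, df2, df1)
    lower = (f_low - 1) / (f_low + k - 1)
    upper = (f_up - 1) / (f_up + k - 1)
    return float(icc), float(lower), float(upper)


# Toy data (not study data): 4 essays scored by two raters.
scores = np.array([[1, 2], [2, 2], [3, 4], [4, 5]])
icc, lo, hi = icc_3_1(scores)
print(f"ICC(3,1) = {icc:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```

With these toy scores the point estimate is 44/47, so the first value printed is 0.936. Libraries such as `pingouin.intraclass_corr` report the same ICC3 form alongside the other Shrout and Fleiss variants.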
Results: Agreement between GPT-5 and lecturer scoring was good (ICC[3,1]=0.808; 95% CI, 0.699–0.880). Students and experts rated GPT-5 feedback higher than lecturer feedback (p<0>
Conclusion: GPT-5 showed good reliability but tended to assign higher scores, so calibration and human oversight are required. GPT-5 has potential as a complement to lecturers in formative assessment rather than as a primary replacement.
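The tendency of GPT-5 to score higher can be quantified with a Bland–Altman analysis, a common complement to the ICC: the mean of the paired differences (GPT-5 minus lecturer) estimates systematic bias, and bias ± 1.96 standard deviations gives the 95% limits of agreement. The abstract does not say whether this analysis was performed; the sketch below is an illustrative assumption, with a hypothetical function name `bland_altman` and hypothetical scores.

```python
import numpy as np


def bland_altman(gpt_scores, lecturer_scores):
    """Return (bias, lower_loa, upper_loa) for paired scores.

    Differences are GPT-5 minus lecturer, so a positive bias means
    GPT-5 scores higher on average.
    """
    diff = np.asarray(gpt_scores, float) - np.asarray(lecturer_scores, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return float(bias), float(bias - 1.96 * sd), float(bias + 1.96 * sd)


# Hypothetical paired essay scores (not study data).
gpt5 = [80, 75, 90, 85, 70]
lecturer = [76, 72, 85, 83, 66]

bias, lo, hi = bland_altman(gpt5, lecturer)
print(f"bias = {bias:.2f}, 95% limits of agreement {lo:.2f} to {hi:.2f}")
```

Here the bias is 3.60 points in favour of the hypothetical GPT-5 scores; a bias of this kind is exactly what the consistency form of the ICC ignores.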
Introduction: The growing use of artificial intelligence in medical education offers new opportunities to support assessment and feedback. In formative essay assignments, timely, consistent, and high-quality feedback is important for learning, but lecturer scoring and feedback can be time-consuming. Evidence on the reliability of scoring and the perceived quality of GPT-5 feedback in medical education remains limited.
Purpose: To evaluate the reliability of GPT-5-based scoring compared with lecturer scoring and to compare the perceived quality of written feedback generated by GPT-5 and by a lecturer for second-year medical students’ formative essays.
Methods: This quantitative study involved 61 second-year preclinical medical students enrolled in a research methodology block. Students completed a four-item formative essay assignment. GPT-5 and the lecturer independently scored the essays and generated written feedback using the same rubric. Feedback sources were anonymized before evaluation. Scoring reliability was assessed using the intraclass correlation coefficient, and feedback quality was rated by students and experts using a narrative feedback quality rubric. Paired comparisons were analysed using Wilcoxon signed-rank tests.
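A minimal sketch of the paired comparison, assuming each student rated both the GPT-5 and the lecturer feedback on the same rubric total, and using SciPy's `scipy.stats.wilcoxon`; the arrays are hypothetical ratings, not study data. The test ranks the absolute paired differences, so it does not require the differences to be normally distributed, which suits bounded rubric totals.

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings (not study data): each student rates the
# GPT-5 feedback and the lecturer feedback with the same rubric.
gpt5 = np.array([18, 17, 19, 19, 19, 16, 19, 18])
lecturer = np.array([14, 15, 12, 16, 13, 15, 14, 10])

# Two-sided Wilcoxon signed-rank test on the paired differences.
# zero_method="wilcox" (the default) drops pairs with a zero difference.
res = stats.wilcoxon(gpt5, lecturer, alternative="two-sided")
median_diff = float(np.median(gpt5 - lecturer))
print(f"W = {res.statistic}, p = {res.pvalue:.4f}, median difference = {median_diff}")
```

Every hypothetical difference is positive, so the two-sided statistic (the smaller of the positive and negative rank sums) is 0 and the median difference is 4.5 points.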
Results: Reliability between GPT-5 and lecturer scoring was good (ICC[3,1]=0.808; 95% CI, 0.699–0.880). Students rated GPT-5 feedback higher than lecturer feedback (18.00±1.853 vs 14.72±5.583; p<0>
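The label "good" for ICC = 0.808 implies a set of cut-offs that the abstract does not name. A minimal sketch, assuming the widely used Koo and Li (2016) bands (below 0.50 poor, 0.50 to 0.75 moderate, 0.75 to 0.90 good, above 0.90 excellent); the helper `koo_li_label` is hypothetical, and its handling of exact boundary values is a choice the guideline leaves open. Koo and Li recommend interpreting the band from the 95% CI rather than the point estimate alone.

```python
def koo_li_label(icc: float) -> str:
    """Map an ICC value to the Koo & Li (2016) bands (assumed guideline)."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# Reported point estimate and 95% CI bounds.
for value in (0.808, 0.699, 0.880):
    print(value, koo_li_label(value))
```

Under these assumed bands the point estimate and upper bound fall in "good", while the lower bound (0.699) falls in "moderate", so the interval spans moderate to good reliability.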
Conclusion: GPT-5 showed good reliability relative to lecturer scoring but tended to assign higher scores, indicating the need for calibration and human oversight. GPT-5 feedback was perceived as higher quality, supporting its use as an adjunct in formative assessment rather than a standalone replacement.
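The conclusion calls for calibration without specifying a method. One simple option, sketched here as an assumption rather than the thesis's procedure, is to fit a linear map from GPT-5 scores to lecturer scores on a lecturer-scored calibration subset with `numpy.polyfit`, then apply it to new GPT-5 scores; the function names `fit_linear_calibration` and `calibrate` and all scores are hypothetical.

```python
import numpy as np


def fit_linear_calibration(gpt_scores, lecturer_scores):
    """Least-squares fit of lecturer_score ≈ slope * gpt_score + intercept."""
    slope, intercept = np.polyfit(np.asarray(gpt_scores, float),
                                  np.asarray(lecturer_scores, float), deg=1)
    return float(slope), float(intercept)


def calibrate(gpt_score, slope, intercept):
    """Map a raw GPT-5 score onto the lecturer's scale."""
    return slope * gpt_score + intercept


# Hypothetical calibration subset (not study data): GPT-5 scores run high.
gpt_cal = [70, 80, 85, 90]
lect_cal = [54, 62, 66, 70]

slope, intercept = fit_linear_calibration(gpt_cal, lect_cal)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"raw GPT-5 score 75 -> calibrated {calibrate(75, slope, intercept):.1f}")
```

A linear map corrects both a constant offset and a difference in scale, but it still treats lecturer scores as the reference, so a lecturer must keep scoring a calibration sample.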
Keywords: Artificial intelligence, Assessment, Reliability, Essay, Feedback