Handling Imbalanced Data Using Cascade Modeling for Tuberculosis Classification Based on Chest X-Ray Images
Nurraudya Tuz Zahra, Wahyono, S.Kom., Ph.D
2023 | Thesis | Master of Artificial Intelligence
In medical diagnosis, class imbalance is common because large clinical datasets rarely contain equal numbers of samples per class. Class imbalance occurs when some classes (minority classes) have far fewer samples than others (majority classes). This is the case in the TBX11K tuberculosis dataset, where tuberculosis samples are far fewer than non-tuberculosis samples. Such an imbalance can degrade a classification model, particularly its performance on the minority class. To address this problem, this research uses a Cascade Modeling approach with the Random Forest (RF) classification method.
Cascade Modeling consists of a series of models or processing steps that are executed sequentially. First, a Random Forest (RF) model classifies samples as Non-Tuberculosis or Tuberculosis. For the Non-Tuberculosis samples, a new RF model then classifies them as healthy or sick. The process is repeated until each sample is assigned to a single class. In the pre-processing stage, images are resized from 512×512 to 224×224 so that they can be used as input to the base feature extraction model. Features are extracted with the VGG16 architecture and then used as input to the RF classifier.
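The cascade described above can be illustrated with a minimal sketch. It is not the thesis implementation: the label names, the `CLASS_TREE` layout, and the toy 2-D features (standing in for VGG16 features) are assumptions made for illustration, and `NearestCentroid` is a deliberately simple stand-in classifier so that the example runs with the standard library alone. Any classifier with `fit` and `predict` methods, such as scikit-learn's `RandomForestClassifier`, could be passed as `make_classifier` instead.

```python
from collections import defaultdict


class NearestCentroid:
    """Stand-in classifier (NOT Random Forest) with a fit/predict interface."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] += 1
        # Mean feature vector per label.
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def nearest(row):
            return min(
                self.centroids_,
                key=lambda lab: sum(
                    (a - b) ** 2 for a, b in zip(row, self.centroids_[lab])
                ),
            )

        return [nearest(row) for row in X]


def leaves(node):
    """Return the set of final class labels under a node of the class tree."""
    if isinstance(node, str):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)


class CascadeNode:
    """One binary stage of the cascade; recurses until a single class remains."""

    def __init__(self, tree, make_classifier):
        self.subtrees = tree
        self.left_labels = leaves(tree[0])
        self.clf = make_classifier()
        # A child stage exists only where a branch still holds several classes.
        self.children = [
            None if isinstance(sub, str) else CascadeNode(sub, make_classifier)
            for sub in tree
        ]

    def fit(self, X, y):
        # Binary target for this stage: 0 = left branch, 1 = right branch.
        side = [0 if label in self.left_labels else 1 for label in y]
        self.clf.fit(X, side)
        for idx, child in enumerate(self.children):
            if child is not None:
                rows = [(x, lab) for x, lab, s in zip(X, y, side) if s == idx]
                child.fit([r[0] for r in rows], [r[1] for r in rows])
        return self

    def predict(self, X):
        return [self._predict_one(x) for x in X]

    def _predict_one(self, x):
        idx = self.clf.predict([x])[0]
        child = self.children[idx]
        return self.subtrees[idx] if child is None else child._predict_one(x)


# Hypothetical class tree: stage 1 separates non-TB from TB,
# stage 2 separates healthy from sick within the non-TB branch.
CLASS_TREE = (("healthy", "sick"), "tuberculosis")

# Toy, imbalanced training data (made-up 2-D features).
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 5], [1, 5], [6, 5]]
y_train = ["healthy"] * 4 + ["sick"] * 2 + ["tuberculosis"]

model = CascadeNode(CLASS_TREE, NearestCentroid).fit(X_train, y_train)
predictions = model.predict([[5.5, 5], [0.5, 4.5], [0.5, 0.2]])
print(predictions)  # ['tuberculosis', 'sick', 'healthy']
```

Each stage only has to solve a binary problem on the samples that reach it, which is the design idea behind using a cascade for imbalanced classes. The cost is that every sample may pass through more than one classifier at prediction time.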
The results show that the Cascade Modeling approach improved performance on the minority classes, especially the Active Tuberculosis (ATB) class. However, the model without the Cascade approach was more time-efficient in the classification process than the model with it.
Keywords: Imbalanced Data, Chest X-Ray Data, Cascade Modeling, Convolutional Neural Network (CNN) feature extraction, and Random Forest (RF).