SQL Injection Detection with NLP as Preparation for Connecting the “SEMBADA MAJU” Population Administration Application to the Internet
DWI RANGGA RHADITYA SIWI WIDODO, Drs. Janoe Hendarto, M.I.Kom.
2024 | Undergraduate Thesis | Computer Science
SEMBADA MAJU, the Jumeneng Community Database System, is a desktop-based population application built to support population administration in Dukuh Jumeneng. The application is kept off the internet because of the risk of data theft. This research aims to reduce that risk by using NLP to detect SQL injection before a statement is executed. Four classification models and three pre-processing options are evaluated. The models are logistic regression, BiLSTM, TextCNN, and ResNet; the pre-processing options are generalization, elimination, and no pre-processing. The purpose is to measure how well each classification model detects SQL injection and how each pre-processing option affects the model's performance and processing time.
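The abstract does not define the two pre-processing methods, so the following is only a minimal sketch under an assumed reading: "generalization" replaces literal values with placeholder tokens, while "elimination" drops them. The regular expressions, the `STR`/`NUM` tokens, and the function names are hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical literal patterns; the thesis may use different rules.
_STRING = re.compile(r"'[^']*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+\b")   # standalone integer literals


def generalize(query: str) -> str:
    """Assumed 'generalization': replace literals with placeholder tokens."""
    query = query.lower()
    query = _STRING.sub(" STR ", query)
    query = _NUMBER.sub(" NUM ", query)
    return " ".join(query.split())  # collapse repeated whitespace


def eliminate(query: str) -> str:
    """Assumed 'elimination': remove literals entirely."""
    query = query.lower()
    query = _STRING.sub(" ", query)
    query = _NUMBER.sub(" ", query)
    return " ".join(query.split())


if __name__ == "__main__":
    sample = "SELECT * FROM users WHERE id = 1 OR '1'='1'"
    print(generalize(sample))
    print(eliminate(sample))
```

Under this reading, generalization keeps the position and type of each literal, whereas elimination discards that information. The thesis results may or may not stem from this difference.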
Each combination of classification model and pre-processing method is trained on a small dataset containing both SQL injection and regular statements. The performance and training time of each combination are evaluated at this stage. The trained model is then validated by having it classify a larger dataset consisting entirely of SQL injection, to determine whether it can identify SQL injections that are not included in the training dataset. The accuracy and the average processing time of each combination, from pre-processing through classification, are evaluated at this stage.
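The validation step described above can be sketched as follows. Because every validation sample is an injection, accuracy here is simply the fraction of inputs flagged as injection. The `keyword_rule` classifier is a toy stand-in used only to make the sketch runnable; it is not one of the models from the thesis, and the function names are hypothetical.

```python
import time
from typing import Callable, Iterable, Tuple


def validate_on_injections(
    classify: Callable[[str], bool],
    injections: Iterable[str],
) -> Tuple[float, float]:
    """Return (accuracy, mean seconds per input) on an all-injection set.

    `classify` should include its own pre-processing so that the timing
    covers the whole path from raw string to predicted label.
    """
    correct = 0
    total = 0
    start = time.perf_counter()
    for text in injections:
        correct += bool(classify(text))  # every sample is truly an injection
        total += 1
    elapsed = time.perf_counter() - start
    if total == 0:
        raise ValueError("validation set is empty")
    return correct / total, elapsed / total


def keyword_rule(text: str) -> bool:
    """Toy stand-in classifier (assumption, not a model from the thesis)."""
    lowered = text.lower()
    return " or " in lowered or "union" in lowered


if __name__ == "__main__":
    samples = [
        "' OR 1=1 --",
        "1 UNION SELECT password FROM users",
        "admin'--",
    ]
    accuracy, seconds_per_input = validate_on_injections(keyword_rule, samples)
    print(f"accuracy={accuracy:.2f}, mean time={seconds_per_input:.6f}s")
```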
The evaluation results show that the generalization pre-processing method tends to help models achieve their best performance. At the training stage, almost every model achieved ~99% accuracy, except for ResNet with elimination as the pre-processing method. This combination failed and classified all inputs as SQL injection, so its accuracy was only ~34%. Validation on the larger dataset shows a performance decline across all models, with logistic regression’s accuracy dropping drastically to ~19%, indicating its inability to generalize. The deep learning models, on the other hand, perform better. BiLSTM shows accuracy in the range of 83-91%, while ResNet shows accuracy in the range of 78-92%. For both models, the best performance is achieved with generalization as pre-processing. For TextCNN, performance changes little across the pre-processing methods, with accuracy in the range of 84-86%.
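A model that labels every input as SQL injection scores an accuracy equal to the share of injections in the evaluation data. The sketch below illustrates this arithmetic with hypothetical counts (34 injections, 66 regular statements) chosen only to reproduce a ~34% figure; the actual dataset composition is not given in the abstract.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


if __name__ == "__main__":
    # Hypothetical split: 1 = SQL injection, 0 = regular statement.
    labels = [1] * 34 + [0] * 66
    always_injection = [1] * len(labels)  # degenerate classifier
    print(accuracy(always_injection, labels))
```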
In conclusion, deep learning models such as BiLSTM, TextCNN, and ResNet generalize better to unseen data than a traditional machine learning model like logistic regression. Among the pre-processing methods, generalization tends to yield the best accuracy, albeit with slightly higher processing times. The added time remains negligible, since at worst a model needs only 0.002 seconds to process a single input string.
Keywords: BiLSTM, TF-IDF, TextCNN, ResNet, logistic regression, pre-processing, SQL injection, natural language processing