HANDLING IMBALANCE PROBLEM IN HATE SPEECH CLASSIFICATION USING SAMPLING-BASED METHODS
HENG, Rathpisey, Teguh Bharata Adji, S.T., M.T., M.Eng., Ph.D.
2020 | Thesis | MAGISTER TEKNIK ELEKTRO

A social network is an online platform that enables people to build relationships by communicating with each other freely across the world through tweets, messages, and opinions. Unfortunately, this platform is misused by some people to bully, insult, attack, and discriminate against others on the basis of attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity. The dissemination of hate speech on social media has been increasing rapidly in recent years, and these activities are believed to be connected to some terror incidents, which has urged researchers into action to counter the issue. Hate speech detection is a well-known topic that is framed as a classification task in natural language processing research. Even though many of the datasets collected in previous studies are found to be highly imbalanced, this issue has been disregarded by most prior work, which may significantly affect the performance of the classification models. Among the state-of-the-art approaches that deal with the imbalance problem, sampling-based methods are the most effective. This study uses and compares several oversampling and undersampling methods, namely Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic sampling (ADASYN), and Random Undersampling (RUS), as solutions to the imbalanced dataset in hate speech classification. With three basic machine learning classifiers, i.e. Support Vector Machine, Logistic Regression, and Naive Bayes, the evaluation results show that the oversampling approach improves the accuracy and the overall performance of all three classifiers. Among all resampling techniques and machine learning algorithms, Logistic Regression combined with ROS performed the best, with an overall accuracy of 0.91 and an F1-score of 0.95.
Keywords: hate speech, text classification, imbalanced data, resampling, oversampling, undersampling.
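As an illustration of the comparison described in the abstract, the sketch below combines the four resampling techniques with a TF-IDF plus Logistic Regression pipeline using the imbalanced-learn library. It is a minimal sketch, not the thesis code: the dataset loader (load_hate_speech_dataset), the vectorizer settings, and the classifier hyperparameters are assumptions for illustration only, and resampling is applied inside the pipeline so that only the training data is resampled.

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical loader: returns raw tweet texts and binary hate/non-hate labels.
texts, labels = load_hate_speech_dataset()

# Hold out a test set; stratify to preserve the (imbalanced) class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The four sampling-based methods compared in the study.
samplers = {
    "ROS": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
}

for name, sampler in samplers.items():
    # Resampling happens after TF-IDF and before the classifier,
    # so the test set is never resampled.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("resample", sampler),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred, digits=3))

Swapping LogisticRegression for sklearn.svm.LinearSVC or sklearn.naive_bayes.MultinomialNB in the "clf" step reproduces the same comparison for the other two classifiers mentioned in the abstract.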