Data oversampling and feature selection for class imbalanced datasets

Authors

Krishnakuma V, Sangeetha V

DOI:

https://doi.org/10.56294/sctconf2024935

Keywords:

Imbalance Data, Data Mining (DM), Parallelization, Feature Space Projection (FSP), Minority Oversampling (MO), Kernel Canonical Correlation Adaptive Subspaces (KCCAS), Whale Optimization (WO)

Abstract

Introduction: data classification (DC) has seen significant advancements and modifications over the past few decades. Because real-world data arrive in ever-growing volumes and with skewed class distributions, they become challenging to classify; Class Imbalance (CI) is among the biggest concerns in Data Mining (DM). To address these issues, recent work has proposed MapReduce-based data parallelization of class-imbalanced datasets.
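To make the class-imbalance problem concrete, the sketch below shows the simplest baseline remedy: random minority oversampling, which duplicates minority-class samples until the class counts match. This is a generic illustration, not the paper's method; the function name and toy data are ours.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(majority - n):
            j = rng.choice(idx)  # resample an existing minority point
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out

# Toy 9:1 imbalanced dataset
X = [[i] for i in range(10)]
y = [0] * 9 + [1]
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # both classes now have 9 samples
```

Plain duplication adds no new information, which is why subspace-based synthetic oversampling methods such as the one proposed here exist.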

Method: a novel Over Sampling (OS) technique called Minority Oversampling in Kernel Canonical Correlation Adaptive Subspaces (MOKCCAS) is proposed, with the objective of minimizing data loss during Feature Space Projections (FSP). The technique exploits the continuous Feature Extraction (FE) capability of a variant of Adaptive Subspace Self-Organizing Maps (ASSOM) derived from Kernel Canonical Correlation Analysis (KCCA). Feature Selection (FS) also plays an important role in classification: the acquired dataset may contain a large volume of samples, and using every feature of every sample degrades classifier performance. Data parallelization is then performed with the MapReduce framework to meet the resulting computational requirements.
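MOKCCAS synthesises minority samples inside KCCA-derived adaptive subspaces; the details of that projection are in the paper itself. As a rough, generic sketch of the underlying idea, the snippet below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours (here in the raw feature space rather than a learned kernel subspace). All names and parameters are illustrative assumptions.

```python
import math
import random

def interpolate_minority(minority, k=2, n_new=4, seed=0):
    """SMOTE-style sketch: synthesise points on line segments between
    a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = interpolate_minority(minority)
```

Because each synthetic point is a convex combination of two minority samples, it stays inside the region the minority class occupies; performing the same interpolation in a kernel subspace, as MOKCCAS does, aims to preserve structure that the raw feature space loses under projection.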

Results: a feature selection model based on Mutated Whale Optimization (MWO) is then proposed; it produces a compact feature subset and reduces time consumption. Finally, the proposed class-balancing model is tested using a uniform distribution based enhanced adaptive neuro-fuzzy inference system (UDANFIS). Test outcomes validate the efficiency of the suggested technique in terms of precision, recall, accuracy and Error Rate (ER).
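The four evaluation measures named above are standard confusion-matrix quantities. As a minimal sketch (the helper function and toy labels are ours, not the paper's):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy and error rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    error_rate = 1.0 - accuracy
    return precision, recall, accuracy, error_rate

# Toy imbalanced ground truth (3 positives, 5 negatives) and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
p, r, a, e = classification_metrics(y_true, y_pred)
# p = 2/3, r = 2/3, a = 0.75, e = 0.25
```

On imbalanced data, precision and recall on the minority class are the more informative pair, since accuracy can look high even when the minority class is largely misclassified.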

Conclusions: the study thus proposes a novel OS approach, MOKCCAS, to reduce the loss of data during feature space projection.


Published

2024-01-01

How to Cite

Krishnakuma V, Sangeetha V. Data oversampling and feature selection for class imbalanced datasets. Salud, Ciencia y Tecnología - Serie de Conferencias [Internet]. 2024 Jan. 1 [cited 2024 Nov. 6];3:935. Available from: https://conferencias.ageditor.ar/index.php/sctconf/article/view/843