High Dimensional Unbalanced Data Classification Vs SVM Feature Selection
Background/Objectives: It is well known that the performance of classification models is prone to the class imbalance problem, which occurs when one class of data severely outnumbers the other classes. Classification models learned with Support Vector Machines (SVM) are prominent for exhibiting better generalization ability even in the presence of class imbalance. However, a high imbalance ratio has been shown to hinder SVM learning performance. With this concern, this paper presents an empirical study on the viability of SVM for feature selection from moderately and highly unbalanced datasets.
Methods/Statistical Analysis: The Support Vector Machine-Recursive Feature Elimination (SVM-RFE) wrapper feature selection method is analyzed in this study. Its performance on one document analysis dataset and two biomedical unbalanced datasets is compared with that of two prominent feature selection methods, the Chi-Square (CHI) test and Information Gain (IG), using Decision Tree and Naive Bayes classification models.
Findings: Two major findings are reported from this empirical study: 1. For the considered scenarios, classification models learned on features selected by IG and the CHI test performed better than models learned on features selected by SVM-RFE in the high class imbalance setting. 2. SVM-RFE on rebalanced data yielded better performance than SVM-RFE on the original data.
Application/Improvements: The feature selection methods considered, including SVM-RFE, yielded better performance on oversampled data than SVM-RFE on the original data. Overall, this study reports that models learned with Decision Tree exhibited better performance than models learned with the Naive Bayes classifier.
Keywords: Class Imbalance Problem, Chi-Square, Information Gain, Support Vector Machine, SVM-RFE.
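To make the two filter baselines named in the abstract concrete, the following is a minimal sketch of Chi-Square and Information Gain scoring for a single binary feature against a binary class. It is not the study's implementation: the function names (`chi_square`, `information_gain`, `rank_features`), the restriction to binary features, and the 2x2 count layout are assumptions made for illustration, and the counts in the example are invented.

```python
import math

# Assumed 2x2 layout for one binary feature (hypothetical naming):
#   a = feature present, minority class    b = feature present, majority class
#   c = feature absent,  minority class    d = feature absent,  majority class

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 feature/class contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a row or column is empty: the feature carries no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)

def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on the feature value."""
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_cond = 0.0
    if a + b > 0:
        h_cond += (a + b) / n * entropy([a, b])
    if c + d > 0:
        h_cond += (c + d) / n * entropy([c, d])
    return h_class - h_cond

def rank_features(tables, score):
    """Return feature names sorted from highest to lowest score."""
    return sorted(tables, key=lambda name: score(*tables[name]), reverse=True)

# Invented counts: 10 minority and 990 majority examples.
tables = {
    "informative": (9, 10, 1, 980),  # present mostly in the minority class
    "independent": (1, 99, 9, 891),  # present in 10% of both classes
}
ranking_chi = rank_features(tables, chi_square)
ranking_ig = rank_features(tables, information_gain)
```

Both scores are computed per feature and independently of any classifier, which is what distinguishes these filter methods from a wrapper such as SVM-RFE, where the ranking depends on a model trained on the (possibly unbalanced) data.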
This work is licensed under a Creative Commons Attribution 3.0 License.