
Using Genetic Approach for Learning from Imbalanced Text Corpora

Affiliations

  • School of Information Technology and Engineering, VIT University, Vellore - 632014, Tamil Nadu, India
  • SCOPE, Vellore Institute of Technology University, Vellore - 632014, Tamil Nadu, India

Abstract


To address the persistent problem of imbalanced data in text classification, this paper applies a Genetic Algorithm to the class imbalance problem in binary-class text data. An inherent characteristic of imbalanced data is the highly uneven distribution of samples among the classes. Consequently, traditional classifiers such as Nearest Neighbor perform worse because the class of interest is under-represented. A hybrid approach combines oversampling with a Genetic Algorithm to generate artificial patterns for the minority class. The fitness function is built around avoiding overfitting and decides the stopping criterion for the generation of synthetic samples. Suitable evaluation measures are used to analyze the performance gain of the proposed hybrid learning model.
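The hybrid idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the operators (uniform crossover, Gaussian mutation), the use of hold-out balanced accuracy as the overfitting-aware fitness, and every name and parameter (`genetic_oversample`, `batch`, `patience`, `rate`, `scale`) are assumptions chosen for illustration. Documents are assumed to be non-negative term-weight vectors with labels 0 (majority) and 1 (minority).

```python
import random


def crossover(a, b, rng):
    """Uniform crossover: each feature comes from one of the two parents."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(v, rng, rate=0.1, scale=0.05):
    """Perturb a fraction of features with Gaussian noise, clipped at zero
    (assumption: features are non-negative term weights)."""
    return [max(0.0, x + rng.gauss(0.0, scale)) if rng.random() < rate else x
            for x in v]


def nn_predict(train, x):
    """1-nearest-neighbour label under squared Euclidean distance."""
    nearest = min(train,
                  key=lambda item: sum((p - q) ** 2 for p, q in zip(item[0], x)))
    return nearest[1]


def balanced_accuracy(train, holdout):
    """Mean per-class recall on a non-empty hold-out set (assumed fitness)."""
    per_class = {}
    for x, y in holdout:
        ok, n = per_class.get(y, (0, 0))
        per_class[y] = (ok + (nn_predict(train, x) == y), n + 1)
    return sum(ok / n for ok, n in per_class.values()) / len(per_class)


def genetic_oversample(train, holdout, minority=1, batch=5, patience=2, seed=0):
    """Grow the minority class with GA offspring. A batch is kept only if the
    hold-out fitness does not drop; generation stops once the classes are
    balanced or `patience` consecutive batches were rejected."""
    rng = random.Random(seed)
    parents = [x for x, y in train if y == minority]  # needs >= 2 samples
    n_major = sum(1 for _, y in train if y != minority)
    n_minor = len(parents)
    augmented = list(train)
    best = balanced_accuracy(augmented, holdout)
    stalls = 0
    while stalls < patience and n_minor < n_major:
        k = min(batch, n_major - n_minor)  # never overshoot the majority size
        children = []
        for _ in range(k):
            a, b = rng.sample(parents, 2)
            children.append((mutate(crossover(a, b, rng), rng), minority))
        score = balanced_accuracy(augmented + children, holdout)
        if score >= best:
            augmented += children
            parents += [c for c, _ in children]  # offspring may breed later
            n_minor += k
            best, stalls = score, 0
        else:
            stalls += 1  # rejected: the batch hurt hold-out performance
    return augmented, best
```

Calling `genetic_oversample(train, holdout)` with lists of `(vector, label)` pairs returns the augmented training set and the final hold-out score. Rejecting batches that lower the hold-out score is one simple way to tie the stopping criterion to overfitting; other fitness definitions would fit the same loop.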

Keywords

Genetic Algorithm, Imbalanced Data, Nearest Neighbor, Oversampling, Synthetic Data, Text Data.




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.