MODC: Multi-Objective Distance based Optimal Document Clustering by GA

Affiliations

  • Department of Computer Science and Engineering, MRCE, Hyderabad - 500100, Telangana State, India
  • Department of Computer Science, Sri Venkateswara University, Tirupathi - 517502, Andhra Pradesh, India
  • Annamacharya PG College of Computer Studies, Rajampet - 516126, Andhra Pradesh, India

Abstract


Background/Objective: Unsupervised learning of text documents is an essential part of knowledge discovery and data mining. Concept, context and semantic relevance are important factors specific to text mining, whereas they are out of scope in unsupervised learning of record-structured data. Methods/Statistical Analysis: Most current benchmark document clustering models rely on term frequency and do not consider concept, context and semantic relations during clustering. To address this, our earlier work introduced novel document clustering approaches, one of which is Document Clustering by Conceptual, Contextual and Semantic Relevance (DC3SR). The lessons learned from the empirical study of that contribution motivated us to propose a Multi-Objective Distance based optimal document Clustering (MODC) approach that optimizes the resultant clusters using the well-known evolutionary computation technique called the Genetic Algorithm. Findings: The significant contributions of this proposal are feature formation by concept, context and semantic relevance, and optimization of the resultant clusters by a genetic algorithm. We propose an unsupervised learning approach that forms the initial clusters by estimating the similarity between any two documents through a concept, context and semantic relevance score, and then optimizes those clusters with a genetic algorithm. The method represents concept as the correlation between arguments and activities in the given documents and context as the correlation between the meta-text of the documents; semantic relevance is assessed by estimating the similarity between documents through the hyponyms of the arguments. The meta-text considered for context assessment contains the list of authors, the list of keywords and the list of document versioning time schedules.
Application/Improvements: Experiments were conducted to assess the significance of the proposed model. The results indicate that MODC performs exceptionally well under varying document counts, with a cluster formation accuracy of 97%. Dimensionality reduction by concept, context and semantic relevance is left as a future enhancement of the proposed model.
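
As a rough illustration of how a distance built from several objectives might be assembled, the sketch below combines a concept, a context and a semantic distance into one score. The abstract does not state the formulas, so everything here is an assumption made for illustration: the use of Jaccard distance over the meta-text fields, the equal default weights, and the function and field names (`jaccard_distance`, `context_distance`, `combined_distance`, `authors`, `keywords`, `versions`) are hypothetical and are not taken from the paper.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def context_distance(meta_a, meta_b):
    """Average Jaccard distance over the meta-text fields named in the abstract.

    Assumption: each document's meta-text is a dict of iterables keyed by
    'authors', 'keywords' and 'versions'. The paper may define this differently.
    """
    fields = ("authors", "keywords", "versions")
    total = sum(
        jaccard_distance(meta_a.get(f, ()), meta_b.get(f, ())) for f in fields
    )
    return total / len(fields)


def combined_distance(concept_d, context_d, semantic_d, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three distances (assumed aggregation, equal weights)."""
    w_concept, w_context, w_semantic = weights
    weighted = w_concept * concept_d + w_context * context_d + w_semantic * semantic_d
    return weighted / (w_concept + w_context + w_semantic)
```

A genetic algorithm could then use such a combined score inside its fitness function, for example by preferring cluster assignments with small within-cluster distances. That step is not shown here because the abstract gives no details of the encoding or the fitness definition.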

Keywords

Concept Distance, Context Distance, Document Clustering, Meta-text, MODC, Multi Objective Distance Function, Text Mining, Unsupervised Learning.





Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.