
COCUS: Concept Based Document Clustering by Corpus Utility Scale


  • Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur – 522502, Andhra Pradesh, India


Objective: As the number of documents in corpora grows, data management and data assurance must be highly interoperable so that critical documents can be retrieved from a vast range of services. Focusing on semantic features, which could improve the accuracy of document tracing and retrieval, could effectively address the issues and limitations of present models. Methods/Statistical Analysis: This study examines the robustness and efficacy of clustering techniques based on semantic features, compared with other kinds of clustering techniques. It proposes concept-based document clustering by corpus utility scale (COCUS). The utility scale in COCUS is derived from a topic-related, selected document set that serves as a knowledge base, which enables documents to be clustered by their concept relevancy. The proposed clustering model is assessed with the standard metrics of cluster purity, inverse purity, and cluster-level harmonic mean. Experiments were carried out on datasets comprising specific kinds of literature gathered from open-access journals of various publishers. In total, 1509 documents were collected; 497 of them were used as the knowledge base and the remaining 1012 were used for the clustering process. Findings: The experimental study indicates that the proposed model is scalable and robust. The purity and harmonic mean of the resulting clusters confirm that COCUS clusters documents by their concept relevancy with 94% accuracy (the average topic-level harmonic mean of the clusters was 0.94). Application/Improvements: The computational complexity of COCUS is shown to be linear, whereas the majority of benchmark models are found to be NP-hard.
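The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes the common textbook definitions of purity and inverse purity, and assumes the harmonic mean is taken directly over those two scores. The function names and the toy data are hypothetical.

```python
from collections import Counter


def purity(clusters, topics):
    """Fraction of documents that belong to the majority topic of their cluster.

    `clusters[i]` is the cluster id assigned to document i and `topics[i]`
    is its reference topic (both names are illustrative assumptions).
    """
    if len(clusters) != len(topics) or not clusters:
        raise ValueError("inputs must be non-empty and of equal length")
    counts_per_cluster = {}
    for cluster_id, topic in zip(clusters, topics):
        counts_per_cluster.setdefault(cluster_id, Counter())[topic] += 1
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(topics)


def inverse_purity(clusters, topics):
    """Purity with the roles of clusters and topics swapped."""
    return purity(topics, clusters)


def harmonic_mean(p, ip):
    """Harmonic mean of purity and inverse purity (assumed combination)."""
    return 0.0 if p + ip == 0 else 2 * p * ip / (p + ip)


# Toy example: six documents, two clusters, two reference topics.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["a", "a", "b", "b", "b", "b"]

p = purity(clusters, topics)
ip = inverse_purity(clusters, topics)
f = harmonic_mean(p, ip)
print(p, ip, f)
```

In the toy data, one document of topic "b" lands in the cluster dominated by topic "a", so five of the six documents count towards both purity and inverse purity. A reported value such as 0.94 would be read on the same 0 to 1 scale.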


Cluster, Corpus Utility, Harmonic Mean, Text Mining.






This work is licensed under a Creative Commons Attribution 3.0 License.