
Cosine Similarity with Centroid Implication for Text Clustering of Document Files


  • Jaypee Institute of Information and Technology, Noida - 201301, Uttar Pradesh, India


Objectives: To address pairwise text comparison over large datasets using the cosine similarity metric and an adjacent method, and to develop a model for parallel processing of very large data using distributed algorithms on parallel clusters. Methods/Statistical Analysis: This work applies a map-reduce-based K-means algorithm to document files, with an effective number of clusters. It classifies text documents using a feature selection method built on cosine similarity. An efficient number of clusters is produced within a fixed number of iterations. The implementation was carried out in a Java environment. Findings: The proposed work demonstrates an approach to classifying text documents using a feature selection method. Application/Improvements: With cosine similarity, the results retrieved are improved and acceptable.
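The abstract does not give the details of the implementation, so the following is only a minimal sketch of the two operations that a cosine-similarity K-means step relies on: scoring a document vector against each cluster centroid, and recomputing a centroid as the mean of its members. The class name `CosineCentroidSketch`, its method names, and the representation of documents as dense `double[]` term-weight vectors are assumptions made for illustration. In a map-reduce formulation, the assignment would typically run in the map phase and the centroid update in the reduce phase, but that mapping is likewise an assumption rather than the authors' stated design.

```java
// Hypothetical helper class: all names here are illustrative, not taken from the paper.
final class CosineCentroidSketch {

    // Cosine similarity between two term-weight vectors of equal length.
    // Returns 0.0 if either vector has zero magnitude.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Centroid of one cluster: the component-wise mean of its member vectors.
    // Assumes the cluster has at least one member.
    static double[] centroid(double[][] members) {
        double[] mean = new double[members[0].length];
        for (double[] v : members) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= members.length;
        }
        return mean;
    }

    // Assignment step: index of the centroid with the highest cosine similarity to doc.
    static int nearestCentroid(double[] doc, double[][] centroids) {
        int best = 0;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double sim = cosine(doc, centroids[k]);
            if (sim > bestSim) {
                bestSim = sim;
                best = k;
            }
        }
        return best;
    }
}
```

One K-means iteration under these assumptions would call `nearestCentroid` for every document, group the documents by the returned index, and then call `centroid` on each group. Repeating this for a fixed number of iterations matches the stopping rule the abstract describes.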


Cosine Similarity, Document Files, Text Clustering






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.