
Sentence Clustering: A Comparative Study


  • Department of Computer Science and Engineering, IIIT, Bhubaneswar -751003, Odisha, India


Clustering is an important step in many text mining and information retrieval tasks, such as text summarization and domain identification. We are working on automatic abstractive text summarization, which requires finding tightly coupled sentences that can be merged and compressed to generate a compact abstractive summary. Our pipeline requires a clustering technique that does not take the number of clusters as input. We therefore studied two clustering techniques, density-based DBSCAN and the graph-based Markov Clustering Algorithm (MCL), in combination with several sentence-level relationships. Neither technique requires the number of clusters to be specified in advance, which is necessary for generating a summary of a text document without manual intervention. Sentence clustering is evaluated using the purity metric, and the purity of both techniques is compared with that of a baseline K-means clustering technique. MCL with some sentence-level features performs better than the others and fits into our pipeline.
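The purity metric used for evaluation can be illustrated with a minimal sketch. It assumes the standard definition of purity (each cluster is credited with its most frequent gold label, and the credited counts are divided by the total number of sentences); the abstract does not give the exact formulation the study used. The function name `purity` and the toy labels are hypothetical, and treating every distinct predicted label (including a DBSCAN noise label such as -1) as its own cluster is likewise an assumption.

```python
from collections import Counter


def purity(predicted, gold):
    """Return the purity of a clustering.

    predicted: cluster label assigned to each sentence (any hashable value).
    gold: reference class label for each sentence, in the same order.
    Assumption: every distinct predicted label counts as one cluster.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not predicted:
        raise ValueError("at least one labelled sentence is required")

    # Count gold labels inside each predicted cluster.
    per_cluster = {}
    for cluster_id, gold_label in zip(predicted, gold):
        per_cluster.setdefault(cluster_id, Counter())[gold_label] += 1

    # Credit each cluster with the size of its majority gold class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(predicted)


# Hypothetical toy data: six sentences, three predicted clusters.
predicted_labels = [0, 0, 0, 1, 1, 2]
gold_labels = ["a", "a", "b", "b", "b", "c"]
score = purity(predicted_labels, gold_labels)
print(f"purity = {score:.3f}")
```

Purity rises as clusters become smaller, reaching 1.0 when every sentence is its own cluster, so it is most informative when the compared techniques produce a similar number of clusters.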


Clustering, DBSCAN, Markov Clustering, Transition Relationship, Anaphoric Relationship.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.