Feature Selection for Automatic Categorization of Patent Documents


  • Department of Analytics, School of Computer Science and Engineering, VIT University, Vellore - 632014, Tamil Nadu, India
  • Department of Computer Science and Engineering, Konkuk University, Seoul, Korea, Republic of


Objective: With the rapid increase in the number of patent documents worldwide, demand for their automatic categorization has grown significantly. Automatic categorization organizes patent documents in digital form, replacing the time-consuming manual process. In this work, we propose a system that automatically categorizes patent documents by considering their structural information.

Methods: We propose a three-stage mechanism for automatic categorization. In the first stage, pre-processing reduces unwanted noise that can influence the categorization process; such noise includes terms that carry little structural meaning in the document. In the second stage, features are selected based on term frequencies, and feature vectors are constructed from the structural information of the patent. In the third stage, classification is performed using Random Forest (RF), Support Vector Machine (SVM), and Naïve Bayes (NB) classifiers.

Findings: The semantic structural information of a patent document was found to be an important feature set for constructing the terms of a document for categorization. The experimental results also show that feature reduction using Information Gain (IG) helps achieve higher accuracy in a reduced-dimensional space.

Applications: The results indicate the importance of the proposed method for automatic categorization of patent documents.
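The first two stages described in the abstract (noise reduction, then feature reduction with Information Gain) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the standard term-presence definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). The stop-word list, the toy documents, the category labels, and all function names are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper's actual noise filter is not specified.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "to", "in", "is", "with"}


def tokenize(text):
    """Stage 1 (assumed): lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, labels):
    """Stage 2 (assumed): IG of each term, based on term presence per document."""
    n = len(docs)
    class_counts = Counter(labels)
    base = entropy(class_counts.values())
    doc_terms = [set(tokenize(d)) for d in docs]
    vocabulary = set().union(*doc_terms)
    scores = {}
    for term in vocabulary:
        # Class distribution of documents that contain the term.
        with_term = Counter(l for terms, l in zip(doc_terms, labels) if term in terms)
        # Counter subtraction keeps only positive counts: documents without the term.
        without_term = class_counts - with_term
        n_with = sum(with_term.values())
        conditional = (n_with / n) * entropy(with_term.values()) \
            + ((n - n_with) / n) * entropy(without_term.values())
        scores[term] = base - conditional
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


# Toy corpus with made-up category labels, for illustration only.
docs = [
    "A battery electrode with lithium coating",
    "Lithium battery cell for electric storage",
    "An antenna for wireless signal transmission",
    "Wireless antenna array with signal amplifier",
]
labels = ["energy", "energy", "telecom", "telecom"]

scores = information_gain(docs, labels)
selected = select_top_k(scores, 5)
print(selected)
```

Terms that appear in every document of one category and in none of the other receive the maximum score, while terms that occur in a single document score lower and are dropped first. The reduced vocabulary would then define the feature vectors passed to the third-stage classifiers.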


Keywords: Classification, Feature Selection, Patent Categorization, Structural Information


Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.