Total views : 296

Efficient Search for Data with Numerical and Categorical Attributes Based on Dual Locality-Sensitive Hashing


  • Department of Computer Science, Chungbuk National University, Chungdae-ro 1, Seowon-ku, Cheongju, 28644, Korea, Republic of


Background/Objectives: Similarity search is a fundamental task in many domains. For large volumes of data, it is sometimes preferable to obtain approximate results instead of pursuing exact results that require long computation times. Methods/Statistical Analysis: The proposed method expresses a query using a fuzzy ball for the numerical data space and a conjunction of ground or general terms for categorical domains. It exploits a dual locality-sensitive hashing technique that constructs separate locality-sensitive hashing tables for the numerical and categorical data spaces. It determines buckets for a query by considering the relative distances of the numerical part of the query to the subspace boundaries and using concept hierarchies for the categorical attributes. It may also use a Bloom filter to select the candidate data set to be examined from among the determined buckets. Findings: The proposed approximate search technique was applied to two data sets. The portions of data to be examined were 0.02712% and 0.00084% for the first and the second data set. The average numbers of numerical buckets examined were 1.42 and 2.53 for the data sets, respectively. The figures mean that the proposed method significantly reduces the number of candidates to be considered and hence saves greatly computation cost. In addition, the proposed method is unique that it can be applied to data set with both numerical and categorical attributes. Improvements/Applications: The proposed locality-sensitive hashing method can be applied to approximate search tasks for large volume of data containing both numerical and categorical attributes.


Locality-Sensitive Hashing, Nearest Neighbor Search, Similarity Search, Space Partitioning.

Full Text:

 |  (PDF views: 222)


  • Zezula P, Amato G, Dohnal D, Batko M. Similarity search: The metric space approach. Advances in Database Systems. Purdue University, 2006.
  • Omohundro SM. Five balltree construction algorithms. Technical Report. International Computer Science Institute; 1989.
  • Bentley JL. Multidimensional binary search trees used for associative searching. Communications of the ACM. 1975; 18(9):509-17.
  • Ruan Da. Fuzzy Set Theory and Advanced Mathematical Applications. Springer Science & Business Media; 2013.
  • Cormen TH, Leiserson CE, Rivest RL. Introduction to Algorithms. The MIT Press; 2001.
  • Sugguna S, Ranjith KC, Sheela JC. State of the Art: A Summary of Semantic Image and Video Retrieval Techniques. Indian Journal of Science and Technology. 2015; 8(35):1-12.
  • Lee J, Kim DW. Ranking Tag Pairs for Music Recommendation Using Acoustic Similarity. International Journal of Fuzzy Logic and Intelligent Systems. 2015; 15(3):159-165.
  • Ganeshkumar K, Arivazhagan D. Generating a Digital Signature Based on New Cryptographic Scheme for User Authentication and Security. Indian Journal of Science and Technology. 2014; 7(6):1-5.
  • Indyk P, Motwani R. Approximate nearest neighbors: Towards removing the curse of dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing; 1998. p. 604-13.
  • Kim YG, Lee KM. Association-based Outlier Detection for Mixed Data. Indian Journal of Science and Technology. 2015: 8(25):1-6.
  • Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. Proceedings of the 25th VLDB Conference; 1999. p. 518-29.
  • Lee KM. Locality-sensitive hashing techniques for nearest neighbor search. International Journal of Fuzzy Logic and Intelligent Systems. 2012; 12(4):308-318.
  • Rajaraman A, Ullman JD. Mining of Massive Datasets. Cambridge University Press. 2011.
  • Nister D, Stewenius H. Scalable recognition with a vocabulary tree. Proceedings of IEEE CVPR; 2006. p. 2161-68.
  • Shakhnarovich G, Viola P, Darrell T. Fast pose estimation with parameter sensitive hashing. Proceedings of IEEE Int. Conference on Computer Vision; 2003. p. 750-57.
  • Salakhutdinov RR, Hinton GE. Semantic hashing. International Journal of Approximate Reasoning. 2009; 50(7):96978.
  • Jin Z, Li C, Lin Y, Cai D. Density sensitive hashing. IEEE Transactions on Cybernetics. 2014; 44(8):1362-71.
  • Weiss Y, Torralba A, Fergus R. Spectral hashing. Proceedings of Neural Information Processing Systems; 2008. p. 1753-60.
  • Lee KM. Locality-sensitive hashing for fata with categorical and numerical attributes using dual hashing. International Journal of Fuzzy Logic and Intelligent Systems.2014; 14(2):308-18.
  • Bishop CM. Pattern Recognition and Machine Learning. Springer, 2006.
  • Scholkopf B, Smola AJ. Learning with kernels. The MIT Press. 2002.
  • Lee KM. Locality sensitive hashing with extended partitioning boundaries. Applied Mechanics and Materials. 2013; 321-324:804-07.
  • Mendel JM. On a novel way of processing data that uses fuzzy sets for later use in rule-based regression and pattern classification. International Journal of Fuzzy Logic and Intelligent System. 2014; 14(1):1-7.
  • Lee KM, Lee KM, Lee CH. Statistical cluster validity indexes to consider cohesion and separation. Proceedings of 2012 International Conference on Fuzzy Theory and its Applications; 2012. p. 228-32.
  • Lee KM. Mining generalized fuzzy quantitative association rules with fuzzy generalization hierarchies. Proceedings of IFSA World Congress and 20th NAFIPS International Conference; 2001. p. 2977-82.
  • Lee KM, Lee KM. A Locality Sensitive Hashing Technique for Categorical Data. Applied Mechanics and Materials. 2013; 241-244:3159-64.


  • There are currently no refbacks.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.