Big Data Mining Techniques

Affiliations

  • Department of Computer Engineering, Jamia Millia Islamia, Delhi, India

Abstract


Objectives: This work discusses three techniques for mining big data: sampling, incremental learning, and distributed learning.

Methods: A literature survey identified the techniques that different authors have used to handle large (and streaming) data sets. For each technique, one or more algorithms were chosen and applied to large data sets. The platform was standardized across techniques: R libraries were used for every algorithm. The algorithms were compared on accuracy and time consumed.

Findings: Consistent with the existing literature, distributed learning is the best approach for large data sets in terms of accuracy and time complexity. However, when the data are streaming and real-time analysis is required, sampling or incremental learning is preferable to the distributed approach: incremental learning provides better accuracy, whereas sampling reduces time complexity.

Novelty: This study brings all three techniques together on a single platform, which has not been done before.
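To make the contrast between the two stream-oriented approaches concrete, the following minimal Python sketch may help. It is illustrative only and is not the authors' R code: the function names `reservoir_sample` and `incremental_mean` are hypothetical, reservoir sampling is used here as one common way to sample a stream of unknown length, and a running mean stands in for a real incremental learner. Sampling keeps a small fixed-size subset and analyses only that; incremental learning touches every record once and updates its state in place.

```python
import random
from typing import Iterable, List, Tuple


def reservoir_sample(stream: Iterable[float], k: int, seed: int = 0) -> List[float]:
    """Keep a uniform random sample of at most k items from a stream.

    Memory use is O(k) regardless of the stream length (assumed unknown).
    """
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def incremental_mean(stream: Iterable[float]) -> Tuple[int, float]:
    """Update a mean one record at a time without storing the stream."""
    n, mean = 0, 0.0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # running-mean update
    return n, mean
```

A short usage example under the same assumptions: `reservoir_sample(range(1, 10001), k=100)` returns 100 values whose average only approximates the true mean, while `incremental_mean(range(1, 10001))` processes all 10,000 values and returns the mean up to floating-point rounding. This mirrors the trade-off stated in the findings: the sample is cheaper to analyse, while the incremental pass is more accurate.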

Keywords

Big Data, Data Mining, Distributed Learning, Incremental Learning, Sampling.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.