
Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R


  • Atria Institute of Technology, Bangalore – 24, Karnataka, India
  • Department of ISE, Atria Institute of Technology, Bangalore - 24, Karnataka, India


Objectives: Diabetes is one of the most widespread diseases in the world, caused by hereditary factors and by poor diet. Analyzing the disease reveals facts about its symptoms, and these facts can be used to build a model that predicts diabetes. Such a model makes prediction easier, and sharing the information extracted from it with the government can help in planning welfare programs for citizens.

Method and Analysis: We use the Pima Indian diabetes dataset, which has 768 samples. The dataset is first loaded into Hive to convert it into a formatted table, and a few queries are then run on that table to extract useful information. We then use R for statistical analysis: generating graphs, calculating the Gini index, building the prediction model, and measuring the efficiency of the model.

Findings: In Hive we ran queries such as finding the distinct values in the table, which lets us analyze its different attributes. Hive also reports the time taken by each query by default, which is one of its advantages. Because a picture conveys more than words, the graphs generated in R let us analyze the dataset more easily and quickly than reading through each row. We calculate the Gini index of the attributes in R to measure the inequality among their values. We also build a prediction model using the KNN algorithm and measure its accuracy, that is, the number of sample predictions made correctly. Doing all of this in R keeps the method simple and makes it easy for a user to build the prediction model and calculate its efficiency.

Improvements: The work can be improved by performing the same operations on a much larger dataset, on the order of millions of records. The efficiency of our model is about 79%, which can be improved further.
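As a rough illustration of the two computations named in the abstract, the sketch below computes a Gini index of inequality for one attribute and runs a K Nearest Neighbor prediction with an accuracy check. It is not the authors' code: the paper works in R, while this sketch uses plain Python, and the function names (`gini_index`, `knn_predict`, `accuracy`), the choice of Euclidean distance, the value `k=3`, and the toy data points are all assumptions made for illustration. None of the numbers come from the Pima Indian dataset.

```python
import math
from collections import Counter


def gini_index(values):
    """Gini coefficient of inequality for non-negative values.

    Returns 0.0 when all values are equal; larger values mean more inequality.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance is an assumption; the abstract does not name a metric.
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)


# Hypothetical toy data: two well-separated groups labeled 0 and 1.
toy_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print(gini_index([1, 1, 1, 1]))
    print(knn_predict(toy_X, toy_y, (1.1, 1.0)))
    print(accuracy(toy_X, toy_y, [(1.0, 0.9), (5.0, 5.2)], [0, 1]))
```

In a setting like the paper's, the rows would be the 768 samples split into training and test sets, and `accuracy` would correspond to the share of test samples predicted correctly.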


Keywords: Big-Data, Gini Index, Hadoop, Hive, K Nearest Neighbor, R.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.