Naive Bayes classification
Naive Bayes classification is a simple probabilistic classification technique based on Bayes' theorem with strong (naive) independence assumptions between the features. The resulting model is called a naive Bayes classifier and belongs to the family of linear classifiers.
Our conditional model can be written as $p(C\vert F_{1},\dots ,F_{n})$, where $C$ is a dependent class variable taking a small number of values (the classes), conditioned on several feature variables $F_{1},\dots ,F_{n}$.
Using Bayes' theorem, we write:
$$p(C\vert F_{1},\dots ,F_{n})={\frac {p(C)\ p(F_{1},\dots ,F_{n}\vert C)}{p(F_{1},\dots ,F_{n})}}.\,$$
In everyday language, this equation can be summarized as:
$${\mbox{posterior}}={\frac {{\mbox{prior}}\times {\mbox{likelihood}}}{{\mbox{evidence}}}}.\,$$
The naive assumption is that the explanatory variables are conditionally independent given the class; when they are strongly dependent, the model will find it harder to predict the target variable correctly.
Advantages of the Naive Bayes model
The Naive Bayes model naturally has high bias and low variance, which makes it well suited to training on small volumes of data. It does not require choosing the form of the function linking $X$ and $Y$, and can therefore be adapted to non-linear problems.
Disadvantages of Naive Bayes
Due to its high bias, the Naive Bayes model is not ideal for large data volumes: it will not reach the best achievable performance, unlike more flexible models such as Random Forests.
Bayesian classifiers also treat each variable independently, so they won't be able to take into account information arising from the interaction between several variables.
The dataset
The pima-indians-diabetes dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict diagnostically whether a patient has diabetes, based on the diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database: in particular, all patients are female, at least 21 years old, and of Pima Indian heritage.
Below are the column names:
columns={0:'Pregnancies',
1:'Glucose',
2:'BloodPressure',
3:'SkinThickness',
4:'Insulin',
5:'BMI',
6:'DiabetesPedigreeFunction',
7:'Age',
8:'Outcome'}
#imports
import warnings
warnings.simplefilter(action='ignore')
#import the pima-indians-diabetes dataset
#print the head of the dataset
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
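A minimal loading sketch. Since the CSV is not bundled here (the file name below is an assumption), the frame is rebuilt from the five head rows shown above so the snippet is self-contained:

```python
import pandas as pd

# In the notebook the data are read from disk, e.g. (path is an assumption):
# df = pd.read_csv('pima-indians-diabetes.csv', header=None)
# To keep this sketch self-contained, rebuild the head rows shown above.
df = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                   [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
                   [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
                   [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
                   [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1]])
print(df.head())
```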
#print the basic stats
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
#build a new dataset by removing rows that contain 0's
#display statistics for this new dataset
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 |
| mean | 3.301020 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.841690 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.000000 | 56.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.085000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 21.000000 | 76.750000 | 28.400000 | 0.269750 | 23.000000 | 0.000000 |
| 50% | 2.000000 | 119.000000 | 70.000000 | 29.000000 | 125.500000 | 33.200000 | 0.449500 | 27.000000 | 0.000000 |
| 75% | 5.000000 | 143.000000 | 78.000000 | 37.000000 | 190.000000 | 37.100000 | 0.687000 | 36.000000 | 1.000000 |
| max | 17.000000 | 198.000000 | 110.000000 | 63.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
#rename the columns of your dataset with the variables in the description above
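The renaming and zero-filtering steps might look like the sketch below, run here on the five sample rows shown earlier rather than the full 768-row file (on the full data this filter leaves the 392 rows described above):

```python
import pandas as pd

columns = {0: 'Pregnancies', 1: 'Glucose', 2: 'BloodPressure',
           3: 'SkinThickness', 4: 'Insulin', 5: 'BMI',
           6: 'DiabetesPedigreeFunction', 7: 'Age', 8: 'Outcome'}

df = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                   [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
                   [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
                   [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
                   [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1]]).rename(columns=columns)

# Zeros are physiologically impossible for these measurements,
# so treat them as missing and drop the affected rows.
measured = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
clean = df[(df[measured] != 0).all(axis=1)]
print(clean.describe())
```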
Outliers
Let's take a look at the interquartile range. If $Q_{1}$ and $Q_{3}$ are the first and third quartiles respectively, then we can define an outlier as any value outside the range $${\big [}Q_{1}-k(Q_{3}-Q_{1})\ ;\ Q_{3}+k(Q_{3}-Q_{1}){\big ]}$$ with $k$ a positive constant.
#create variable q1 corresponding to the first quartile of the 'Insulin' variable
76.75
#create variable q3 corresponding to the third quartile of the 'Insulin' variable
190.0
#define the above interval with k=1.5
#display the interval
#what do you notice?
Interquartile interval: [-93.125 ; 359.875]
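The interval computation, using the Insulin quartiles obtained above:

```python
q1, q3 = 76.75, 190.0              # Insulin quartiles computed above
k = 1.5
iqr = q3 - q1
lower, upper = q1 - k * iqr, q3 + k * iqr
print(f"Interquartile interval: [{lower} ; {upper}]")  # [-93.125 ; 359.875]
```

Since insulin cannot be negative, the negative lower bound can never be crossed: only the upper bound actually flags outliers.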
#define a mask in your dataset to filter out individuals exceeding the upper bound
#take k=1.5
#display these individuals by class
1    15
0    10
Name: Outcome, dtype: int64
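A sketch of the masking step on a hypothetical mini-frame; the notebook applies the same mask to the cleaned dataset, yielding the 15 positives and 10 negatives shown above:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the cleaned dataset.
df = pd.DataFrame({'Insulin': [100, 400, 500, 90, 360],
                   'Outcome': [0, 1, 1, 0, 0]})

upper = 359.875                    # upper bound computed above with k = 1.5
outliers = df[df['Insulin'] > upper]
print(outliers['Outcome'].value_counts())
```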
#print the boxplot of this variable
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
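A boxplot sketch with illustrative insulin values (the `Agg` backend makes it runnable headlessly; the real plot uses the cleaned Insulin column):

```python
import matplotlib
matplotlib.use('Agg')              # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative insulin values; the notebook plots the real column.
insulin = pd.Series([14, 76, 125, 190, 400, 846], name='Insulin')
ax = sns.boxplot(x=insulin)        # points beyond the whiskers are the outliers
plt.savefig('insulin_boxplot.png')
```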
#create a new dataset for outliers
#print the new dataset
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | insuline_aberrant |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | False |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | False |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | True |
| 13 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | True |
Independence, correlation and normality
#perform Student's t tests on your variables 2 by 2
#arrange test results in an array called test_result
#use scipy's ttest_ind method
#print the result vector
[['Pregnancies', 'Glucose', 0.0], ['Pregnancies', 'BloodPressure', 0.0], ['Pregnancies', 'SkinThickness', 0.0], ['Pregnancies', 'Insulin', 0.0], ['Pregnancies', 'BMI', 0.0], ['Pregnancies', 'DiabetesPedigreeFunction', 1.5150486355681569e-55], ['Pregnancies', 'Age', 0.0], ['Glucose', 'BloodPressure', 0.0], ['Glucose', 'SkinThickness', 0.0], ['Glucose', 'Insulin', 9.314347797149482e-08], ['Glucose', 'BMI', 0.0], ['Glucose', 'DiabetesPedigreeFunction', 0.0], ['Glucose', 'Age', 0.0], ['BloodPressure', 'SkinThickness', 0.0], ['BloodPressure', 'Insulin', 1.247189726766044e-40], ['BloodPressure', 'BMI', 0.0], ['BloodPressure', 'DiabetesPedigreeFunction', 0.0], ['BloodPressure', 'Age', 0.0], ['SkinThickness', 'Insulin', 2.394527471199125e-78], ['SkinThickness', 'BMI', 1.103926415575936e-09], ['SkinThickness', 'DiabetesPedigreeFunction', 0.0], ['SkinThickness', 'Age', 0.02040562078074461], ['Insulin', 'BMI', 8.560440687697287e-75], ['Insulin', 'DiabetesPedigreeFunction', 0.0], ['Insulin', 'Age', 1.0405923224374392e-76], ['BMI', 'DiabetesPedigreeFunction', 0.0], ['BMI', 'Age', 0.0004073494231140429], ['DiabetesPedigreeFunction', 'Age', 0.0]]
#print the result vector size
28
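The pairwise tests can be sketched as below. The stand-in columns are random draws, so the p-values themselves are not meaningful here, but the structure (28 pairs via `itertools.combinations`) matches the output above:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Stand-in columns; in the notebook each comes from the cleaned dataframe.
rng = np.random.default_rng(0)
data = {name: rng.normal(loc=i, size=392) for i, name in enumerate(features)}

# One Student's t test per unordered pair of variables.
test_result = [[a, b, ttest_ind(data[a], data[b]).pvalue]
               for a, b in combinations(features, 2)]
print(len(test_result))            # 8 choose 2 = 28 pairs
```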
#do the same with the correlation of variables 2 by 2
#use numpy's corrcoef method
#print the correlation
#did you notice anything?
[['Pregnancies', 'Pregnancies', 0.9999999999999999], ['Pregnancies', 'Glucose', 0.19829104308052087], ['Pregnancies', 'BloodPressure', 0.21335477472245085], ['Pregnancies', 'SkinThickness', 0.0932093974054524], ['Pregnancies', 'Insulin', 0.07898362510990971], ['Pregnancies', 'BMI', -0.025347276056046256], ['Pregnancies', 'DiabetesPedigreeFunction', 0.007562116438437554], ['Pregnancies', 'Age', 0.6796084703853134], ['Glucose', 'Glucose', 1.0], ['Glucose', 'BloodPressure', 0.21002657364775343], ['Glucose', 'SkinThickness', 0.19885581885227427], ['Glucose', 'Insulin', 0.5812230123542533], ['Glucose', 'BMI', 0.20951591881842818], ['Glucose', 'DiabetesPedigreeFunction', 0.1401801799076905], ['Glucose', 'Age', 0.34364149991026494], ['BloodPressure', 'BloodPressure', 1.0], ['BloodPressure', 'SkinThickness', 0.23257118913532568], ['BloodPressure', 'Insulin', 0.09851150312787163], ['BloodPressure', 'BMI', 0.30440336850359956], ['BloodPressure', 'DiabetesPedigreeFunction', -0.01597110350582252], ['BloodPressure', 'Age', 0.3000389462787932], ['SkinThickness', 'SkinThickness', 1.0], ['SkinThickness', 'Insulin', 0.18219906133857003], ['SkinThickness', 'BMI', 0.664354866692933], ['SkinThickness', 'DiabetesPedigreeFunction', 0.16049852633674916], ['SkinThickness', 'Age', 0.16776114150160307], ['Insulin', 'Insulin', 1.0], ['Insulin', 'BMI', 0.22639651774497568], ['Insulin', 'DiabetesPedigreeFunction', 0.13590578113752144], ['Insulin', 'Age', 0.21708199090471678], ['BMI', 'BMI', 1.0], ['BMI', 'DiabetesPedigreeFunction', 0.15877104319825314], ['BMI', 'Age', 0.06981379857867923], ['DiabetesPedigreeFunction', 'DiabetesPedigreeFunction', 1.0], ['DiabetesPedigreeFunction', 'Age', 0.08502910583181746], ['Age', 'Age', 1.0]]
#print the correlation with a heatmap
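A `corrcoef` sketch on stand-in data; note `rowvar=False` so that columns are treated as variables:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(392, 8))      # stand-in for the 8 feature columns

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(X, rowvar=False)

# For the heatmap, seaborn can render the matrix directly:
# import seaborn as sns
# sns.heatmap(corr, annot=True)
print(corr.shape)
```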
#calculate a table of mutual information for 2 by 2 variables
#use sklearn's mutual_info_regression method
/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
(same warning repeated for each variable pair)
#print the result
[['Pregnancies', 'Pregnancies', 2.2736527229471135], ['Pregnancies', 'Glucose', 0.06230826884354945], ['Pregnancies', 'BloodPressure', 0.04128938009668337], ['Pregnancies', 'SkinThickness', 0.06981419164525038], ['Pregnancies', 'Insulin', 0.07144695759892006], ['Pregnancies', 'BMI', 0.02266307126704259], ['Pregnancies', 'DiabetesPedigreeFunction', 0.011517435047145419], ['Pregnancies', 'Age', 0.3589564459068493], ['Glucose', 'Glucose', 4.253121630357338], ['Glucose', 'BloodPressure', 0.025865291162842308], ['Glucose', 'SkinThickness', 0.035839119969039324], ['Glucose', 'Insulin', 0.24815944945864699], ['Glucose', 'BMI', 0.07157269657654197], ['Glucose', 'DiabetesPedigreeFunction', 0.004637248822719098], ['Glucose', 'Age', 0.08298657569152468], ['BloodPressure', 'BloodPressure', 3.1197847440079407], ['BloodPressure', 'SkinThickness', 0.06920103345723172], ['BloodPressure', 'Insulin', 0.02376809930834689], ['BloodPressure', 'BMI', 0.04098648168260466], ['BloodPressure', 'DiabetesPedigreeFunction', 0], ['BloodPressure', 'Age', 0.022668432117208148], ['SkinThickness', 'SkinThickness', 3.631481585311681], ['SkinThickness', 'Insulin', 0.05448065715432637], ['SkinThickness', 'BMI', 0.2725646949873739], ['SkinThickness', 'DiabetesPedigreeFunction', 0], ['SkinThickness', 'Age', 0.07109760661728526], ['Insulin', 'Insulin', 4.339512960116015], ['Insulin', 'BMI', 0], ['Insulin', 'DiabetesPedigreeFunction', 0.04675898235406928], ['Insulin', 'Age', 0.030836392262476586], ['BMI', 'BMI', 4.455850626134804], ['BMI', 'DiabetesPedigreeFunction', 0], ['BMI', 'Age', 0], ['DiabetesPedigreeFunction', 'DiabetesPedigreeFunction', 4.617779683472024], ['DiabetesPedigreeFunction', 'Age', 0.05967055171002], ['Age', 'Age', 3.2241209170936793]]
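A `mutual_info_regression` sketch; passing `y` as a 1-D array avoids the `DataConversionWarning` flood shown earlier. The data here are synthetic, constructed so that the first column carries most of the information about `y`:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
X = rng.normal(size=(392, 3))
y = X[:, 0] + 0.1 * rng.normal(size=392)   # y driven mostly by column 0

# y is 1-D here, which avoids the DataConversionWarning above.
mi = mutual_info_regression(X, y, random_state=0)
print(mi)                                  # largest value for column 0
```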
#separate your dataset into 2
#a dataset for positives (outcome==1)
#a dataset for negatives (outcome==0)
#check separation by displaying your dataset
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 13 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 | 1 |
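The split itself is a pair of boolean masks, sketched here on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame; the notebook uses the full cleaned dataset.
df = pd.DataFrame({'Glucose': [148, 85, 183, 89, 137],
                   'Outcome': [1, 0, 1, 0, 1]})

positives = df[df['Outcome'] == 1]
negatives = df[df['Outcome'] == 0]
print(len(positives), len(negatives))
```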
#plot a sample drawn from a normal distribution with size=10000
#test the normality of our positive data
#loop your dataset
#apply scipy's normaltest method
#finally store the associated p-value in a list
#do the same thing for the negative dataset
#display p-value table for all negatives
#what do you notice?
[['Pregnancies', 2.639388327961267e-17], ['Glucose', 1.3681430238566277e-05], ['BloodPressure', 0.07551043574204534], ['SkinThickness', 0.007786179616197784], ['Insulin', 7.767115230392163e-36], ['BMI', 0.015637691592256094], ['DiabetesPedigreeFunction', 2.6125230817653396e-27], ['Age', 2.3667823240888496e-30]]
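A `normaltest` sketch contrasting a normal sample with a skewed one; a small p-value leads to rejecting normality, which is what most columns above exhibit:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(3)
sample_normal = rng.normal(size=1000)       # should look Gaussian
sample_skewed = rng.exponential(size=1000)  # clearly non-Gaussian

# normaltest combines skew and kurtosis tests; a small p-value
# means we reject the hypothesis that the sample is normal.
p_normal = normaltest(sample_normal).pvalue
p_skewed = normaltest(sample_skewed).pvalue
print(p_normal, p_skewed)
```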
#check your assumptions above using plot
#take bins=10 for the plot
#what can you deduce about the p-value?
#instantiate a naive bayes estimator
#fit on training data (base data, not extrapolated data) and display model score
model score on the base dataset: 72.88%
#create a new dataframe from the test dataset noted control
#add to this dataset the column y := dataset ytest
#add to this dataset the column y_pred := estimator prediction
#print the confusion matrix
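A sketch of the fit/score/confusion-matrix pipeline. Since the pima data are not bundled here, `make_classification` stands in for the real train/test split, so the printed score will not match the 72.88% above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the pima features/target.
X, y = make_classification(n_samples=392, n_features=8, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

model = GaussianNB().fit(Xtrain, ytrain)
score = model.score(Xtest, ytest)

# "control" frame idea: true labels next to predictions,
# summarized by a confusion matrix.
cm = confusion_matrix(ytest, model.predict(Xtest))
print(f"model score: {score:.2%}")
print(cm)
```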
Bonus: creation of new variables & bagging
#discretize your training set
#use KBinsDiscretizer method with parameters in output
KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='quantile')
#use the KBinsDiscretizer method on your training set
#use your fit to generate new train and test datasets and apply dummies function on it
#display the shapes of your basic train dataset and that of the new dataset
#prepare for test
train shape: (274, 9) | (274, 78)
test shape: (118, 9) | (118, 78)
#print the columns of your train dataset
Index(['Pregnancies_1.0', 'Pregnancies_3.0', 'Pregnancies_5.0', 'Pregnancies_6.0', 'Pregnancies_7.0', 'Pregnancies_8.0', 'Pregnancies_9.0', 'Glucose_0.0', 'Glucose_1.0', 'Glucose_2.0', 'Glucose_3.0', 'Glucose_4.0', 'Glucose_5.0', 'Glucose_6.0', 'Glucose_7.0', 'Glucose_8.0', 'Glucose_9.0', 'BloodPressure_0.0', 'BloodPressure_1.0', 'BloodPressure_2.0', 'BloodPressure_3.0', 'BloodPressure_4.0', 'BloodPressure_5.0', 'BloodPressure_6.0', 'BloodPressure_7.0', 'BloodPressure_8.0', 'BloodPressure_9.0', 'SkinThickness_0.0', 'SkinThickness_1.0', 'SkinThickness_2.0', 'SkinThickness_3.0', 'SkinThickness_4.0', 'SkinThickness_5.0', 'SkinThickness_6.0', 'SkinThickness_7.0', 'SkinThickness_8.0', 'SkinThickness_9.0', 'Insulin_0.0', 'Insulin_1.0', 'Insulin_2.0', 'Insulin_3.0', 'Insulin_4.0', 'Insulin_5.0', 'Insulin_6.0', 'Insulin_7.0', 'Insulin_8.0', 'Insulin_9.0', 'BMI_0.0', 'BMI_1.0', 'BMI_2.0', 'BMI_3.0', 'BMI_4.0', 'BMI_5.0', 'BMI_6.0', 'BMI_7.0', 'BMI_8.0', 'BMI_9.0', 'DiabetesPedigreeFunction_0.0', 'DiabetesPedigreeFunction_1.0', 'DiabetesPedigreeFunction_2.0', 'DiabetesPedigreeFunction_3.0', 'DiabetesPedigreeFunction_4.0', 'DiabetesPedigreeFunction_5.0', 'DiabetesPedigreeFunction_6.0', 'DiabetesPedigreeFunction_7.0', 'DiabetesPedigreeFunction_8.0', 'DiabetesPedigreeFunction_9.0', 'Age_0.0', 'Age_1.0', 'Age_2.0', 'Age_3.0', 'Age_4.0', 'Age_5.0', 'Age_6.0', 'Age_7.0', 'Age_8.0', 'Age_9.0', 'insuline_aberrant_9.0'], dtype='object')
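The discretization and dummy-encoding steps might look like this sketch on stand-in data (8 continuous features, 10 quantile bins each, hence 80 dummy columns; the real counts above differ because some bins merge on the actual data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
rng = np.random.default_rng(4)
train = pd.DataFrame(rng.normal(size=(274, 8)), columns=features)

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
binned = pd.DataFrame(disc.fit_transform(train), columns=train.columns)

# One-hot encode the bin labels, mirroring the dummies step above;
# column names come out as e.g. 'Glucose_0.0', 'Glucose_1.0', ...
dummies = pd.get_dummies(binned.astype('category'))
print(train.shape, '->', dummies.shape)
```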
#fit on your training data extrapolated using KBinsDiscretizer and display this model's score
#what do you notice?
model score on the extrapolated dataset: 76.27%
#use sklearn's BaggingClassifier estimator and refine model accuracy
#print the accuracy of the model
Model accuracy on the test dataset: 72.881%
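A bagging sketch; passing the base estimator positionally keeps the call compatible with both older (`base_estimator`) and newer (`estimator`) scikit-learn signatures. `make_classification` again stands in for the real data, so the printed accuracy will not match the 72.881% above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=392, n_features=8, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

# Bag 50 naive Bayes models fit on bootstrap resamples of the train set.
bag = BaggingClassifier(GaussianNB(), n_estimators=50,
                        random_state=0).fit(Xtrain, ytrain)
acc = bag.score(Xtest, ytest)
print(f"Model accuracy on the test dataset: {acc:.3%}")
```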