Naive Bayes classification
Naive Bayes classification is a simple probabilistic classification technique based on Bayes' theorem with strong (naive) independence assumptions between the features. The resulting model is called a naive Bayes classifier and belongs to the family of linear classifiers.
Our conditional model can be written as $p(C\vert F_{1},\dots ,F_{n})$, where $C$ is a dependent class variable taking a small number of values (the classes), conditioned on several feature variables $F_{1},\dots ,F_{n}$.
Using Bayes' theorem, we write:
$$p(C\vert F_{1},\dots ,F_{n})={\frac {p(C)\ p(F_{1},\dots ,F_{n}\vert C)}{p(F_{1},\dots ,F_{n})}}.\,$$
In everyday language, this equation can be summarized as:
$${\mbox{posterior}}={\frac {{\mbox{prior}}\times {\mbox{likelihood}}}{{\mbox{evidence}}}}.\,$$
The naive assumption is that the explanatory variables are conditionally independent given the class; when they are strongly dependent, the model will find it harder to predict the target variable correctly.
Advantages of the Naive Bayes model
The Naive Bayes model naturally has high bias and low variance, which makes it well suited to training on small volumes of data. It does not require choosing the form of the function linking $X$ and $Y$, and can therefore be adapted to non-linear problems.
Disadvantages of Naive Bayes
Due to its high bias, the Naive Bayes model is not ideal for large data volumes: it will not reach the best achievable performance, unlike more flexible models such as Random Forests.
Bayesian classifiers also treat each variable independently, so they won't be able to take into account information arising from the interaction between several variables.
The dataset
The pima-indians-diabetes dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict diagnostically whether a patient has diabetes, based on the diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database: in particular, all patients are female, at least 21 years old, and of Pima Indian heritage.
Below are the column names:
columns={0:'Pregnancies',
1:'Glucose',
2:'BloodPressure',
3:'SkinThickness',
4:'Insulin',
5:'BMI',
6:'DiabetesPedigreeFunction',
7:'Age',
8:'Outcome'}
#imports
import warnings
warnings.simplefilter(action='ignore')
#import the pima-indians-diabetes dataset
#print the head of the dataset
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
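A minimal loading sketch. Since the CSV is not bundled here (the file name below is an assumption), the frame is rebuilt from the five head rows shown above so the snippet is self-contained:

```python
import pandas as pd

# In the notebook the data are read from disk, e.g. (path is an assumption):
# df = pd.read_csv('pima-indians-diabetes.csv', header=None)
# To keep this sketch self-contained, rebuild the head rows shown above.
df = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                   [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
                   [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
                   [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
                   [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1]])
print(df.head())
```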
#print the basic stats
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
#build a new dataset by removing rows that contain 0's
#display statistics for this new dataset
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 |
| mean | 3.301020 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.841690 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.000000 | 56.000000 | 24.000000 | 7.000000 | 14.000000 | 18.200000 | 0.085000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 21.000000 | 76.750000 | 28.400000 | 0.269750 | 23.000000 | 0.000000 |
| 50% | 2.000000 | 119.000000 | 70.000000 | 29.000000 | 125.500000 | 33.200000 | 0.449500 | 27.000000 | 0.000000 |
| 75% | 5.000000 | 143.000000 | 78.000000 | 37.000000 | 190.000000 | 37.100000 | 0.687000 | 36.000000 | 1.000000 |
| max | 17.000000 | 198.000000 | 110.000000 | 63.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
#rename the columns of your dataset with the variables in the description above
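The renaming and zero-filtering steps might look like the sketch below, run here on the five sample rows shown earlier rather than the full 768-row file (on the full data this filter leaves the 392 rows described above):

```python
import pandas as pd

columns = {0: 'Pregnancies', 1: 'Glucose', 2: 'BloodPressure',
           3: 'SkinThickness', 4: 'Insulin', 5: 'BMI',
           6: 'DiabetesPedigreeFunction', 7: 'Age', 8: 'Outcome'}

df = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                   [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
                   [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
                   [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
                   [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1]]).rename(columns=columns)

# Zeros are physiologically impossible for these measurements,
# so treat them as missing and drop the affected rows.
measured = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
clean = df[(df[measured] != 0).all(axis=1)]
print(clean.describe())
```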
Outliers
Let's take a look at the interquartile range. If $Q_{1}$ and $Q_{3}$ are the first and third quartiles respectively, then we can define an outlier as any value outside the range $${\big [}Q_{1}-k(Q_{3}-Q_{1})\ ;\ Q_{3}+k(Q_{3}-Q_{1}){\big ]}$$ with $k$ a positive constant.
#create variable q1 corresponding to the first quartile of the 'Insulin' variable
76.75
#create variable q3 corresponding to the third quartile of the 'Insulin' variable
190.0
#define the above interval with k=1.5
#display the interval
#what do you notice?
Interquartile interval: [-93.125 ; 359.875]
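The interval computation, using the Insulin quartiles obtained above:

```python
q1, q3 = 76.75, 190.0              # Insulin quartiles computed above
k = 1.5
iqr = q3 - q1
lower, upper = q1 - k * iqr, q3 + k * iqr
print(f"Interquartile interval: [{lower} ; {upper}]")  # [-93.125 ; 359.875]
```

Since insulin cannot be negative, the negative lower bound can never be crossed: only the upper bound actually flags outliers.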
#define a mask in your dataset to filter out individuals exceeding the upper bound
#take k=1.5
#display these individuals by class
1    15
0    10
Name: Outcome, dtype: int64
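A sketch of the masking step on a hypothetical mini-frame; the notebook applies the same mask to the cleaned dataset, yielding the 15 positives and 10 negatives shown above:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the cleaned dataset.
df = pd.DataFrame({'Insulin': [100, 400, 500, 90, 360],
                   'Outcome': [0, 1, 1, 0, 0]})

upper = 359.875                    # upper bound computed above with k = 1.5
outliers = df[df['Insulin'] > upper]
print(outliers['Outcome'].value_counts())
```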
#print the boxplot of this variable
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
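A boxplot sketch with illustrative insulin values (the `Agg` backend makes it runnable headlessly; the real plot uses the cleaned Insulin column):

```python
import matplotlib
matplotlib.use('Agg')              # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative insulin values; the notebook plots the real column.
insulin = pd.Series([14, 76, 125, 190, 400, 846], name='Insulin')
ax = sns.boxplot(x=insulin)        # points beyond the whiskers are the outliers
plt.savefig('insulin_boxplot.png')
```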
#create a new dataset for outliers
#print the new dataset
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | insuline_aberrant |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | False |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | False |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | False |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | True |
| 13 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | True |
Independence, correlation and normality
#perform Student's t tests on your variables 2 by 2
#arrange test results in an array called test_result
#use scipy's ttest_ind method
#print the result vector
[['Pregnancies', 'Glucose', 0.0], ['Pregnancies', 'BloodPressure', 0.0], ['Pregnancies', 'SkinThickness', 0.0], ['Pregnancies', 'Insulin', 0.0], ['Pregnancies', 'BMI', 0.0], ['Pregnancies', 'DiabetesPedigreeFunction', 1.5150486355681569e-55], ['Pregnancies', 'Age', 0.0], ['Glucose', 'BloodPressure', 0.0], ['Glucose', 'SkinThickness', 0.0], ['Glucose', 'Insulin', 9.314347797149482e-08], ['Glucose', 'BMI', 0.0], ['Glucose', 'DiabetesPedigreeFunction', 0.0], ['Glucose', 'Age', 0.0], ['BloodPressure', 'SkinThickness', 0.0], ['BloodPressure', 'Insulin', 1.247189726766044e-40], ['BloodPressure', 'BMI', 0.0], ['BloodPressure', 'DiabetesPedigreeFunction', 0.0], ['BloodPressure', 'Age', 0.0], ['SkinThickness', 'Insulin', 2.394527471199125e-78], ['SkinThickness', 'BMI', 1.103926415575936e-09], ['SkinThickness', 'DiabetesPedigreeFunction', 0.0], ['SkinThickness', 'Age', 0.02040562078074461], ['Insulin', 'BMI', 8.560440687697287e-75], ['Insulin', 'DiabetesPedigreeFunction', 0.0], ['Insulin', 'Age', 1.0405923224374392e-76], ['BMI', 'DiabetesPedigreeFunction', 0.0], ['BMI', 'Age', 0.0004073494231140429], ['DiabetesPedigreeFunction', 'Age', 0.0]]
#print the result vector size
28
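The pairwise tests can be sketched as below. The stand-in columns are random draws, so the p-values themselves are not meaningful here, but the structure (28 pairs via `itertools.combinations`) matches the output above:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Stand-in columns; in the notebook each comes from the cleaned dataframe.
rng = np.random.default_rng(0)
data = {name: rng.normal(loc=i, size=392) for i, name in enumerate(features)}

# One Student's t test per unordered pair of variables.
test_result = [[a, b, ttest_ind(data[a], data[b]).pvalue]
               for a, b in combinations(features, 2)]
print(len(test_result))            # 8 choose 2 = 28 pairs
```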
#do the same with the correlation of variables 2 by 2
#use numpy's corrcoef method
#print the correlation
#did you notice anything?
[['Pregnancies', 'Pregnancies', 0.9999999999999999], ['Pregnancies', 'Glucose', 0.19829104308052087], ['Pregnancies', 'BloodPressure', 0.21335477472245085], ['Pregnancies', 'SkinThickness', 0.0932093974054524], ['Pregnancies', 'Insulin', 0.07898362510990971], ['Pregnancies', 'BMI', -0.025347276056046256], ['Pregnancies', 'DiabetesPedigreeFunction', 0.007562116438437554], ['Pregnancies', 'Age', 0.6796084703853134], ['Glucose', 'Glucose', 1.0], ['Glucose', 'BloodPressure', 0.21002657364775343], ['Glucose', 'SkinThickness', 0.19885581885227427], ['Glucose', 'Insulin', 0.5812230123542533], ['Glucose', 'BMI', 0.20951591881842818], ['Glucose', 'DiabetesPedigreeFunction', 0.1401801799076905], ['Glucose', 'Age', 0.34364149991026494], ['BloodPressure', 'BloodPressure', 1.0], ['BloodPressure', 'SkinThickness', 0.23257118913532568], ['BloodPressure', 'Insulin', 0.09851150312787163], ['BloodPressure', 'BMI', 0.30440336850359956], ['BloodPressure', 'DiabetesPedigreeFunction', -0.01597110350582252], ['BloodPressure', 'Age', 0.3000389462787932], ['SkinThickness', 'SkinThickness', 1.0], ['SkinThickness', 'Insulin', 0.18219906133857003], ['SkinThickness', 'BMI', 0.664354866692933], ['SkinThickness', 'DiabetesPedigreeFunction', 0.16049852633674916], ['SkinThickness', 'Age', 0.16776114150160307], ['Insulin', 'Insulin', 1.0], ['Insulin', 'BMI', 0.22639651774497568], ['Insulin', 'DiabetesPedigreeFunction', 0.13590578113752144], ['Insulin', 'Age', 0.21708199090471678], ['BMI', 'BMI', 1.0], ['BMI', 'DiabetesPedigreeFunction', 0.15877104319825314], ['BMI', 'Age', 0.06981379857867923], ['DiabetesPedigreeFunction', 'DiabetesPedigreeFunction', 1.0], ['DiabetesPedigreeFunction', 'Age', 0.08502910583181746], ['Age', 'Age', 1.0]]
#print the correlation with a heatmap
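A `corrcoef` sketch on stand-in data; note `rowvar=False` so that columns are treated as variables:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(392, 8))      # stand-in for the 8 feature columns

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(X, rowvar=False)

# For the heatmap, seaborn can render the matrix directly:
# import seaborn as sns
# sns.heatmap(corr, annot=True)
print(corr.shape)
```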
#calculate a table of mutual information for 2 by 2 variables
#use sklearn's mutual_info_regression method
/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:761: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
(same warning repeated for each variable pair)
#print the result
[['Pregnancies', 'Pregnancies', 2.2736527229471135], ['Pregnancies', 'Glucose', 0.06230826884354945], ['Pregnancies', 'BloodPressure', 0.04128938009668337], ['Pregnancies', 'SkinThickness', 0.06981419164525038], ['Pregnancies', 'Insulin', 0.07144695759892006], ['Pregnancies', 'BMI', 0.02266307126704259], ['Pregnancies', 'DiabetesPedigreeFunction', 0.011517435047145419], ['Pregnancies', 'Age', 0.3589564459068493], ['Glucose', 'Glucose', 4.253121630357338], ['Glucose', 'BloodPressure', 0.025865291162842308], ['Glucose', 'SkinThickness', 0.035839119969039324], ['Glucose', 'Insulin', 0.24815944945864699], ['Glucose', 'BMI', 0.07157269657654197], ['Glucose', 'DiabetesPedigreeFunction', 0.004637248822719098], ['Glucose', 'Age', 0.08298657569152468], ['BloodPressure', 'BloodPressure', 3.1197847440079407], ['BloodPressure', 'SkinThickness', 0.06920103345723172], ['BloodPressure', 'Insulin', 0.02376809930834689], ['BloodPressure', 'BMI', 0.04098648168260466], ['BloodPressure', 'DiabetesPedigreeFunction', 0], ['BloodPressure', 'Age', 0.022668432117208148], ['SkinThickness', 'SkinThickness', 3.631481585311681], ['SkinThickness', 'Insulin', 0.05448065715432637], ['SkinThickness', 'BMI', 0.2725646949873739], ['SkinThickness', 'DiabetesPedigreeFunction', 0], ['SkinThickness', 'Age', 0.07109760661728526], ['Insulin', 'Insulin', 4.339512960116015], ['Insulin', 'BMI', 0], ['Insulin', 'DiabetesPedigreeFunction', 0.04675898235406928], ['Insulin', 'Age', 0.030836392262476586], ['BMI', 'BMI', 4.455850626134804], ['BMI', 'DiabetesPedigreeFunction', 0], ['BMI', 'Age', 0], ['DiabetesPedigreeFunction', 'DiabetesPedigreeFunction', 4.617779683472024], ['DiabetesPedigreeFunction', 'Age', 0.05967055171002], ['Age', 'Age', 3.2241209170936793]]
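A `mutual_info_regression` sketch; passing `y` as a 1-D array avoids the `DataConversionWarning` flood shown earlier. The data here are synthetic, constructed so that the first column carries most of the information about `y`:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
X = rng.normal(size=(392, 3))
y = X[:, 0] + 0.1 * rng.normal(size=392)   # y driven mostly by column 0

# y is 1-D here, which avoids the DataConversionWarning above.
mi = mutual_info_regression(X, y, random_state=0)
print(mi)                                  # largest value for column 0
```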
#separate your dataset into 2
#a dataset for positives (outcome==1)
#a dataset for negatives (outcome==0)
#check separation by displaying your dataset
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 13 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 | 1 |
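The split itself is a pair of boolean masks, sketched here on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame; the notebook uses the full cleaned dataset.
df = pd.DataFrame({'Glucose': [148, 85, 183, 89, 137],
                   'Outcome': [1, 0, 1, 0, 1]})

positives = df[df['Outcome'] == 1]
negatives = df[df['Outcome'] == 0]
print(len(positives), len(negatives))
```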
#plot a sample drawn from a normal distribution with size=10000
#test the normality of our positive data
#loop your dataset
#apply scipy's normaltest method
#finally store the associated p-value in a list
#do the same thing for the negative dataset
#display p-value table for all negatives
#what do you notice?
[['Pregnancies', 2.639388327961267e-17], ['Glucose', 1.3681430238566277e-05], ['BloodPressure', 0.07551043574204534], ['SkinThickness', 0.007786179616197784], ['Insulin', 7.767115230392163e-36], ['BMI', 0.015637691592256094], ['DiabetesPedigreeFunction', 2.6125230817653396e-27], ['Age', 2.3667823240888496e-30]]
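A `normaltest` sketch contrasting a normal sample with a skewed one; a small p-value leads to rejecting normality, which is what most columns above exhibit:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(3)
sample_normal = rng.normal(size=1000)       # should look Gaussian
sample_skewed = rng.exponential(size=1000)  # clearly non-Gaussian

# normaltest combines skew and kurtosis tests; a small p-value
# means we reject the hypothesis that the sample is normal.
p_normal = normaltest(sample_normal).pvalue
p_skewed = normaltest(sample_skewed).pvalue
print(p_normal, p_skewed)
```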
#check your assumptions above using plot
#take bins=10 for the plot
#what can you deduce about the p-value?
#instantiate a naive bayes estimator
#fit on training data (base data, not extrapolated data) and display model score
model score on the base dataset: 72.88%
#create a new dataframe from the test dataset noted control
#add to this dataset the column y := dataset ytest
#add to this dataset the column y_pred := estimator prediction
#print the confusion matrix
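A sketch of the fit/score/confusion-matrix pipeline. Since the pima data are not bundled here, `make_classification` stands in for the real train/test split, so the printed score will not match the 72.88% above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the pima features/target.
X, y = make_classification(n_samples=392, n_features=8, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

model = GaussianNB().fit(Xtrain, ytrain)
score = model.score(Xtest, ytest)

# "control" frame idea: true labels next to predictions,
# summarized by a confusion matrix.
cm = confusion_matrix(ytest, model.predict(Xtest))
print(f"model score: {score:.2%}")
print(cm)
```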
Bonus: creation of new variables & bagging
#discretize your training set
#use KBinsDiscretizer method with parameters in output
KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='quantile')
#use the KBinsDiscretizer method on your training set
#use your fit to generate new train and test datasets and apply dummies function on it
#display the shapes of your basic train dataset and that of the new dataset
#prepare for test
train shape: (274, 9) | (274, 78)
test shape: (118, 9) | (118, 78)
#print the columns of your train dataset
Index(['Pregnancies_1.0', 'Pregnancies_3.0', 'Pregnancies_5.0', 'Pregnancies_6.0', 'Pregnancies_7.0', 'Pregnancies_8.0', 'Pregnancies_9.0', 'Glucose_0.0', 'Glucose_1.0', 'Glucose_2.0', 'Glucose_3.0', 'Glucose_4.0', 'Glucose_5.0', 'Glucose_6.0', 'Glucose_7.0', 'Glucose_8.0', 'Glucose_9.0', 'BloodPressure_0.0', 'BloodPressure_1.0', 'BloodPressure_2.0', 'BloodPressure_3.0', 'BloodPressure_4.0', 'BloodPressure_5.0', 'BloodPressure_6.0', 'BloodPressure_7.0', 'BloodPressure_8.0', 'BloodPressure_9.0', 'SkinThickness_0.0', 'SkinThickness_1.0', 'SkinThickness_2.0', 'SkinThickness_3.0', 'SkinThickness_4.0', 'SkinThickness_5.0', 'SkinThickness_6.0', 'SkinThickness_7.0', 'SkinThickness_8.0', 'SkinThickness_9.0', 'Insulin_0.0', 'Insulin_1.0', 'Insulin_2.0', 'Insulin_3.0', 'Insulin_4.0', 'Insulin_5.0', 'Insulin_6.0', 'Insulin_7.0', 'Insulin_8.0', 'Insulin_9.0', 'BMI_0.0', 'BMI_1.0', 'BMI_2.0', 'BMI_3.0', 'BMI_4.0', 'BMI_5.0', 'BMI_6.0', 'BMI_7.0', 'BMI_8.0', 'BMI_9.0', 'DiabetesPedigreeFunction_0.0', 'DiabetesPedigreeFunction_1.0', 'DiabetesPedigreeFunction_2.0', 'DiabetesPedigreeFunction_3.0', 'DiabetesPedigreeFunction_4.0', 'DiabetesPedigreeFunction_5.0', 'DiabetesPedigreeFunction_6.0', 'DiabetesPedigreeFunction_7.0', 'DiabetesPedigreeFunction_8.0', 'DiabetesPedigreeFunction_9.0', 'Age_0.0', 'Age_1.0', 'Age_2.0', 'Age_3.0', 'Age_4.0', 'Age_5.0', 'Age_6.0', 'Age_7.0', 'Age_8.0', 'Age_9.0', 'insuline_aberrant_9.0'], dtype='object')
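The discretization and dummy-encoding steps might look like this sketch on stand-in data (8 continuous features, 10 quantile bins each, hence 80 dummy columns; the real counts above differ because some bins merge on the actual data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
rng = np.random.default_rng(4)
train = pd.DataFrame(rng.normal(size=(274, 8)), columns=features)

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
binned = pd.DataFrame(disc.fit_transform(train), columns=train.columns)

# One-hot encode the bin labels, mirroring the dummies step above;
# column names come out as e.g. 'Glucose_0.0', 'Glucose_1.0', ...
dummies = pd.get_dummies(binned.astype('category'))
print(train.shape, '->', dummies.shape)
```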
#fit on your training data extrapolated using KBinsDiscretizer and display this model's score
#what do you notice?
model score on the extrapolated dataset: 76.27%
#use sklearn's BaggingClassifier estimator and refine model accuracy
#print the accuracy of the model
Model accuracy on the test dataset: 72.881%
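A bagging sketch; passing the base estimator positionally keeps the call compatible with both older (`base_estimator`) and newer (`estimator`) scikit-learn signatures. `make_classification` again stands in for the real data, so the printed accuracy will not match the 72.881% above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=392, n_features=8, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

# Bag 50 naive Bayes models fit on bootstrap resamples of the train set.
bag = BaggingClassifier(GaussianNB(), n_estimators=50,
                        random_state=0).fit(Xtrain, ytrain)
acc = bag.score(Xtest, ytest)
print(f"Model accuracy on the test dataset: {acc:.3%}")
```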