Logistic Regression for Titanic data¶
Logistic regression is a binomial regression model. As with all binomial regression models, the aim is to fit a simple mathematical model to a large number of real observations, in other words to associate a binomial random variable, generically denoted y, with a vector of explanatory random variables. Logistic regression is a special case of a generalized linear model and is widely used in machine learning.
With linear regression, we saw that our predictor was the line drawn by our model. In logistic regression, the line is simply a boundary that separates two categories. For example, if a person will buy a product (represented by the number 1) or will not buy it (represented by the number 0), the line represents the probability that a person buys or not according to their age.
You can consult this presentation from the University of Lille for more details here.
Some technical details¶
Unlike linear regressions, which predict a number, classification models predict a category. For example, if you are trying to predict whether someone will buy a product based on certain independent variables, you are dealing with a classification problem because the categories you are trying to predict are "yes, the person will buy the product" or "no, the person will not buy the product". Logistic regression is one family of classification models, but there are many others, such as decision trees, SVMs (support vector machines), or Naive Bayes.
When building a logistic regression model, we assume there is a function $f$ that links the target variable $Y$ to the explanatory variables represented in matrix $X$ as follows:
$$ P(Y=1)=f(X)+\epsilon $$ and $$ f(X) = \frac{1}{1+\exp\bigl(-(\beta_{0}+X_{1}\beta_{1}+\dots+X_{p}\beta_{p})\bigr)} $$
where $f$ is called the logistic function. The logistic function is a sigmoid function, which means it is an S-shaped curve: it maps any real number to a value strictly between 0 and 1.
In the graph above, we are trying to determine whether a person will buy a product (represented by the number 1) or will not buy it (represented by the number 0). The curve represents the probability that a person buys or not based on their Age; its shape is the plot, for a single explanatory variable (here, Age), of the equation introduced earlier.
In this example, we only have one independent variable and a constant. The equation looks a lot like a linear regression, but here the logistic function is applied to the linear combination of explanatory variables. This function constrains the values of $f(X^{T}\beta)$ to remain in the interval [0, 1], which is the set of values a probability can take. Based on the obtained probability, the algorithm decides which category to place our individual in. Generally, if the probability is greater than 0.5, the individual is placed in category 1, otherwise in category 0; we can also report the result directly as a probability.
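To make this concrete, here is a minimal sketch of the logistic function in Python with numpy; the coefficient values are purely illustrative assumptions, not estimates from any real data:
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# purely illustrative coefficients: beta_0 (intercept) and beta_1 (for Age)
beta_0, beta_1 = -3.0, 0.08
ages = np.array([10, 25, 40, 60, 80])

probas = sigmoid(beta_0 + beta_1 * ages)   # P(Y=1 | Age) under this toy model
labels = (probas > 0.5).astype(int)        # usual 0.5 decision threshold
print(probas, labels)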
How to classify¶
Now that we have drawn the line, we can begin our interpretations. Since our model is probabilistic this time, points with a probability greater than 50% will belong to category A, while points with a probability less than 50% will belong to category B. Depending on the problem considered, another threshold may be chosen. For example, in banking fraud detection, we tend to use a threshold lower than 50%, so that individuals whose fraud probability is below 50% are still flagged as potential fraudsters, because we want to maximize the security of the banking system against fraudulent threats.
For instance, based on certain independent variables, we found that person A has a 60% chance of buying the product. She will therefore be considered a "buyer" by our model. On the other hand, if person B only has a 45% chance of buying the product, she will be considered a "non-buyer".
Dummy Example in python with scikit-learn¶
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

logisticreg = LogisticRegression()  # defining the logistic regression model to apply to the data
logisticreg.fit(X, y)  # model estimation
y_pred = logisticreg.predict(X)  # model predictions
RMSE = np.sqrt(np.mean((y_pred - y) ** 2))  # root mean squared error between predictions and reality
compare_y_ypred = pd.DataFrame()  # creating a dataframe to compare predictions and reality
compare_y_ypred['pred'] = y_pred
compare_y_ypred['y'] = y
logisticreg.score(X, y)  # model accuracy
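As a complement, here is a hedged sketch of how predict_proba can be combined with a custom decision threshold, as discussed in the classification section above; the 0.3 value is purely illustrative:
probas = logisticreg.predict_proba(X)[:, 1]    # probability of class 1 for each observation
y_pred_default = (probas > 0.5).astype(int)    # what predict() does by default
threshold = 0.3                                # illustrative stricter threshold (e.g. fraud-like problems)
y_pred_custom = (probas > threshold).astype(int)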
Example with Titanic data¶
#load the libraries and titanic data
#you can find it here: https://www.kaggle.com/c/titanic/data
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#print the dataset shape
Training set size: 891 Test set size: 418
#plot the NaN values with a heatmap
Imputing missing values¶
Imputation is a method of replacing missing values with estimated values. There are several methods for imputing missing values (a short sketch follows the list below); the most common are:
- Imputation by the mean
- Imputation by the median
- Imputation by the mode
- Imputation by the KNN method
- Imputation by the MICE method
- and many others, which you can find in this article
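As an illustration (not the exact code used later in this lab), here is a minimal sketch of median, mean, and mode imputation, assuming a toy DataFrame df with an 'Age' column:
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [22.0, None, 35.0, None, 28.0]})   # toy data with missing ages

df["Age_median"] = df["Age"].fillna(df["Age"].median())      # median imputation with plain pandas

# the same idea with scikit-learn (strategy can be "mean", "median" or "most_frequent")
imputer = SimpleImputer(strategy="median")
df["Age_sklearn"] = imputer.fit_transform(df[["Age"]]).ravel()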
Let's begin by plotting the missing values in our dataset to understand how they are distributed.
#print the age % of NaN values
19.87% of Age values are missing
#print the histogram of the age column
#print the median and the mean of the age column
mean: 29.70 median: 28.00
#print the cabin % of NaN values
77.10% of Cabin values are missing
#print the embarked distribution
Boarded passengers grouped by port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton): S 644 C 168 Q 77 Name: Embarked, dtype: int64
Define Train & Test data¶
We will make the following changes to the data:
- If "Age" is missing for a given row, we assign 28 (median age).
- If "Embarked" is missing for a given line, we assign "S" (most common embarked).
- We will ignore "Cabin" as a variable. There are too many missing values it wouldn't make sense to assign values to it.
#apply the preprocessing like above
#verify your results
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Embarked 0 dtype: int64
#print the histogram of the age column before and after processing
Feature engineering¶
Several variables in our dataset relate to whether or not the passenger travels with family (SibSp and Parch). For simplicity, we will combine these variables into a single categorical variable: whether this person was traveling alone or not. One possible sketch is shown below.
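One possible way to build this feature from the SibSp and Parch columns (same hypothetical train_df name as above):
# a passenger travels alone if they have no siblings/spouse (SibSp) and no parents/children (Parch) on board
train_df["TravelAlone"] = ((train_df["SibSp"] + train_df["Parch"]) == 0).astype(int)
train_df = train_df.drop(columns=["SibSp", "Parch"])         # no longer needed once combined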
#create the feature 'TravelAlone'
| | PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Embarked | TravelAlone
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | S | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | S | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | S | 1 |
#use the get_dummies function to encode all the categorical features needed
| | PassengerId | Survived | Name | Age | Ticket | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | A/5 21171 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | PC 17599 | 71.2833 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | STON/O2. 3101282 | 7.9250 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 113803 | 53.1000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 373450 | 8.0500 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
#drop the columns that you don't need
| | Survived | Age | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_male
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 22.0 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 38.0 | 71.2833 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 26.0 | 7.9250 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 1 | 35.0 | 53.1000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 35.0 | 8.0500 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
Do the same with the test dataset¶
A good practice is to create new datasets or a copy for each new operation.
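A hedged sketch of this practice, assuming the raw test set was loaded into a DataFrame called test_data (hypothetical name):
test_df = test_data.copy()                                            # work on a copy so the raw test data stays untouched
test_df["Age"] = test_df["Age"].fillna(28)                            # same median as the training set
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())    # one Fare value is missing in the test set
test_df["Embarked"] = test_df["Embarked"].fillna("S")
test_df = test_df.drop(columns=["Cabin"])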
#print the null value of this dataset
PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 1 Cabin 327 Embarked 0 dtype: int64
#do the same thing for the test dataset
| | Age | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_male
---|---|---|---|---|---|---|---|---|---|---|
0 | 34.5 | 7.8292 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | 47.0 | 7.0000 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 62.0 | 9.6875 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
3 | 27.0 | 8.6625 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
4 | 22.0 | 12.2875 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
#print the age feature distribution and highlight the survived feature
#add the "IsMinor" variable to your data (a person is considered a minor if they are under 16)
Passenger Class¶
#show survivors by class
Embarked Port¶
#do the same for embarked feature
Traveling Alone vs. With Family¶
#do the same for the Traveling Alone feature we've created earlier
Gender Variable¶
#gender distribution of survivors
Logistic Regression with scikit-learn¶
We saw with the lasso method how important feature selection is.
For this lab, we will use the feature selection methods from sklearn. Find out about sklearn's feature selection methods and explain how they work.
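For example, recursive feature elimination (RFE) repeatedly fits the estimator and drops the weakest feature until the requested number remains. A hedged sketch, assuming X (a DataFrame of explanatory variables) and y (the Survived column) are defined as in the next cell:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=4)          # keep only the 4 strongest features
rfe.fit(X, y)
print("Selected features:", [col for col, keep in zip(X.columns, rfe.support_) if keep])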
#define x and y for the model
#instantiate a LogisticRegression estimator in sklearn
#create a RFE model and select 4 attributes
Selected features: ['Pclass_1', 'Pclass_2', 'Sex_male', 'IsMinor']
#create a RFE model and select 8 attributes
Selected features: ['Age', 'TravelAlone', 'Pclass_1', 'Pclass_2', 'Embarked_C', 'Embarked_S', 'Sex_male', 'IsMinor']
Feature ranking with recursive feature elimination and cross-validation¶
RFECV runs the RFE method in a cross-validation loop to find the optimal number of variables, then the features selected by the cross-validation are used to fit the logistic regression.
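A hedged sketch of RFECV, with the same hypothetical X and y as above; scoring='accuracy' follows the next cell's instructions, while cv=10 is an assumption:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(estimator=LogisticRegression(), cv=10, scoring="accuracy")   # cv=10 is an assumption
rfecv.fit(X, y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected features:", [col for col, keep in zip(X.columns, rfecv.support_) if keep])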
#instantiate an RFECV object (with scoring='accuracy') and do as stated above
Optimal number of features: 8 Selected features: ['Age', 'TravelAlone', 'Pclass_1', 'Pclass_2', 'Embarked_C', 'Embarked_S', 'Sex_male', 'IsMinor']
#plot the number of features vs. cross-validation scores
Go back to our model and evaluate metrics¶
The goal here is to recompute our model with the 8 features selected by the RFECV method.
#create the new dataframe and print the shape
((891, 8), (891,))
#split your data with a test size of 20% and a random_state=42
#instantiate a LogisticRegression estimator and fit it on your data
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)
#make a prediction on the test set
#make a prediction with the predict_proba function and display
# the accuracy of the model
# the cross-entropy loss
# the area under the ROC curve (AUC)
# explain what these metrics measure
Train/Test split results: LogisticRegression accuracy is 0.804 LogisticRegression log_loss is 0.433 LogisticRegression auc is 0.872
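A hedged sketch of how these three numbers can be computed, assuming the split produced X_train, X_test, y_train, y_test and that the fitted estimator is called logreg (hypothetical names):
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

y_pred = logreg.predict(X_test)                      # hard class predictions
y_proba = logreg.predict_proba(X_test)[:, 1]         # probability of class 1
print("accuracy:", accuracy_score(y_test, y_pred))   # share of correct predictions
print("log_loss:", log_loss(y_test, y_proba))        # cross-entropy between true labels and predicted probabilities
print("auc:", roc_auc_score(y_test, y_proba))        # area under the ROC curve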
ROC Curve and AUC Score¶
The ROC (receiver operating characteristic) curve allows you to visualize the performance of a binary classification model as its discrimination criterion varies (the probability threshold above which the model classifies an observation as "positive"). It plots the true positive rate (TPR) as a function of the false positive rate (FPR) for different threshold values; each point on the curve summarizes the confusion matrix obtained at one particular threshold.
The AUC (Area Under the Curve) is the area under the ROC curve. It is a measure of the performance of a binary classification model. The higher the AUC, the better the model is at distinguishing between the two classes. The AUC is therefore a criterion for assessing the quality of a model.
These metrics are important to evaluate the performance of our model and to compare it with other models.
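A minimal plotting sketch, reusing the hypothetical y_test and y_proba from the previous sketch:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_proba)    # one (FPR, TPR) point per threshold
plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, y_proba))
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()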
#display the ROC curve
#what do you notice?
Confusion Matrix¶
One of the quick and easy ways to measure your model's performance is through confusion matrices. The idea is to see the predictions your model got right as well as the false positives and false negatives. By dividing the number of correct predictions by the total number of predictions, you get the accuracy of your model.
A simple and relevant measure of your model's performance would be to compare the model's accuracy rate with the proportion of positives in the database. Indeed, the simplest model in a classification problem is to classify all individuals in the same class. In the case of this trivial model, the accuracy rate will be equal to the proportion occupied by the majority group in the data. Suppose we have a database that gives the results of the baccalaureate for a sample population. If the sample contains 70% of individuals who passed their baccalaureate, then if our model predicts that everyone will pass, it is right 70% of the time. Therefore, it is only worth building a more complex model if its accuracy can be higher than 70%.
False Positives - False Negatives¶
Since our model is based on probabilities, it can sometimes be wrong. False positives and false negatives represent the errors made by our classification model.
False Positive: Continuing with the above example, if our model categorizes person A as a "buyer" and in reality this person does not buy the product, then we have a false positive. The model predicted a positive outcome that did not occur.
False Negative: Conversely, if person B, whom the model predicted as a non-buyer, actually buys the product, it is a false negative. We predicted a negative outcome, but the positive outcome occurred.
🚧 Be cautious of false positives and ESPECIALLY false negatives 🚧
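A minimal sketch with sklearn, reusing the hypothetical y_test and y_pred from above:
from sklearn.metrics import confusion_matrix, classification_report

# rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))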
#print the confusion matrix and the classification report (optional)
Using the cross_val_score() function¶
The cross_val_score() function is a great way to test the accuracy of our model. It will split our data into 10 parts (folds), train the model on 9 of them, and test it on the remaining one. It does this 10 times, each time using a different part for testing, and then returns the list of 10 scores obtained.
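A hedged sketch of this, with the same hypothetical logreg, X and y as before:
from sklearn.model_selection import cross_val_score

for scoring in ["accuracy", "neg_log_loss", "roc_auc"]:        # 10-fold CV with three scoring functions
    scores = cross_val_score(logreg, X, y, cv=10, scoring=scoring)
    print(scoring, scores.mean())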
#do the same with a 10-fold cross-validation on your logistic regression
#use scoring accuracy, neg_log_loss, roc_auc as before
K-fold cross-validation results (10 folds): LogisticRegression average accuracy is 0.802 LogisticRegression average log_loss is 0.454 LogisticRegression average auc is 0.850