Logistic Regression for Titanic data¶
Logistic regression is a binomial regression model. As with all binomial regression models, the aim is to fit a simple mathematical model to a large number of real observations, in other words to associate a binomial random variable, generically denoted y, with a vector of explanatory random variables. Logistic regression is a special case of a generalized linear model and is widely used in machine learning.
With linear regression, we saw that our predictor was the line drawn by our model. In logistic regression, the line is simply a boundary that separates two categories. For example, if a person will buy a product (represented by the number 1) or will not buy it (represented by the number 0), the line represents the probability that a person buys or not according to their age.
You can consult this presentation from the University of Lille for more details here.
Some technical details¶
Unlike linear regressions, which predict a number, classification models predict a category. For example, if you are trying to predict whether someone will buy a product based on certain independent variables, you are dealing with a classification problem because the categories you are trying to predict are "yes, the person will buy the product" or "no, the person will not buy the product". Logistic regression is one family of classification models, but there are many others, such as decision trees, SVMs (support vector machines), or Naive Bayes.
When building a logistic regression model, we assume there is a function $f$ that links the target variable $Y$ to the explanatory variables represented in matrix $X$ as follows:
$$ P(Y=1)=f(X)+\epsilon $$ and $$ f(X) = \frac{1}{1+\exp\bigl(-(\beta_{0}+X_{1}\beta_{1}+\dots+X_{p}\beta_{p})\bigr)} $$
where $f$ is called the logistic function. The logistic function is a sigmoid function, which means it is an S-shaped curve: it maps any real number to a value strictly between 0 and 1.
In the graph above, we are trying to determine whether a person will buy a product (represented by the number 1) or will not buy it (represented by the number 0). The curve represents the probability that a person buys or not based on their Age; its shape is the plot, for a single explanatory variable (here, Age), of the equation introduced earlier.
In this example, we only have one independent variable and a constant. The equation looks a lot like a linear regression, but here the logistic function is applied to the linear combination of explanatory variables. This function constrains the values of $f(X^{T}\beta)$ to remain in the interval [0, 1], which is the set of values a probability can take. Based on the obtained probability, the algorithm decides which category to place our individual in. Generally, if the probability is greater than 0.5, the individual is placed in category 1, otherwise in category 0; we can also report the result directly as a probability.
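To make this concrete, here is a minimal sketch of the logistic function in Python with numpy; the coefficient values are purely illustrative assumptions, not estimates from any real data:
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# purely illustrative coefficients: beta_0 (intercept) and beta_1 (for Age)
beta_0, beta_1 = -3.0, 0.08
ages = np.array([10, 25, 40, 60, 80])

probas = sigmoid(beta_0 + beta_1 * ages)   # P(Y=1 | Age) under this toy model
labels = (probas > 0.5).astype(int)        # usual 0.5 decision threshold
print(probas, labels)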
How to classify¶
Now that we have drawn the line, we can begin our interpretations. Since our model is probabilistic this time, points with a probability greater than 50% will belong to category A, while points with a probability less than 50% will belong to category B. Depending on the problem considered, another threshold may be chosen. For example, in banking fraud detection, we tend to use a threshold lower than 50%, so that individuals whose fraud probability is below 50% are still flagged as potential fraudsters, because we want to maximize the security of the banking system against fraudulent threats.
For instance, based on certain independent variables, we found that person A has a 60% chance of buying the product. She will therefore be considered a "buyer" by our model. On the other hand, if person B only has a 45% chance of buying the product, she will be considered a "non-buyer".
Dummy Example in python with scikit-learn¶
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

logisticreg = LogisticRegression()  # defining the logistic regression model to apply to the data
logisticreg.fit(X, y)  # model estimation
y_pred = logisticreg.predict(X)  # model predictions
RMSE = np.sqrt(np.mean((y_pred - y) ** 2))  # root mean squared error between predictions and reality
compare_y_ypred = pd.DataFrame()  # creating a dataframe to compare predictions and reality
compare_y_ypred['pred'] = y_pred
compare_y_ypred['y'] = y
logisticreg.score(X, y)  # model accuracy
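As a complement, here is a hedged sketch of how predict_proba can be combined with a custom decision threshold, as discussed in the classification section above; the 0.3 value is purely illustrative:
probas = logisticreg.predict_proba(X)[:, 1]    # probability of class 1 for each observation
y_pred_default = (probas > 0.5).astype(int)    # what predict() does by default
threshold = 0.3                                # illustrative stricter threshold (e.g. fraud-like problems)
y_pred_custom = (probas > threshold).astype(int)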
Example with Titanic data¶
#load the libraries and titanic data
#you can find it here: https://www.kaggle.com/c/titanic/data
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#print the dataset shape
Training set size: 891 Test set size: 418
#plot the NaN values with a heatmap
Imputing missing values¶
Imputation is a method of replacing missing values with estimated values. There are several methods for imputing missing values (a short sketch follows the list below); the most common are:
- Imputation by the mean
- Imputation by the median
- Imputation by the mode
- Imputation by the KNN method
- Imputation by the MICE method
- and many others, which you can find in this article
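As an illustration (not the exact code used later in this lab), here is a minimal sketch of median, mean, and mode imputation, assuming a toy DataFrame df with an 'Age' column:
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [22.0, None, 35.0, None, 28.0]})   # toy data with missing ages

df["Age_median"] = df["Age"].fillna(df["Age"].median())      # median imputation with plain pandas

# the same idea with scikit-learn (strategy can be "mean", "median" or "most_frequent")
imputer = SimpleImputer(strategy="median")
df["Age_sklearn"] = imputer.fit_transform(df[["Age"]]).ravel()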
Let's begin by plotting the missing values in our dataset to understand how they are distributed.
#print the age % of NaN values
19.87% of Age values are missing
#print the histogram of the age column
#print the median and the mean of the age column
mean: 29.70 median: 28.00
#print the cabin % of NaN values
77.10% of Cabin values are missing
#print the embarked distribution
Boarded passengers grouped by port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton): S 644 C 168 Q 77 Name: Embarked, dtype: int64
Define Train & Test data¶
We will make the following changes to the data:
- If "Age" is missing for a given row, we assign 28 (median age).
- If "Embarked" is missing for a given line, we assign "S" (most common embarked).
- We will ignore "Cabin" as a variable. There are too many missing values it wouldn't make sense to assign values to it.
#apply the preprocessing like above
#verify your results
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Embarked 0 dtype: int64
#print the histogram of the age column before and after processing
Feature engineering¶
Several variables in our dataset relate to whether or not the passenger travels with family (SibSp and Parch). For simplicity, we will combine these variables into a single categorical variable: whether this person was traveling alone or not. One possible sketch is shown below.
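One possible way to build this feature from the SibSp and Parch columns (same hypothetical train_df name as above):
# a passenger travels alone if they have no siblings/spouse (SibSp) and no parents/children (Parch) on board
train_df["TravelAlone"] = ((train_df["SibSp"] + train_df["Parch"]) == 0).astype(int)
train_df = train_df.drop(columns=["SibSp", "Parch"])         # no longer needed once combined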
#create the feature 'TravelAlone'
| | PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Embarked | TravelAlone
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | S | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | PC 17599 | 71.2833 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | S | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | S | 1 |
#use the get_dummies function to encode all the categorical features needed
| | PassengerId | Survived | Name | Age | Ticket | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | A/5 21171 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | PC 17599 | 71.2833 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | STON/O2. 3101282 | 7.9250 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 113803 | 53.1000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 373450 | 8.0500 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
#drop the columns that you don't need
| | Survived | Age | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_male
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 22.0 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 38.0 | 71.2833 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 26.0 | 7.9250 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 1 | 35.0 | 53.1000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 35.0 | 8.0500 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
Do the same with the test dataset¶
A good practice is to create new datasets or a copy for each new operation.
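A hedged sketch of this practice, assuming the raw test set was loaded into a DataFrame called test_data (hypothetical name):
test_df = test_data.copy()                                            # work on a copy so the raw test data stays untouched
test_df["Age"] = test_df["Age"].fillna(28)                            # same median as the training set
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())    # one Fare value is missing in the test set
test_df["Embarked"] = test_df["Embarked"].fillna("S")
test_df = test_df.drop(columns=["Cabin"])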
#print the null value of this dataset
PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 1 Cabin 327 Embarked 0 dtype: int64
#do the same thing for the test dataset
| | Age | Fare | TravelAlone | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Sex_male
---|---|---|---|---|---|---|---|---|---|---|
0 | 34.5 | 7.8292 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | 47.0 | 7.0000 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 62.0 | 9.6875 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
3 | 27.0 | 8.6625 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
4 | 22.0 | 12.2875 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
#print the age feature distribution and highlight the survived feature
#add the "IsMinor" variable to your data (a person is considered a minor if they are under 16)
Passenger Class¶
#show survivors by class
Embarked Port¶
#do the same for embarked feature
Traveling Alone vs. With Family¶
#do the same for the Traveling Alone feature we've created earlier
Gender Variable¶
#gender distribution of survivors
Logistic Regression with scikit-learn¶
We saw with the lasso method how important feature selection is.
For this lab, we will use the feature selection methods from sklearn. Find out about sklearn's feature selection methods and explain how they work.
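For example, recursive feature elimination (RFE) repeatedly fits the estimator and drops the weakest feature until the requested number remains. A hedged sketch, assuming X (a DataFrame of explanatory variables) and y (the Survived column) are defined as in the next cell:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=4)          # keep only the 4 strongest features
rfe.fit(X, y)
print("Selected features:", [col for col, keep in zip(X.columns, rfe.support_) if keep])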
#define x and y for the model
#instantiate a LogisticRegression estimator in sklearn
#create a RFE model and select 4 attributes
Selected features: ['Pclass_1', 'Pclass_2', 'Sex_male', 'IsMinor']
#create a RFE model and select 8 attributes
Selected features: ['Age', 'TravelAlone', 'Pclass_1', 'Pclass_2', 'Embarked_C', 'Embarked_S', 'Sex_male', 'IsMinor']
Feature ranking with recursive feature elimination and cross-validation¶
RFECV runs the RFE method in a cross-validation loop to find the optimal number of variables, then the features selected by the cross-validation are used to fit the logistic regression.
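A hedged sketch of RFECV, with the same hypothetical X and y as above; scoring='accuracy' follows the next cell's instructions, while cv=10 is an assumption:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(estimator=LogisticRegression(), cv=10, scoring="accuracy")   # cv=10 is an assumption
rfecv.fit(X, y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected features:", [col for col, keep in zip(X.columns, rfecv.support_) if keep])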
#instantiate an RFECV object (with scoring='accuracy') and do as stated above
Optimal number of features: 8 Selected features: ['Age', 'TravelAlone', 'Pclass_1', 'Pclass_2', 'Embarked_C', 'Embarked_S', 'Sex_male', 'IsMinor']
#plot the number of features vs. cross-validation scores
Go back to our model and evaluate metrics¶
The goal here is to recompute our model with the 8 features selected by the RFECV method.
#create the new dataframe and print the shape
((891, 8), (891,))
#split your data with a test size of 20% and a random_state=42
#instantiate a LogisticRegression estimator and fit it on your data
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)
#make a prediction on the test set
#make a prediction with the predict_proba function and display
# the accuracy of the model
# the cross-entropy loss
# the area under the ROC curve (AUC)
# explain what these metrics measure
Train/Test split results: LogisticRegression accuracy is 0.804 LogisticRegression log_loss is 0.433 LogisticRegression auc is 0.872
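A hedged sketch of how these three numbers can be computed, assuming the split produced X_train, X_test, y_train, y_test and that the fitted estimator is called logreg (hypothetical names):
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

y_pred = logreg.predict(X_test)                      # hard class predictions
y_proba = logreg.predict_proba(X_test)[:, 1]         # probability of class 1
print("accuracy:", accuracy_score(y_test, y_pred))   # share of correct predictions
print("log_loss:", log_loss(y_test, y_proba))        # cross-entropy between true labels and predicted probabilities
print("auc:", roc_auc_score(y_test, y_proba))        # area under the ROC curve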
ROC Curve and AUC Score¶
The ROC (receiver operating characteristic) curve allows you to visualize the performance of a binary classification model as its discrimination criterion varies (the probability threshold above which the model classifies an observation as "positive"). It plots the true positive rate (TPR) as a function of the false positive rate (FPR) for different threshold values; each point on the curve summarizes the confusion matrix obtained at one particular threshold.
The AUC (Area Under the Curve) is the area under the ROC curve. It is a measure of the performance of a binary classification model. The higher the AUC, the better the model is at distinguishing between the two classes. The AUC is therefore a criterion for assessing the quality of a model.
These metrics are important to evaluate the performance of our model and to compare it with other models.
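A minimal plotting sketch, reusing the hypothetical y_test and y_proba from the previous sketch:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_proba)    # one (FPR, TPR) point per threshold
plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, y_proba))
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()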
#display the ROC curve
#what do you notice?
Confusion Matrix¶
One of the quick and easy ways to measure your model's performance is through confusion matrices. The idea is to see the predictions your model got right as well as the false positives and false negatives. By dividing the number of correct predictions by the total number of predictions, you get the accuracy of your model.
A simple and relevant measure of your model's performance would be to compare the model's accuracy rate with the proportion of positives in the database. Indeed, the simplest model in a classification problem is to classify all individuals in the same class. In the case of this trivial model, the accuracy rate will be equal to the proportion occupied by the majority group in the data. Suppose we have a database that gives the results of the baccalaureate for a sample population. If the sample contains 70% of individuals who passed their baccalaureate, then if our model predicts that everyone will pass, it is right 70% of the time. Therefore, it is only worth building a more complex model if its accuracy can be higher than 70%.
False Positives - False Negatives¶
Since our model is based on probabilities, it can sometimes be wrong. False positives and false negatives represent the errors made by our classification model.
False Positive: Continuing with the above example, if our model categorizes person A as a "buyer" and in reality this person does not buy the product, then we have a false positive. The model predicted a positive outcome that did not occur.
False Negative: Conversely, if person B, whom the model predicted as a non-buyer, actually buys the product, it is a false negative. We predicted a negative outcome, but the positive outcome occurred.
🚧 Be cautious of false positives and ESPECIALLY false negatives 🚧
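A minimal sketch with sklearn, reusing the hypothetical y_test and y_pred from above:
from sklearn.metrics import confusion_matrix, classification_report

# rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))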
#print the confusion matrix and the classification report (optional)
Using the cross_val_score() function¶
The cross_val_score() function is a great way to test the accuracy of our model. It will split our data into 10 parts (folds), train the model on 9 of them, and test it on the remaining one. It does this 10 times, each time using a different part for testing, and then returns the list of 10 scores obtained.
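A hedged sketch of this, with the same hypothetical logreg, X and y as before:
from sklearn.model_selection import cross_val_score

for scoring in ["accuracy", "neg_log_loss", "roc_auc"]:        # 10-fold CV with three scoring functions
    scores = cross_val_score(logreg, X, y, cv=10, scoring=scoring)
    print(scoring, scores.mean())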
#do the same with a 10-fold cross-validation on your logistic regression
#use scoring accuracy, neg_log_loss, roc_auc as before
K-fold cross-validation results (10 folds): LogisticRegression average accuracy is 0.802 LogisticRegression average log_loss is 0.454 LogisticRegression average auc is 0.850