Regression Regularization Methods: Lasso & Ridge
In mathematics and statistics, and more specifically in machine learning, regularization refers to a process of adding information to a problem in order to prevent overfitting. This information generally takes the form of a penalty on the complexity of the model.
From a Bayesian point of view, the use of regularization amounts to imposing a prior distribution on the parameters of the model.
Before we begin, some guidelines.
Goals
Understand these two approaches by applying them to a practical case: the pricing of an apartment (for a change)!
🔥 Your mission, should you choose to accept it, is to complete the notebook by following the comments 🔥
The Lasso method
In statistics, the lasso is a method of shrinkage of regression coefficients developed by Robert Tibshirani in an article published in 1996 entitled Regression shrinkage and selection via the lasso.
The Lasso method is widely used in settings where the number of potentially explanatory variables is large compared to the number of available observations.
We seek to explain in a linear way a variable $Y$ , by $p$ potentially explanatory variables $X_i$. For that we make $n$ observations and we model the variable $Y$ in the following way: $$Y=X \beta + \varepsilon$$
The question now is to know which of the $p$ variables has the most weight in our explanation: this is precisely the object of the Lasso method.
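Concretely, the lasso adds an $\ell_1$ penalty on the coefficients to the least-squares criterion (the penalty strength $\lambda \geq 0$ corresponds to the alpha parameter used below): $$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$ The $\ell_1$ norm drives some coefficients exactly to zero, which is what performs the variable selection.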
For more details, I invite you to consult the article by Pierre Gaillard and Anisse Ismaili, The Lasso, or how to choose among a large number of variables with the help of few observations.
#import your libraries
import warnings
warnings.simplefilter("ignore")
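A minimal set of imports for the steps below might look like this (pandas, numpy, matplotlib, seaborn and scikit-learn are assumed to be the libraries in use):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score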
#load the houseData.csv dataset
 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 21 columns
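A possible way to obtain the preview above, assuming the file name houseData.csv from the comment:
df = pd.read_csv("houseData.csv")  # file name taken from the comment above
df.head()                          # first 5 rows of the 21 columns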
#print the shape of the dataset
(21613, 21)
#drop the non necessary columns
(21613, 17)
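The shape goes from (21613, 21) to (21613, 17), and the feature list printed further down no longer contains id, date, waterfront or view, so the four dropped columns are presumably those:
print(df.shape)                                             # (21613, 21)
df = df.drop(columns=["id", "date", "waterfront", "view"])  # assumed from the remaining columns listed below
print(df.shape)                                             # (21613, 17)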
#use the seaborn library to plot a pairplot of the dataset and see the effect of the 'bedrooms' feature
[seaborn pairplot figure]
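A sketch of the pairplot step; the exact variables plotted in the original are not shown, so restricting it to price versus bedrooms is an assumption:
sns.pairplot(df, x_vars=["bedrooms"], y_vars=["price"], height=6)
plt.show()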
# extract the values of the price (your target vector) into a new variable and display its size
size of the target vector : 21613
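A sketch of extracting the target (the name y is a convention, not given in the notebook):
y = df["price"].values                        # target vector
print("size of the target vector :", y.size)  # 21613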
#drop the price column from the dataset
#create a new dataframe with the columns you want to use to predict the price
#print the name of the columns of the new dataframe
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
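A sketch of building the feature matrix (the name X is an assumption):
X = df.drop(columns=["price"])  # the 16 remaining predictors
print(list(X.columns))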
#create a Lasso model with the alpha parameter equal to 0.2
#what is the effect of the alpha parameter on the model?
#do a fit on your dataset
Lasso(alpha=0.2, copy_X=True, fit_intercept=True, max_iter=1000, normalize=True, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
#print the coef_ of your model
[-4.60371681e+04 4.27861001e+04 2.09910162e+02 1.55917506e-01 1.10125833e+04 2.91820514e+04 1.01162856e+05 -2.23999188e+01 -3.02780899e+01 -2.82963779e+03 3.48449523e+01 -4.76908000e+02 5.55627700e+05 -2.51484241e+05 4.00319113e+01 -3.23987600e-01]
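A sketch that reproduces the estimator shown above. The repr contains normalize=True, an option available in the older scikit-learn versions this notebook was written for; in recent releases the features would have to be scaled explicitly (e.g. with StandardScaler) instead:
lasso = Lasso(alpha=0.2, normalize=True)  # normalize=True exists only in older scikit-learn releases
lasso.fit(X, y)
print(lasso.coef_)                        # one coefficient per feature, in the order of X.columns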
#plot the coef_ of your model
#what can you say about the coef_ of your model?
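One possible way to plot the coefficients against the feature names (purely illustrative; the helper name is an assumption):
def plot_coefficients(model, columns):
    plt.figure(figsize=(14, 6))
    plt.plot(range(len(columns)), model.coef_)
    plt.xticks(range(len(columns)), columns, rotation=60)
    plt.ylabel("coefficient value")
    plt.show()

plot_coefficients(lasso, X.columns)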
#try the same thing with a different alpha parameter
array([-2.21651222e+03, 3.76966686e+03, 1.70151388e+02, 0.00000000e+00, 0.00000000e+00, 1.01307729e+04, 1.02366456e+05, 0.00000000e+00, 0.00000000e+00, -1.94583213e+03, 1.53104933e+01, -0.00000000e+00, 4.70758627e+05, -7.68871412e+04, 2.19537362e+01, -0.00000000e+00])
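The alpha that produced the array above is not stated in the notebook; any noticeably larger value illustrates the point, for instance:
lasso = Lasso(alpha=10, normalize=True)  # alpha=10 is an arbitrary larger value, not necessarily the notebook's
lasso.fit(X, y)
lasso.coef_                              # several coefficients are now exactly zero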
#print the curve
#what can you say about this new curve?
#do the same with an alpha parameter equal to 1000 this time
array([ 0. , 0. , 93.61923199, 0. , 0. , 0. , 27594.4045107 , 0. , 0. , -0. , 0. , 0. , 0. , -0. , 0. , 0. ])
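A sketch for the strong penalty requested above; in the output only sqft_living and grade keep a non-zero coefficient:
lasso = Lasso(alpha=1000, normalize=True)
lasso.fit(X, y)
plot_coefficients(lasso, X.columns)  # reuses the plotting helper sketched earlier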
#plot the curve as well
Cross-validation
Cross-validation is, in machine learning, a method for estimating the reliability of a model based on a sampling technique.
Find out what cross-validation is and apply it to your dataset; display your average score over 5 folds.
#print your cross-validation scores and the Average 5-Fold CV Score
[0.64113769 0.65607482 0.6615889 0.67507893 0.65552482] Average 5-Fold CV Score: 0.6578810345485225
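A sketch of the 5-fold cross-validation; reusing the alpha=0.2 lasso here is an assumption:
cv_scores = cross_val_score(Lasso(alpha=0.2, normalize=True), X, y, cv=5)
print(cv_scores)
print("Average 5-Fold CV Score:", np.mean(cv_scores))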
Regularized Ridge Regression
Regularization consists in introducing a notion of penalty into the way we measure the error (for us, the sum of squared errors), which therefore constrains the estimated parameters. This regularization term must be tuned in order to obtain a better-quality model.
Tikhonov regularization, better known under the name of "ridge regression", is a method that consists in adding a constraint on the coefficients during modeling in order to control the magnitude of their values ("to prevent them from going in all directions").
We will therefore try to apply Ridge to our data.
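For reference, ridge regression replaces the $\ell_1$ penalty of the lasso by the squared $\ell_2$ norm of the coefficients, which shrinks them without setting them exactly to zero: $$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$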
#define a range of 50 values from 10^-4 to 1 with the logspace function
#what does numpy's logspace function do?
array([1.00000000e-04, 1.20679264e-04, 1.45634848e-04, 1.75751062e-04, 2.12095089e-04, 2.55954792e-04, 3.08884360e-04, 3.72759372e-04, 4.49843267e-04, 5.42867544e-04, 6.55128557e-04, 7.90604321e-04, 9.54095476e-04, 1.15139540e-03, 1.38949549e-03, 1.67683294e-03, 2.02358965e-03, 2.44205309e-03, 2.94705170e-03, 3.55648031e-03, 4.29193426e-03, 5.17947468e-03, 6.25055193e-03, 7.54312006e-03, 9.10298178e-03, 1.09854114e-02, 1.32571137e-02, 1.59985872e-02, 1.93069773e-02, 2.32995181e-02, 2.81176870e-02, 3.39322177e-02, 4.09491506e-02, 4.94171336e-02, 5.96362332e-02, 7.19685673e-02, 8.68511374e-02, 1.04811313e-01, 1.26485522e-01, 1.52641797e-01, 1.84206997e-01, 2.22299648e-01, 2.68269580e-01, 3.23745754e-01, 3.90693994e-01, 4.71486636e-01, 5.68986603e-01, 6.86648845e-01, 8.28642773e-01, 1.00000000e+00])
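np.logspace(start, stop, num) returns num values evenly spaced on a logarithmic scale between 10**start and 10**stop, so the array above can be produced with:
alpha_space = np.logspace(-4, 0, 50)  # 50 values from 1e-4 to 1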
#create 2 arrays to store the results
# an array for the average cross-validation score (10 folds)
# an array for the mean variance
#instantiate a normalized 'Ridge estimator'
#What is normalization for?
#loop over alphas
# --> goal: see the effect of the alpha parameter on the accuracy
#
# HINT
#
# for each alpha, run a 10-fold cross-validation
# add the results to your tables
#what do you notice ?
mean score : 0.6516384157603821 mean variance : 0.01739819241056429
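A sketch of the loop described by the comments above (the array names, the reuse of a single Ridge instance and the use of np.std as the measure of spread are assumptions; as before, normalize=True only exists in older scikit-learn releases):
ridge = Ridge(normalize=True)   # older scikit-learn; otherwise scale X beforehand
ridge_scores = []
ridge_scores_std = []
for alpha in alpha_space:
    ridge.alpha = alpha                       # change the penalty strength
    cv = cross_val_score(ridge, X, y, cv=10)  # 10-fold cross-validation for this alpha
    ridge_scores.append(np.mean(cv))
    ridge_scores_std.append(np.std(cv))
print("mean score :", np.mean(ridge_scores), "mean variance :", np.mean(ridge_scores_std))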
Use the function below, explain it and comment on the result:
def display_plot(cv_scores, cv_scores_std):
    # Plot the mean CV score as a function of alpha, on a log-scaled x axis
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)
    # Standard error of the mean over the 10 folds
    std_error = cv_scores_std / np.sqrt(10)
    # Shaded band of +/- one standard error around the mean score
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    # Horizontal dashed line at the best (maximum) mean CV score
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
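With the arrays built in the loop above (converted to NumPy arrays so that the element-wise +/- arithmetic inside the function works), a call could look like:
display_plot(np.array(ridge_scores), np.array(ridge_scores_std))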
#what do you see on this graph ?