Regression Regularization Methods: Lasso & Ridge
In mathematics and statistics, and more specifically in machine learning, regularization refers to a process of adding information to a problem in order to prevent overfitting. This information generally takes the form of a penalty on the complexity of the model.
From a Bayesian point of view, the use of regularization amounts to imposing a prior distribution on the parameters of the model.
Before we begin, some guidelines.
Goals
Understand these two approaches by applying them to a practical case: the pricing of an apartment (for a change)!
🔥 Your mission, should you choose to accept it, is to complete the notebook by following the comments 🔥
The Lasso method
In statistics, the lasso is a method of shrinkage of regression coefficients developed by Robert Tibshirani in an article published in 1996 entitled Regression shrinkage and selection via the lasso.
The Lasso method is widely used in settings where the number of potentially explanatory variables is large compared to the number of available observations.
We seek to explain in a linear way a variable $Y$ , by $p$ potentially explanatory variables $X_i$. For that we make $n$ observations and we model the variable $Y$ in the following way: $$Y=X \beta + \varepsilon$$
The question now is to know which of the $p$ variables has the most weight in our explanation: this is precisely the object of the Lasso method.
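Concretely, the lasso adds an $\ell_1$ penalty on the coefficients to the least-squares criterion (the penalty strength $\lambda \geq 0$ corresponds to the alpha parameter used below): $$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$ The $\ell_1$ norm drives some coefficients exactly to zero, which is what performs the variable selection.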
For more details, I invite you to consult the article by Pierre Gaillard and Anisse Ismaili, The Lasso, or how to choose among a large number of variables with the help of few observations.
#import your libraries
import warnings
warnings.simplefilter("ignore")
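A minimal set of imports for the steps below might look like this (pandas, numpy, matplotlib, seaborn and scikit-learn are assumed to be the libraries in use):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score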
#load the houseData.csv dataset
 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 21 columns
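A possible way to obtain the preview above, assuming the file name houseData.csv from the comment:
df = pd.read_csv("houseData.csv")  # file name taken from the comment above
df.head()                          # first 5 rows of the 21 columns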
#print the shape of the dataset
(21613, 21)
#drop the non necessary columns
(21613, 17)
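The shape goes from (21613, 21) to (21613, 17), and the feature list printed further down no longer contains id, date, waterfront or view, so the four dropped columns are presumably those:
print(df.shape)                                             # (21613, 21)
df = df.drop(columns=["id", "date", "waterfront", "view"])  # assumed from the remaining columns listed below
print(df.shape)                                             # (21613, 17)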
#use the seaborn library to plot a pairplot of the dataset and see the effect of the 'bedrooms' feature
[seaborn pairplot figure]
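A sketch of the pairplot step; the exact variables plotted in the original are not shown, so restricting it to price versus bedrooms is an assumption:
sns.pairplot(df, x_vars=["bedrooms"], y_vars=["price"], height=6)
plt.show()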
# extract the values of the price (your target vector) into a new variable and display its size
size of the target vector : 21613
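A sketch of extracting the target (the name y is a convention, not given in the notebook):
y = df["price"].values                        # target vector
print("size of the target vector :", y.size)  # 21613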
#drop the price column from the dataset
#create a new dataframe with the columns you want to use to predict the price
#print the name of the columns of the new dataframe
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
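A sketch of building the feature matrix (the name X is an assumption):
X = df.drop(columns=["price"])  # the 16 remaining predictors
print(list(X.columns))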
#create a Lasso model with the alpha parameter equal to 0.2
#what is the effect of the alpha parameter on the model?
#do a fit on your dataset
Lasso(alpha=0.2, copy_X=True, fit_intercept=True, max_iter=1000, normalize=True, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
#print the coef_ of your model
[-4.60371681e+04 4.27861001e+04 2.09910162e+02 1.55917506e-01 1.10125833e+04 2.91820514e+04 1.01162856e+05 -2.23999188e+01 -3.02780899e+01 -2.82963779e+03 3.48449523e+01 -4.76908000e+02 5.55627700e+05 -2.51484241e+05 4.00319113e+01 -3.23987600e-01]
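A sketch that reproduces the estimator shown above. The repr contains normalize=True, an option available in the older scikit-learn versions this notebook was written for; in recent releases the features would have to be scaled explicitly (e.g. with StandardScaler) instead:
lasso = Lasso(alpha=0.2, normalize=True)  # normalize=True exists only in older scikit-learn releases
lasso.fit(X, y)
print(lasso.coef_)                        # one coefficient per feature, in the order of X.columns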
#plot the coef_ of your model
#what can you say about the coef_ of your model?
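One possible way to plot the coefficients against the feature names (purely illustrative; the helper name is an assumption):
def plot_coefficients(model, columns):
    plt.figure(figsize=(14, 6))
    plt.plot(range(len(columns)), model.coef_)
    plt.xticks(range(len(columns)), columns, rotation=60)
    plt.ylabel("coefficient value")
    plt.show()

plot_coefficients(lasso, X.columns)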
#try the same thing with a different alpha parameter
array([-2.21651222e+03, 3.76966686e+03, 1.70151388e+02, 0.00000000e+00, 0.00000000e+00, 1.01307729e+04, 1.02366456e+05, 0.00000000e+00, 0.00000000e+00, -1.94583213e+03, 1.53104933e+01, -0.00000000e+00, 4.70758627e+05, -7.68871412e+04, 2.19537362e+01, -0.00000000e+00])
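The alpha that produced the array above is not stated in the notebook; any noticeably larger value illustrates the point, for instance:
lasso = Lasso(alpha=10, normalize=True)  # alpha=10 is an arbitrary larger value, not necessarily the notebook's
lasso.fit(X, y)
lasso.coef_                              # several coefficients are now exactly zero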
#print the curve
#what can you say about this new curve?
#do the same with an alpha parameter equal to 1000 this time
array([ 0. , 0. , 93.61923199, 0. , 0. , 0. , 27594.4045107 , 0. , 0. , -0. , 0. , 0. , 0. , -0. , 0. , 0. ])
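A sketch for the strong penalty requested above; in the output only sqft_living and grade keep a non-zero coefficient:
lasso = Lasso(alpha=1000, normalize=True)
lasso.fit(X, y)
plot_coefficients(lasso, X.columns)  # reuses the plotting helper sketched earlier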
#plot the curve as well
Cross-validation
Cross-validation is, in machine learning, a method for estimating the reliability of a model based on a sampling technique.
Find out what cross-validation is and apply it to your dataset; display your average score over 5 folds.
#print your cross-validation scores and the Average 5-Fold CV Score
[0.64113769 0.65607482 0.6615889 0.67507893 0.65552482] Average 5-Fold CV Score: 0.6578810345485225
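A sketch of the 5-fold cross-validation; reusing the alpha=0.2 lasso here is an assumption:
cv_scores = cross_val_score(Lasso(alpha=0.2, normalize=True), X, y, cv=5)
print(cv_scores)
print("Average 5-Fold CV Score:", np.mean(cv_scores))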
Regularized Ridge Regression
Regularization consists in introducing a notion of penalty into the way we measure the error (for us, the sum of squared errors), which therefore constrains the estimated parameters. This regularization term must be tuned in order to obtain a better-quality model.
Tikhonov regularization, better known under the name of "ridge regression", is a method that consists in adding a constraint on the coefficients during modeling in order to control the magnitude of their values ("to prevent them from going in all directions").
We will therefore try to apply Ridge to our data.
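For reference, ridge regression replaces the $\ell_1$ penalty of the lasso by the squared $\ell_2$ norm of the coefficients, which shrinks them without setting them exactly to zero: $$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$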
#define a range of 50 values from 10^-4 to 1 with the logspace function
#what does numpy's logspace function do?
array([1.00000000e-04, 1.20679264e-04, 1.45634848e-04, 1.75751062e-04, 2.12095089e-04, 2.55954792e-04, 3.08884360e-04, 3.72759372e-04, 4.49843267e-04, 5.42867544e-04, 6.55128557e-04, 7.90604321e-04, 9.54095476e-04, 1.15139540e-03, 1.38949549e-03, 1.67683294e-03, 2.02358965e-03, 2.44205309e-03, 2.94705170e-03, 3.55648031e-03, 4.29193426e-03, 5.17947468e-03, 6.25055193e-03, 7.54312006e-03, 9.10298178e-03, 1.09854114e-02, 1.32571137e-02, 1.59985872e-02, 1.93069773e-02, 2.32995181e-02, 2.81176870e-02, 3.39322177e-02, 4.09491506e-02, 4.94171336e-02, 5.96362332e-02, 7.19685673e-02, 8.68511374e-02, 1.04811313e-01, 1.26485522e-01, 1.52641797e-01, 1.84206997e-01, 2.22299648e-01, 2.68269580e-01, 3.23745754e-01, 3.90693994e-01, 4.71486636e-01, 5.68986603e-01, 6.86648845e-01, 8.28642773e-01, 1.00000000e+00])
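np.logspace(start, stop, num) returns num values evenly spaced on a logarithmic scale between 10**start and 10**stop, so the array above can be produced with:
alpha_space = np.logspace(-4, 0, 50)  # 50 values from 1e-4 to 1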
#create 2 arrays to store the results
# an array for the average cross-validation score (10 folds)
# an array for the mean variance
#instantiate a normalized 'Ridge estimator'
#What is normalization for?
#loop over alphas
# --> goal: see the effect of the alpha parameter on the accuracy
#
# HINT
#
# for each alpha, run a 10-fold cross-validation
# add the results to your tables
#what do you notice ?
mean score : 0.6516384157603821 mean variance : 0.01739819241056429
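A sketch of the loop described by the comments above (the array names, the reuse of a single Ridge instance and the use of np.std as the measure of spread are assumptions; as before, normalize=True only exists in older scikit-learn releases):
ridge = Ridge(normalize=True)   # older scikit-learn; otherwise scale X beforehand
ridge_scores = []
ridge_scores_std = []
for alpha in alpha_space:
    ridge.alpha = alpha                       # change the penalty strength
    cv = cross_val_score(ridge, X, y, cv=10)  # 10-fold cross-validation for this alpha
    ridge_scores.append(np.mean(cv))
    ridge_scores_std.append(np.std(cv))
print("mean score :", np.mean(ridge_scores), "mean variance :", np.mean(ridge_scores_std))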
Use the function below, explain it and comment on the result:
def display_plot(cv_scores, cv_scores_std):
    # Plot the mean CV score as a function of alpha, on a log-scaled x axis
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)
    # Standard error of the mean over the 10 folds
    std_error = cv_scores_std / np.sqrt(10)
    # Shaded band of +/- one standard error around the mean score
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    # Horizontal dashed line at the best (maximum) mean CV score
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
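With the arrays built in the loop above (converted to NumPy arrays so that the element-wise +/- arithmetic inside the function works), a call could look like:
display_plot(np.array(ridge_scores), np.array(ridge_scores_std))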
#what do you see on this graph ?