Simple linear regression: a salary prediction example
A simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
Why linear regression is so popular in data science
- Simplicity: It's straightforward and easy to understand. Even if more complex models are required, a linear regression analysis is often performed as a first step to understand the relationships between variables.
- Interpretability: The coefficients of the model are interpretable. They represent the change in the dependent variable for a one-unit change in the predictor, holding other predictors constant.
- Predictive Power: Despite its simplicity, it can be quite powerful. In situations where the relationship between variables is approximately linear, it can provide a highly accurate predictive model.
- Basis for Other Methods: Many advanced methods in statistics and machine learning can be seen as extensions or generalizations of linear regression.
Real-world examples of using linear regression
- Real Estate Pricing: Predicting house prices based on features like size, location, number of rooms, etc. Here, the price is the dependent variable, and the features are the independent variables.
- Sales Forecasting: Predicting future sales based on advertising spend. The sales are the dependent variable, and the advertising spend is the independent variable.
- Risk Assessment in Finance: Predicting the risk of a loan or credit based on the applicant's financial history.
- Predicting Exam Scores: Predicting a student's future exam score based on the number of hours they study.
Some technical details
In the simple linear model (a single explanatory variable), the response variable is assumed to follow the model:
$$y_i=\beta_0 + \beta_1 x_i + \varepsilon_i$$
Note the resemblance to the affine function presented above. The difference is the random term $\varepsilon_i$, called the noise. For the model to be well posed, we make the following assumptions:
$$(\mathcal{H}): \left\{\begin{matrix} \mathbb{E}[\varepsilon_i]=0\\ \text{Cov}(\varepsilon_i, \varepsilon_j)=\delta_{ij} \sigma^2 \end{matrix}\right.$$
That is, the noise has zero mean, a constant variance $\sigma^2$, and is uncorrelated across observations ($\delta_{ij}$ equals 1 if $i=j$ and 0 otherwise).
The different elements involved are:
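- $\beta_0$: the intercept (the value of the line at $x=0$)
- $\beta_1$: the slope of the line
- $x_i$: the value of the explanatory variable for the $i$-th observation
- $y_i$: the value of the response variable for the $i$-th observation (here, the $i$-th salary)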
- $\varepsilon_i$: the random noise linked to the $i$-th observation
The least-squares estimates of $\beta_0$ and $\beta_1$ can be computed with the following closed-form formulas, where $\bar{x}$ and $\bar{y}$ are the sample means:
$$\hat{\beta}_1=\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
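As a minimal sketch, the closed-form formulas can be evaluated directly with NumPy. The four data points and the variable names (`beta_0`, `beta_1`) are illustrative assumptions chosen for this sketch.

```python
import numpy as np

# Illustrative toy data (assumed values): years of experience and salary
x = np.array([0.0, 3.0, 6.0, 8.0])
y = np.array([35.0, 45.0, 65.0, 80.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: the fitted line passes through the point (x_bar, y_bar)
beta_0 = y_bar - beta_1 * x_bar

print(f"slope = {beta_1:.3f}, intercept = {beta_0:.3f}")
```

A fit with scikit-learn's `LinearRegression` on the same points should give the same two numbers, up to floating-point error.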
Simple example in Python
Let's implement a simple example with a few hand-picked data points. We will use the scikit-learn library to do this. Let's take the salary prediction example: we will try to predict the salary of a developer based on their years of experience.
# Data generation: years of experience (X) and salary (Y)
import numpy as np
import matplotlib.pyplot as plt

X = np.array([0, 3, 6, 8])
Y = np.array([35, 45, 65, 80])

# Scatter plot of the raw data
plt.plot(X, Y, '*')
plt.xlabel("Years of experience | our explanatory variable 'x'")
plt.ylabel("Salary | target variable 'y'")
plt.title("Scatter plot")
plt.savefig("./intuitive_scatter.png")
def reg_plot(x, y, m):
    """Plot the data points and the line predicted by the fitted model m."""
    plt.scatter(x, y, c='blue', label="our data")
    # scikit-learn expects a 2D input, hence the reshape
    plt.plot(x, m.predict(x.reshape(-1, 1)), color='red', label="prediction line")
    plt.xlabel("explanatory variable 'x'")
    plt.ylabel("target variable 'y'")
    plt.legend()
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X.reshape(-1, 1),Y)
reg_plot(X,Y,linear_model)
plt.savefig("./approche_intuitive.png")
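Once the model is fitted, it can be used to predict the salary for experience values that are not in the data. The sketch below refits the same four points and predicts for two hypothetical inputs (5 and 10 years); the names `model`, `new_experience`, and `predicted_salary` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same four toy points as in the example above
X = np.array([0, 3, 6, 8])
Y = np.array([35, 45, 65, 80])

model = LinearRegression().fit(X.reshape(-1, 1), Y)

# Hypothetical new inputs: 5 and 10 years of experience, as a 2D column
new_experience = np.array([5, 10]).reshape(-1, 1)
predicted_salary = model.predict(new_experience)

# coef_ holds the slope(s), intercept_ holds the intercept
print(f"slope = {model.coef_[0]:.3f}, intercept = {model.intercept_:.3f}")
print(predicted_salary)
```

Note that 10 years lies outside the observed range (0 to 8), so that prediction is an extrapolation and should be read with more caution than the one for 5 years.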
Example with more data
Let's generate noisy linear data for our example with the numpy library, using np.arange() for the explanatory variable and np.random.uniform() for the noise:
import numpy as np
x=np.arange(75)
delta = np.random.uniform(-10,10, size=(75,))
y = 0.4 * x +3 + delta
Quick visualization of the data with the plot() function of the matplotlib library:
plt.plot(x,y,"*")
plt.xlabel("explanatory variable 'x'")
plt.ylabel("target variable 'y'")
plt.title("Scatter plot")
plt.savefig("./intuitive_scatter_bis.png")
Using the scikit-learn library
We import the LinearRegression estimator from the scikit-learn library; it implements the regression algorithm as a class with fit() and predict() methods.
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
# reshape(-1, 1) converts the 1D array into a 2D column array,
# which LinearRegression.fit() requires (see the official documentation)
linear_model.fit(x.reshape(-1, 1),y)
reg_plot(x,y,linear_model)
plt.savefig("./prediction.png")
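Because the data were generated from a known line (slope 0.4, intercept 3), the fitted coefficients can be compared with the true ones. The sketch below regenerates the data with a fixed seed, which is an assumption added here for reproducibility (the original cell uses the unseeded `np.random.uniform`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fixed seed: an assumption for reproducibility, not part of the original cell
rng = np.random.default_rng(0)

x = np.arange(75)
delta = rng.uniform(-10, 10, size=75)   # uniform noise in [-10, 10)
y = 0.4 * x + 3 + delta

model = LinearRegression().fit(x.reshape(-1, 1), y)

print(f"estimated slope:     {model.coef_[0]:.3f} (true value 0.4)")
print(f"estimated intercept: {model.intercept_:.3f} (true value 3)")
```

The estimates will not match the true values exactly because of the noise, but with 75 points the slope should land close to 0.4; the intercept is typically estimated less precisely.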
Practical example: predict the salary based on years of experience
Import the pandas library, and use this dataset: https://www.kaggle.com/datasets/rohankayan/years-of-experience-and-salary-dataset
First, let's display the first rows of the DataFrame with head():
import pandas as pd
df = pd.read_csv("../data/Salary_Data.csv", sep=',')
df.head()
| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343.0 |
| 1 | 1.3 | 46205.0 |
| 2 | 1.5 | 37731.0 |
| 3 | 2.0 | 43525.0 |
| 4 | 2.2 | 39891.0 |
We select the variable to predict and the explanatory variable by passing a list of column names to the DataFrame, as follows:
Note: in our case the dataset has only two columns and a single explanatory variable, but in general there can be several; we will see that in the next section on multiple regression.
df=df[["YearsExperience","Salary"]]
We will now display the scatter plot to see whether it makes sense to relate these two variables:
X=df.YearsExperience
Y=df.Salary
plt.plot(X,Y,'*')
plt.xlabel("Years of experience | our explanatory variable 'x'")
plt.ylabel("Salary | Target variable 'y'")
plt.title("Years of experience vs Salary")
We now import the LinearRegression estimator to make a fit on our data.
🚧 Be careful not to forget the .reshape(-1, 1) method: the explanatory variable is one-dimensional, while scikit-learn expects a 2D array of shape (n_samples, n_features) 🚧
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(np.array(X).reshape(-1, 1),np.array(Y))
Finally, we use the reg_plot function to display our regression line:
reg_plot(np.array(X),np.array(Y),linear_model)
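The reshape step used in the fit above can be checked by looking at array shapes. This is a small sketch on a hypothetical three-row frame (`df_demo`, built from the first rows shown by `head()`), not on the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with the same column names as the dataset
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5],
    "Salary": [39343.0, 46205.0, 37731.0],
})

X_1d = df_demo["YearsExperience"].to_numpy()  # 1D array: one value per row
X_2d = X_1d.reshape(-1, 1)                    # 2D column: (n_samples, 1)
X_frame = df_demo[["YearsExperience"]]        # double brackets keep a 2D DataFrame

print(X_1d.shape, X_2d.shape, X_frame.shape)  # (3,) (3, 1) (3, 1)
```

Selecting the column with double brackets is one way to avoid the explicit reshape, since the result is already two-dimensional.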
Then we can compute evaluation metrics for our model:
from sklearn.metrics import mean_squared_error, r2_score
Y_pred = linear_model.predict(np.array(X).reshape(-1, 1))
print("Mean squared error: %.2f"
% mean_squared_error(Y, Y_pred))
print("Root mean squared error: %.2f"
% np.sqrt(mean_squared_error(Y, Y_pred)))
print("R square: %.2f"% r2_score(Y, Y_pred))
Mean squared error: 17481710.59
Root mean squared error: 4181.11
R square: 0.49
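To make these metrics concrete, here is a minimal sketch that computes them by hand and checks the result against scikit-learn. The observed and predicted values are assumed toy numbers, not the dataset's. The definitions used are MSE $=\frac{1}{n}\sum_i (y_i-\hat{y}_i)^2$, RMSE $=\sqrt{\text{MSE}}$ and $R^2 = 1 - \frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}$.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted values (assumed, for illustration only)
y_true = np.array([35.0, 45.0, 65.0, 80.0])
y_hat = np.array([32.0, 49.0, 66.0, 78.0])

residuals = y_true - y_hat

# Mean squared error and its square root (same unit as y)
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R^2: share of the variance of y explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}, R^2 = {r2:.3f}")

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

The RMSE is often the easiest of the three to interpret because it is expressed in the same unit as the target variable.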