Polynomial Regression¶
Introduction¶
Polynomial regression is a form of regression analysis, like multiple linear regression. It models the relationship between the independent variable $x$ and the dependent variable $y$ as an $n^{th}$-degree polynomial in $x$, which allows it to fit nonlinear relationships.
Despite the curved fit, it is considered a special case of multiple linear regression, because the model is linear in its coefficients. In scikit-learn, polynomial regression is implemented by combining LinearRegression with the PolynomialFeatures transformer from the preprocessing module.
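As a quick sketch, the two steps can be chained in a single estimator (the examples later in this notebook apply them manually, which is equivalent):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# A degree-2 polynomial regression is just a linear regression
# fitted on polynomially expanded features
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())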
Intuition¶
Polynomial regression allows for the modeling of non-linear relationships between the explanatory variables and the target variable. For instance, in a linear regression model for predicting the sales of a fast-food restaurant, the explanatory variables could be "Fries", "Tacos", and "Kebab", used to predict the explained variable "Sales". In a polynomial model of degree 2, the squares of these features and their pairwise products (e.g. Fries^2, Fries × Tacos) are added as extra features, allowing the model to capture non-linearity in the problem.
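To make the expansion concrete, here is a small sketch (the numbers are made up for illustration, and get_feature_names_out assumes scikit-learn >= 1.0) of the features a degree-2 PolynomialFeatures transform generates for these three columns:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Two made-up rows of daily sales for Fries, Tacos, Kebab
X_demo = np.array([[10, 5, 3],
                   [8, 7, 2]])
poly_demo = PolynomialFeatures(degree=2).fit(X_demo)
print(poly_demo.get_feature_names_out(["Fries", "Tacos", "Kebab"]))
# ['1' 'Fries' 'Tacos' 'Kebab' 'Fries^2' 'Fries Tacos' 'Fries Kebab'
#  'Tacos^2' 'Tacos Kebab' 'Kebab^2']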
However, the main challenge is choosing the degree of the polynomial: as the order of the model increases, so does its complexity, making it prone to overfitting and harder to interpret. The Akaike Information Criterion (AIC) can be used to strike a balance between overfitting and underfitting by considering both the model's performance and its complexity.
Mathematical Background¶
Linear vs. Polynomial Regression: Linear regression models the relationship between two variables as a straight line. Polynomial regression, on the other hand, can model curved relationships.
Equation of Polynomial Regression: For a polynomial of degree $n$:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$
where $y$ is the dependent variable, $x$ is the independent variable, and $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients, with $\beta_i$ the coefficient of the $i^{th}$ power of $x$.
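Crucially, the model is linear in the coefficients $\beta_i$, which is why it remains a linear regression problem: stacking the powers of $x$ as columns of a design (Vandermonde) matrix reduces the fit to ordinary least squares. A minimal NumPy sketch with a known quadratic:
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 + 2*x + 3*x**2 + rng.normal(scale=0.1, size=x.size)  # known coefficients
X_design = np.vander(x, N=3, increasing=True)  # columns: [1, x, x^2]
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # recovers approximately [1, 2, 3]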
Akaike Information Criterion (AIC)¶
The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. It is a trade-off between the goodness of fit of the model and the complexity of the model. The AIC is calculated using the following formula:
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
where $k$ is the number of parameters in the model and $\hat{L}$ is the maximum value of the likelihood function for the model.
The AIC is a relative measure of model quality, so it can be used to compare different models for the same data. The model with the lowest AIC is considered the best model for the data. The rationale behind AIC is to penalize the addition of unnecessary parameters, which can lead to overfitting. By doing so, it strikes a balance between model complexity and goodness of fit.
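As a sketch of how this plays out in practice (assuming Gaussian errors, so the maximized log-likelihood can be written in terms of the in-sample MSE, as in the code later in this notebook), one can sweep over candidate degrees and keep the one with the lowest AIC:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def best_degree_by_aic(X, y, max_degree=10):
    """Return the degree in 1..max_degree minimizing the Gaussian AIC."""
    n = len(y)
    aics = {}
    for degree in range(1, max_degree + 1):
        X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
        model = LinearRegression().fit(X_poly, y)
        mse = np.mean((model.predict(X_poly) - y) ** 2)
        log_likelihood = -n / 2 * np.log(2 * np.pi * mse) - n / 2
        k = degree + 1  # beta_0 .. beta_degree, for a single input feature
        aics[degree] = 2 * k - 2 * log_likelihood
    return min(aics, key=aics.get)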
Implementing Polynomial Regression with dummy data¶
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import warnings
warnings.simplefilter("ignore")
# Generate sample data
X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
# Fit the model
model = LinearRegression()
model.fit(X, y)
# Calculate AIC
n = len(y)
k = len(model.coef_) + 1 # Adding 1 for the intercept
mse = np.mean((model.predict(X) - y)**2)
# Maximized log-likelihood under Gaussian errors, where the MLE of the
# error variance equals the in-sample MSE
log_likelihood = -n/2 * np.log(2 * np.pi * mse) - n/2
aic = 2*k - 2*log_likelihood
print(f"AIC for Linear Regression: {aic}")
AIC for Linear Regression: -164.0186979179379
Now measure the AIC for a high-degree polynomial regression model on the same data and compare it with the linear model's.
from sklearn.preprocessing import PolynomialFeatures
# Generate polynomial features
poly = PolynomialFeatures(degree=12)
X_poly = poly.fit_transform(X)
# Fit the model
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
# Calculate AIC for polynomial regression
# With include_bias=True (the default), the bias column is one of the
# polynomial features, so len(coef_) already counts the intercept term
k_poly = len(model_poly.coef_)
mse_poly = np.mean((model_poly.predict(X_poly) - y)**2)
log_likelihood_poly = -n/2 * np.log(2 * np.pi * mse_poly) - n/2
aic_poly = 2*k_poly - 2*log_likelihood_poly
print(f"AIC for Polynomial Regression: {aic_poly}")
AIC for Polynomial Regression: -157.326451973431
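Note that the linear model's AIC (about -164) is lower than the degree-12 polynomial's (about -157). Since the models are nested, the degree-12 fit matches the training data at least as well, but the 2k penalty for its extra coefficients outweighs that gain, so AIC rightly favors the simpler model here.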
Implementing Polynomial Regression with real data¶
Let's use the "Auto MPG" dataset from the UCI Machine Learning Repository. It contains information about various car attributes, such as horsepower, weight, and displacement; the goal is to predict the car's miles per gallon (MPG), a continuous variable.
Auto MPG Dataset: https://archive.ics.uci.edu/dataset/9/auto+mpg
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# Load the dataset
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_csv('../data/auto+mpg/auto-mpg.data', sep=r'\s+', header=None)
data.columns = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin', 'Car Name']
# Convert 'Horsepower' to numeric and handle missing values
data['Horsepower'] = pd.to_numeric(data['Horsepower'], errors='coerce')
data.dropna(inplace=True)
# Split data
X = data[['Horsepower']]
y = data['MPG']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Polynomial feature transformation
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Train the model
model = LinearRegression()
model.fit(X_train_poly, y_train)
# Predict
y_pred = model.predict(X_test_poly)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Visualization
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
# Generate a sequence of values for 'Horsepower' and predict 'MPG' for plotting the curve
x_sequence = np.linspace(X['Horsepower'].min(), X['Horsepower'].max(), 300).reshape(-1, 1)
y_sequence_pred = model.predict(poly.transform(x_sequence))
plt.plot(x_sequence, y_sequence_pred, color='red', label='Polynomial Regression Fit')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression Fit')
plt.legend()
plt.show()
Mean Squared Error: 18.416967796017992
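Before settling on degree 2, it is worth sketching how the held-out error behaves as the degree grows, reusing the train/test split from above:
for degree in [1, 2, 3, 4, 5]:
    poly_d = PolynomialFeatures(degree=degree)
    model_d = LinearRegression().fit(poly_d.fit_transform(X_train), y_train)
    mse_d = mean_squared_error(y_test, model_d.predict(poly_d.transform(X_test)))
    print(f"Degree {degree}: test MSE = {mse_d:.2f}")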
Evaluate the model¶
Evaluating a regression model typically involves looking at several metrics to understand both the accuracy and the goodness of fit. Let's implement a function that computes and prints a variety of them:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def evaluate_regression(y_true, y_pred):
    """
    Evaluate a regression model by computing and printing various metrics.
    Parameters:
    - y_true: Actual target values.
    - y_pred: Predicted target values.
    """
    # Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_true, y_pred)
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    # Root Mean Squared Error (RMSE); computed directly, since the
    # squared=False argument is removed in recent scikit-learn versions
    rmse = np.sqrt(mse)
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    # R-squared (Coefficient of Determination)
    r2 = r2_score(y_true, y_pred)
    print(f"R-squared: {r2:.2f}")
    # Adjusted R-squared
    n = len(y_true)
    p = 2  # Number of predictors: x and x^2 in the degree-2 model
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(f"Adjusted R-squared: {adj_r2:.2f}")
evaluate_regression(y_test, y_pred)
Mean Absolute Error (MAE): 3.26
Mean Squared Error (MSE): 18.42
Root Mean Squared Error (RMSE): 4.29
R-squared: 0.64
Adjusted R-squared: 0.63
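An R-squared of 0.64 means the degree-2 model explains roughly 64% of the variance in test-set MPG, while the MAE indicates its predictions are off by about 3.3 MPG on average.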