Classification and Regression Trees (CART)¶
Overview¶
This course aims to help you understand the theoretical principles of decision trees and to guide you through coding a simple application example.
Decision trees belong to the family of recursive partitioning methods, better known under the acronym CART (Classification And Regression Trees), which build a sequence of successive splitting rules used to classify observations or to perform regressions.
We speak of classification when the target variable is qualitative (categorical), a case we will deal with later. In this introduction we focus on regression trees, which apply when the target variable is quantitative. One of their main advantages is their easily readable graphical representation, which makes them a good choice for modeling business problems.
A tree is made up of three types of elements:
- The root, where all the training data resides.
- The nodes/branches, which represent the points, starting from the root, where the data are split into two groups according to a criterion based on the explanatory variables.
- The leaves, which are the terminal nodes of the tree; each leaf is assigned a value when Y is quantitative and a class when Y is qualitative.
Thus, starting from the root, a node is defined that splits all the data according to a criterion based on an explanatory variable. The same process is then repeated on each of the two branches created, and so on, until no split satisfies the node-construction criterion; a terminal node, or leaf, is then defined.
Here is an example of a tree with the root at the top, the first split, and the sequence of branches down to the leaves. Note how clear and visual the tree representation is. Imagine a program that asks a user for three values and aims to classify them: to answer this problem we can use a decision tree like the one above.
To read this tree representation, start with the node at the very top and check whether its condition holds. As indicated in the diagram, if it does, we move down to the left child node and check (or not) the next condition, and so on until there is no lower node.
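To make this concrete, here is a minimal sketch, with made-up thresholds and class names (they are not taken from the figure above), of what such a tree amounts to once written as code: a chain of nested conditions applied to the three values provided by the user.

def classify(x1, x2, x3):
    # Toy decision tree written as nested if/else rules.
    # The thresholds (5, 2, 7) and the classes 'A'/'B' are purely illustrative;
    # a real tree would learn them from data.
    if x1 <= 5:            # root condition
        if x2 <= 2:        # left child condition
            return "A"     # leaf
        return "B"         # leaf
    if x3 <= 7:            # right child condition
        return "A"         # leaf
    return "B"             # leaf

print(classify(3, 1, 9))   # follows root -> left child -> leaf 'A'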
To build such a tree automatically, the algorithm needs several things:
- A criterion for selecting the best division
- A rule to define if a node is terminal
- A method to assign each leaf a value or a class
The mathematical notation of the problem: the division criterion¶
As stated in the previous part, in order to build a decision tree, we need a criterion for selecting the best division for a node.
A division is admissible if the resulting branches carry non-empty nodes (i.e. at least one observation belongs to each of the child nodes). For a parent node containing M observations, there are M - 1 admissible divisions if the variable selected for the division is quantitative or qualitative ordinal, and $2^{m-1}-1$ admissible divisions if the selected variable is nominal with m modalities.
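As a small illustrative sketch (with made-up values), we can enumerate the candidate splits in both cases: 4 distinct quantitative values give 3 candidate thresholds, and 3 categories give $2^{3-1}-1 = 3$ binary partitions.

from itertools import combinations

# Quantitative variable: one candidate threshold between each pair of consecutive sorted values
values = sorted({2.0, 5.0, 7.0, 9.0})
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(thresholds)  # 3 candidate splits for 4 distinct values

# Nominal variable: every non-trivial way of sending a subset of categories to the left child
categories = ["red", "green", "blue"]
partitions = []
for size in range(1, len(categories)):
    for left in combinations(categories, size):
        right = tuple(c for c in categories if c not in left)
        if (right, left) not in partitions:  # avoid counting mirrored partitions twice
            partitions.append((left, right))
print(partitions)  # 2^(3-1) - 1 = 3 admissible partitions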
In order to select the best admissible division, we construct a heterogeneity function which has two remarkable properties:
- It is zero when all the individuals belong to the same modality or have the same value of Y.
- It is maximum when the values of Y are evenly distributed in the node.
We therefore seek the division which minimizes the sum of the heterogeneities of the child nodes.
The mathematical notation of the problem: the stopping criterion¶
A given node is terminal when it is homogeneous, that is to say all the observations in the node have the same value or the same modality of Y, when there are no more admissible divisions, or when the number of observations in the node falls below a value set in advance, generally of the order of a few units.
If Y is a quantitative variable (the regression case), the heterogeneity of node k is written as follows:
$$H_k=\frac{1}{Card(k)}\cdot\sum_{i\in{k}}(y_i-\bar{y_k})^2$$
$Card(k)$, sometimes also denoted $\#(k)$ or $|k|$, is the number of elements in node $k$, and $\bar{y}_k$ is the average of the values of $Y$ among the observations of node $k$; $H_k$ is therefore simply the variance of node $k$. The division retained is the one for which $H_{kG}+H_{kD}$, the sum of the heterogeneities of the left branch and the right branch, is minimal. Recall that the objective is to divide the individuals into two groups that are as homogeneous as possible with respect to our target variable. This may seem complex, but you will see that the Python tools make our task much easier.
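As a quick numerical sketch (with made-up values), the heterogeneity of a node is just the variance of its $Y$ values, and a candidate split is scored by summing the heterogeneities of the two children it creates:

import numpy as np

y_node = np.array([10.0, 12.0, 30.0, 32.0])   # Y values of a parent node (toy data)
print(np.var(y_node))                          # H_k of the parent node: 101.0

# Candidate split sending the first two observations to the left child
y_left, y_right = y_node[:2], y_node[2:]
print(np.var(y_left) + np.var(y_right))        # H_kG + H_kD = 1.0 + 1.0 = 2.0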
If Y is a qualitative variable, let Y have m modalities or categories numbered from 1 to m. The heterogeneity function preferred most of the time is the Gini index (also known as the Gini concentration), which is written as follows:
$$H_k=\sum_{i=1}^{m}p_{k}^{i}\cdot(1-p_{k}^{i})$$
Where $p^i_k$ is the proportion of individuals of class $i$ of $Y$ in node $k$.
As before, for each node we look for, among the admissible divisions, the one that maximizes the decrease in heterogeneity.
Pruning a tree¶
As with regression models such as LASSO and RIDGE, and as with any supervised learning problem, there is a risk of overfitting.
As a reminder, overfitting means that your algorithm fits your training data so closely that it loses its ability to generalize its predictions. The stopping criterion defined for the construction of the tree often encourages overfitting, since it is very likely that most of the leaves of the tree contain only a few observations; we will illustrate this later.
As it stands, the decision tree is therefore very unstable: its bias is almost zero, or even zero by definition, but it depends very strongly on the observations in the training set and will potentially not generalize to new data.
Our problem is therefore to find an intermediate tree that achieves a good bias-variance trade-off for the purpose of estimating $Y$. This is why we will look at random forests (ensembles of trees), which address precisely this problem. We will cover random forests in the next part, don't panic.
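One practical way to obtain such an intermediate tree, not developed further in this course, is cost-complexity pruning, exposed in scikit-learn through the ccp_alpha parameter. The snippet below is only an illustrative sketch on simulated data: it compares the number of leaves of a fully grown tree with that of a pruned one.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated data (illustrative only)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)

# A fully grown tree (near-zero bias, high variance) versus a pruned one
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned_tree = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)

print("Leaves of the full tree:  ", full_tree.get_n_leaves())
print("Leaves of the pruned tree:", pruned_tree.get_n_leaves())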
The Gini criterion¶
The Gini index, also known as the Gini coefficient or Gini impurity, is a measure of inequality or impurity in a dataset. In the context of decision trees, it's used to quantify the impurity of a node, which in turn helps us determine the best feature to split the dataset on.
The Gini index ranges between 0 and 1, with 0 representing perfect purity (all instances in a node belong to a single class) and values close to 1 representing maximum impurity (instances distributed uniformly across many classes); for m classes the maximum is $1 - 1/m$, i.e. 0.5 for a binary problem.
The CART algorithm can use the Gini index to choose the best split at each node of the tree. The algorithm calculates the Gini index for each potential split and selects the split that results in the lowest weighted Gini index.
Example code: the gini_index function¶
Now let's dive into some Python examples to illustrate these concepts. First, we'll start with a simple function to calculate the Gini index of a dataset:
def gini_index(groups, classes):
    # Total number of observations across all groups
    total_instances = sum([len(group) for group in groups])
    gini = 0.0
    for group in groups:
        group_size = len(group)
        if group_size == 0:
            continue
        # Sum of squared class proportions within the group
        score = 0.0
        for class_val in classes:
            proportion = [row[-1] for row in group].count(class_val) / group_size
            score += proportion * proportion
        # Weight the group's impurity by its relative size
        gini += (1 - score) * (group_size / total_instances)
    return gini
#Now let's consider a toy dataset with two features and a binary class label (A,B)
dataset = [
[1, 1, 'A'],
[1, 2, 'A'],
[2, 1, 'B'],
[2, 2, 'B'],
[3, 1, 'A'],
[3, 2, 'A'],
[4, 1, 'B'],
[4, 2, 'B'],
]
We can now use our gini_index() function to evaluate the Gini index for different splits:
# Example splits:
# Split 1: [1, 1, 'A'], [1, 2, 'A'] | [2, 1, 'B'], [2, 2, 'B'], [3, 1, 'A'], [3, 2, 'A'], [4, 1, 'B'], [4, 2, 'B']
# Split 2: [1, 1, 'A'], [1, 2, 'A'], [2, 1, 'B'], [2, 2, 'B'] | [3, 1, 'A'], [3, 2, 'A'], [4, 1, 'B'], [4, 2, 'B']
split1 = (dataset[:2], dataset[2:])
split2 = (dataset[:4], dataset[4:])
classes = list(set(row[-1] for row in dataset))
print("Gini index for split 1:", gini_index(split1, classes))
print("Gini index for split 2:", gini_index(split2, classes))
Gini index for split 1: 0.3333333333333333
Gini index for split 2: 0.5
In this case, the first split results in a lower Gini index (about 0.33 versus 0.5 for the second split), so it separates the classes better, even though neither split is perfectly pure (a Gini index of 0). The CART algorithm would choose the split with the lowest weighted Gini index.
Other CART split metrics for classification¶
In addition to the Gini index, there are other metrics used for splitting nodes in decision trees. Two popular alternatives are Information Gain and Gain Ratio. Let's discuss each of them.
Information Gain¶
Information Gain is based on the concept of entropy, which measures the impurity of an input set. In the context of decision trees, it measures the reduction in entropy achieved by partitioning a dataset based on a particular feature. The attribute with the highest Information Gain is selected as the splitting attribute.
Entropy is calculated using the following formula:
$$\mathrm{Entropy}(S) = -\sum_{c} P(c)\,\log_2 P(c)$$
Where S is the dataset, c represents a class, and P(c) is the proportion of instances belonging to class c.
Information Gain is calculated as:
$$\mathrm{InformationGain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
Where S is the dataset, A is the attribute we are considering, $S_v$ is the subset of S where the attribute A has value v, and $|S_v| / |S|$ is the proportion of instances in $S_v$.
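To make these formulas concrete, here is a minimal sketch (the helper names are our own) on the same kind of toy dataset as above, with the class label in the last column, computing the entropy of a dataset and the information gain of a split:

from math import log2

def entropy(rows):
    # Entropy(S) = -sum over classes of P(c) * log2(P(c))
    labels = [row[-1] for row in rows]
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        ent -= p * log2(p)
    return ent

def information_gain(parent, children):
    # Entropy of the parent minus the weighted entropy of the child subsets
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children if child)
    return entropy(parent) - weighted

dataset = [
    [1, 1, 'A'], [1, 2, 'A'], [2, 1, 'B'], [2, 2, 'B'],
    [3, 1, 'A'], [3, 2, 'A'], [4, 1, 'B'], [4, 2, 'B'],
]
split1 = (dataset[:2], dataset[2:])
print("Information gain of split 1:", information_gain(dataset, split1))  # about 0.31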
Gain Ratio¶
The Gain Ratio is an extension of Information Gain that takes into account the intrinsic information of an attribute. It is useful in dealing with the bias of Information Gain towards attributes with more distinct values.
Gain Ratio is calculated as:
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{InformationGain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
Where SplitInformation(S, A) is the entropy of the distribution of instances across the values of attribute A (regardless of class).
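Continuing in the same spirit, here is a self-contained sketch (again, the helper names are our own) of the gain ratio: the information gain of a split divided by its split information, i.e. the entropy of the sizes of the subsets it produces.

from math import log2

def entropy(rows):
    # Class entropy of a set of rows, with the class label in the last column
    labels = [row[-1] for row in rows]
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain_ratio(parent, children):
    # Information gain of the split divided by the split information of the partition
    n = len(parent)
    info_gain = entropy(parent) - sum(len(c) / n * entropy(c) for c in children if c)
    split_info = -sum((len(c) / n) * log2(len(c) / n) for c in children if c)
    return info_gain / split_info if split_info > 0 else 0.0

dataset = [
    [1, 1, 'A'], [1, 2, 'A'], [2, 1, 'B'], [2, 2, 'B'],
    [3, 1, 'A'], [3, 2, 'A'], [4, 1, 'B'], [4, 2, 'B'],
]
split1 = (dataset[:2], dataset[2:])
print("Gain ratio of split 1:", gain_ratio(dataset, split1))  # about 0.38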
Now, let's discuss the differences between Gini and other metrics:
- Gini index and Information Gain both measure impurity reduction, but Gini index is based on the probability of misclassifying an instance, while Information Gain is based on the concept of entropy. In general, Gini index tends to be more efficient computationally, while Information Gain can be more accurate in some cases.
- Gain Ratio addresses the bias of Information Gain towards attributes with more distinct values. It normalizes the Information Gain by considering the intrinsic information of the attribute.
- The choice of impurity measure may impact the structure of the resulting decision tree. However, in practice, the choice between Gini index, Information Gain, and Gain Ratio often leads to similar decision trees, and the performance differences are generally small.
It's important to experiment with different impurity measures and evaluate their impact on the performance of the decision tree for a specific dataset or problem.
CART metrics for regression¶
In the case of regression trees, CART uses different metrics to measure the quality of a split. Instead of measuring impurity like in classification, it measures the variation in the target variable. Two common metrics for regression trees are Mean Squared Error (MSE) and Mean Absolute Error (MAE).
We have already seen the concept of MSE in the regression course; let's code it in a simple way in order to illustrate the difference between MSE and MAE on a simple toy dataset.
The MAE measures the average absolute difference between the predicted and true target values. Like the MSE, it is calculated as:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
Let's code this:
def mean_absolute_error(groups):
    # Total number of observations across all groups
    total_instances = sum([len(group) for group in groups])
    mae = 0.0
    for group in groups:
        group_size = len(group)
        if group_size == 0:
            continue
        # Predicted value for the group: the mean of its targets
        group_mean = sum([row[-1] for row in group]) / group_size
        group_mae = sum([abs(row[-1] - group_mean) for row in group]) / group_size
        # Weight the group's error by its relative size
        mae += group_mae * (group_size / total_instances)
    return mae

def mean_squared_error(groups):
    # Total number of observations across all groups
    total_instances = sum([len(group) for group in groups])
    mse = 0.0
    for group in groups:
        group_size = len(group)
        if group_size == 0:
            continue
        # Predicted value for the group: the mean of its targets
        group_mean = sum([row[-1] for row in group]) / group_size
        group_mse = sum([(row[-1] - group_mean) ** 2 for row in group]) / group_size
        # Weight the group's error by its relative size
        mse += group_mse * (group_size / total_instances)
    return mse
#Now, let's consider a toy dataset with two features and a continuous target variable for the MAE illustration
dataset = [
[1, 1, 10],
[1, 2, 12],
[2, 1, 20],
[2, 2, 22],
[3, 1, 30],
[3, 2, 32],
[4, 1, 40],
[4, 2, 42],
]
# Example splits:
# Split 1: [1, 1, 10], [1, 2, 12] | [2, 1, 20], [2, 2, 22], [3, 1, 30], [3, 2, 32], [4, 1, 40], [4, 2, 42]
# Split 2: [1, 1, 10], [1, 2, 12], [2, 1, 20], [2, 2, 22] | [3, 1, 30], [3, 2, 32], [4, 1, 40], [4, 2, 42]
split1 = (dataset[:2], dataset[2:])
split2 = (dataset[:4], dataset[4:])
print("Mean Absolute Error for split 1:", mean_absolute_error(split1))
print("Mean Absolute Error for split 2:", mean_absolute_error(split2))
Mean Absolute Error for split 1: 5.5
Mean Absolute Error for split 2: 5.0
In this example, the second split results in a lower MAE, indicating that it's a better split for the regression problem.
Now, let's create a toy dataset of points and calculate the MAE and MSE for various splits and compare the results.
import numpy as np
import matplotlib.pyplot as plt
# Create a toy dataset
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + np.random.normal(0, 2, len(x))
dataset = list(zip(x, y))
# Calculate the MAE and MSE for various splits
mae_values = []
mse_values = []
splits = np.arange(1, len(dataset))
for split_idx in splits:
    split1 = dataset[:split_idx]
    split2 = dataset[split_idx:]
    mae_values.append(mean_absolute_error([split1, split2]))
    mse_values.append(mean_squared_error([split1, split2]))
# Plot the MAE and MSE values
plt.plot(splits, mae_values, color='blue', label='MAE')
plt.plot(splits, mse_values, color='red', label='MSE')
plt.xlabel('Split Index')
plt.ylabel('Error Value')
plt.legend()
plt.title('MAE vs. MSE for Various Splits')
plt.show()
This code generates a toy dataset of points along a line with some random noise. It then calculates the MAE and MSE for various splits and plots the error values on a single graph with two distinct colors. The graph demonstrates how MAE and MSE values change depending on the chosen split.
The main difference between MAE and MSE is how they penalize large errors. MSE squares the errors, which makes larger errors more prominent in the final error value. In contrast, MAE treats all errors linearly. As a result, MSE is more sensitive to outliers, while MAE is more robust to them.
The choice between MAE and MSE depends on the problem at hand and whether it's more important to penalize large errors or maintain robustness to outliers.
Example on a toy dataset¶
Let's start by coding a small example in order to understand how a tree works in the context of regression: estimating a sine function.
If the word sine brings back bad memories of school, know that sinusoidal functions are far from being abstract notions: they are everywhere around us, in the movement of waves, in the simulation of vibrations to test the resistance of materials, and in many other things 🤓
Data Simulation¶
For this example, we simulate a sine function with random and independent noises.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
# let's build our example dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# we add some noise to make the data more realistic
y[::5] += 3 * (0.5 - rng.rand(16))
The importance of varying a hyperparameter in regression trees¶
As seen previously, here we are looking to estimate our target variable Y (a sine function) from the data X 🤓
The first thing we will start with is understanding how a hyperparameter works in the case of a regression tree. A hyperparameter is a parameter whose value is used to control the learning process.
To fully understand this notion, we are going to create 4 regression trees with a different depth (our hyperparameter), respectively 2, 3, 4 and 6. For each of the trees:
- We create the scikit-learn object allowing us to perform a regression on our data with a decision tree
- We optimize (fit) the tree on our data
- We use the tree to predict (predict) on simulated test data
# Fit regression model
regressors = []
depths = [2, 3, 4, 6]
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
Y_test = []
for p in depths:
    reg = DecisionTreeRegressor(max_depth=p)  # create a tree of depth p
    reg.fit(X, y)                             # fit the tree on the data
    Y_test.append(reg.predict(X_test))        # predict on the test data
    regressors.append(reg)
Finally, we can display the input data and the predicted values for each of our trees.
# plot options
plt.figure(figsize=(16, 10))
colors = ["darkgray", "cornflowerblue", "yellowgreen", "lightcoral"]
# for loop to plot all the trees
for i, reg in zip(range(len(regressors)), regressors):
    # define the figure inside the subplot
    plt.subplot(220 + i + 1)
    plt.plot(X_test, Y_test[i], color=colors[i], label="Depth {0}".format(depths[i]), linewidth=2)
    # plot the data in the figure
    plt.scatter(X, y, s=30, c="darkorange", label="data")
    # plot options
    plt.xlabel("Explanatory value")
    plt.ylabel("Response value")
    plt.title("Regression tree visualization")
    plt.legend()
# save the figure
plt.savefig("deep_illustration.png")
plt.show()
More technical details about the CART algorithm¶
We only consider the tree of depth $d=2$. If $d=2$, then there are $3$ splits (or cuts), denoted $t_1, t_2$ and $t_3$. Let's start at the beginning:
Initially, there is no cut. We therefore wonder how to calculate $t_1$, to obtain the rule $X \leq t_1$. To do this, we must introduce a loss metric, that is to say a function that tells us if the cut is good or not. Just like for linear regression, we want to consider the least squares distance between the theoretical value $y$ and the predicted value $\hat{y}$. We therefore want to cut so as to minimize this quadratic loss to the left of $t_1$, but also to the right of $t_1$.
Let's be more formal. Call $R_G = \{ X : X \leq t_1 \}$ the set to the left of $t_1$ and $R_D=\{ X : X > t_1 \}$ the set to the right of $t_1$. It has been said that we want to minimize the quadratic loss, i.e. to solve
$$\min_{t_1 \in \mathbb{R}} \sum_{i=1}^n (y_i-\hat{y}_i)^2$$
Problem: $t_1$ does not enter the sum! Since this is a split, we must consider separately the points that fall in $R_G$ and those that fall in $R_D$. Moreover, since we want to assign a weight to each node, which here will define the predicted value $\hat{y}$, we must introduce $c_G$ and $c_D$, which we interpret like this:
- All the points $X$ which belong to $R_G$ (i.e. less than or equal to $t_1$) will have the predicted value $c_G$.
- All the points $X$ which belong to $R_D$ (i.e. strictly greater than $t_1$) will have the predicted value $c_D$.
But these weights must also be optimized in order to be calibrated according to our data. From then on, we obtain a formula much more applicable numerically:
$$\min_{t_1 \in \mathbb{R}} \left(\min_{c_G \in \mathbb{R}} \sum_{x_i \in R_G} (y_i-c_G)^2 +\min_{c_D \in \mathbb{R}} \sum_{x_i \in R_D} (y_i-c_D)^2 \right)$$
Since each inner minimization is solved by taking $\hat{c}_G$ and $\hat{c}_D$ to be the mean of the $y_i$ in $R_G$ and $R_D$ respectively, we can simplify the problem as:
$$\min_{t_1 \in \mathbb{R}} \left( \sum_{x_i \in R_G} (y_i-\hat{c}_G)^2 + \sum_{x_i \in R_D} (y_i-\hat{c}_D)^2 \right)$$
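Here is a minimal sketch of this optimization on hypothetical one-dimensional data: for each candidate threshold $t_1$ we take $\hat{c}_G$ and $\hat{c}_D$ as the means of the left and right groups, and keep the threshold that minimizes the total squared error.

import numpy as np

def best_first_split(x, y):
    # Exhaustive search over candidate thresholds placed between consecutive sorted x values
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_loss = None, np.inf
    for t in (x[:-1] + x[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        c_g, c_d = left.mean(), right.mean()          # optimal constants on each side
        loss = ((left - c_g) ** 2).sum() + ((right - c_d) ** 2).sum()
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

# Toy data (made up): a noiseless step function whose jump the split should recover
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(best_first_split(x, y))   # expected threshold 3.5 with zero loss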
Questions 🤔¶
- Compare the graphs obtained and deduce the role of the depth hyperparameter max_depth.
- Explain the phenomenon of underfitting using the graphs above: for what value of max_depth is our algorithm underfitting?
- Explain the phenomenon of overfitting using the graphs above: for what value of max_depth is our algorithm overfitting?
- Finally, what is the best value for the depth hyperparameter max_depth?
CART coding from scratch¶
The objective of this project is to build a custom CART Decision Tree Regressor from scratch without relying on external libraries like Scikit-learn 😃
The regressor will be implemented in Python and should be able to handle simple regression tasks.
Project Outline¶
- Understand the CART algorithm and its components.
- Implement the DecisionTreeRegressor class.
- Write functions for splitting the data and calculating the errors.
- Visualize the performance of the custom regressor on a toy dataset.
First, let's define a DecisionTreeRegressor class, which has an optional parameter max_depth that determines the maximum depth of the decision tree. If max_depth is not specified, the tree can grow without depth restrictions.
class DecisionTreeRegressor:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
Now let's create a Node class. The Node class represents a node in the decision tree. Its constructor __init__(self, left=None, right=None, value=None, split_feature=None, split_value=None) initializes the node object with left and right child nodes, the node's value, and the split feature and value (for non-leaf nodes).
class Node:
    def __init__(self, left=None, right=None, value=None, split_feature=None, split_value=None):
        self.left = left
        self.right = right
        self.value = value
        self.split_feature = split_feature
        self.split_value = split_value
Now let's define some important concepts:
- The fit(self, X, y) method takes the input features (X) and the target variable (y) as arguments and builds the decision tree by calling the _build_tree helper function.
- The predict(self, X) method takes an array of input features (X) and returns the predicted target variable for each input by calling the _predict_single helper function.
- The _build_tree(self, X, y, depth) helper function is a recursive function that builds the decision tree by finding the best split for the data and creating child nodes until the stopping criterion (maximum depth) is met. It takes the input features X, the target variable y, and the current depth of the tree. If the current depth exceeds max_depth or there is no optimal feature to split on, it creates a leaf node with the mean value of y. Otherwise, it calculates the best feature and value to split the data, creates left and right child nodes, and recurses on both subsets of the data.
- The _find_best_split(self, X, y) helper function finds the best feature and value to split the data on by minimizing the mean squared error (MSE). It iterates through all possible features and split values, calculates the MSE for each split, and returns the feature and split value that result in the lowest MSE.
- The _predict_single(self, x) helper function takes a single input feature vector x and predicts the target variable for it. It traverses the decision tree from the root, following the left or right child nodes based on the split feature and value, until it reaches a leaf node. The value of the leaf node is the predicted target variable.
# full code
import numpy as np

# Define the main DecisionTreeRegressor class
class DecisionTreeRegressor:
    # Class constructor with an optional parameter `max_depth` for the maximum depth of the tree
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    # Define the inner Node class representing a node in the decision tree
    class Node:
        # Node class constructor with parameters for the left and right child nodes, value, split feature, and split value
        def __init__(self, left=None, right=None, value=None, split_feature=None, split_value=None):
            self.left = left
            self.right = right
            self.value = value
            self.split_feature = split_feature
            self.split_value = split_value

    # `fit` method takes the input features (X) and target variable (y) and builds the decision tree
    def fit(self, X, y):
        self.root = self._build_tree(X, y, 0)

    # `predict` method takes input features (X) and returns predictions for each input
    def predict(self, X):
        return [self._predict_single(x) for x in X]

    # Recursive helper function `_build_tree` for building the decision tree
    def _build_tree(self, X, y, depth):
        if len(y) == 0:
            return None
        if self.max_depth is not None and depth >= self.max_depth:
            return self.Node(value=np.mean(y))
        split_feature, split_value = self._find_best_split(X, y)
        if split_feature is None:
            return self.Node(value=np.mean(y))
        left_mask = X[:, split_feature] <= split_value
        right_mask = ~left_mask
        left_node = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_node = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        return self.Node(left=left_node, right=right_node, split_feature=split_feature, split_value=split_value)

    # Helper function `_find_best_split` to find the best feature and value to split the data
    def _find_best_split(self, X, y):
        best_mse = float("inf")
        best_split_feature = None
        best_split_value = None
        for feature in range(X.shape[1]):
            feature_values = X[:, feature]
            for value in feature_values:
                left_mask = feature_values <= value
                right_mask = ~left_mask
                left_y = y[left_mask]
                right_y = y[right_mask]
                if len(left_y) <= 1 or len(right_y) <= 1:
                    continue
                mse = len(left_y) * np.var(left_y) + len(right_y) * np.var(right_y)
                if mse < best_mse:
                    best_mse = mse
                    best_split_feature = feature
                    best_split_value = value
        return best_split_feature, best_split_value

    # Helper function `_predict_single` for predicting the target variable for a single input feature vector
    def _predict_single(self, x):
        node = self.root
        while node.left is not None and node.right is not None:
            if x[node.split_feature] <= node.split_value:
                node = node.left
            else:
                node = node.right
        return node.value
Now let's generate our toy sample dataset and then apply our DecisionTreeRegressor class:
import matplotlib.pyplot as plt
np.random.seed(42)
X = np.linspace(0, 10, 100)[:, np.newaxis]
y = 2 * X.ravel() + 1 + np.random.normal(0, 2, len(X))
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
regressors = []
depths = [2, 3, 4, 6]
Y_test = []
for p in depths:
    reg = DecisionTreeRegressor(max_depth=p)
    reg.fit(X, y)
    Y_test.append(reg.predict(X_test))
    regressors.append(reg)
plt.figure(figsize=(16, 10))
colors = ["darkgray", "cornflowerblue", "yellowgreen", "lightcoral"]
# for loop to plot all the trees
for i, reg in zip(range(len(regressors)), regressors):
    # define the figure inside the subplot
    plt.subplot(220 + i + 1)
    plt.plot(X_test, Y_test[i], color=colors[i], label="Depth: {0}".format(depths[i]), linewidth=2)
    # plot the data in the figure
    plt.scatter(X, y, s=30, c="darkorange", label="data")
    # plot options
    plt.xlabel("Feature")
    plt.ylabel("Target")
    plt.title("Tree visualization")
    plt.legend()
# save the figure
plt.savefig("deep_illustration.png")
plt.show()