PCA & t-SNE¶
PCA and t-SNE are dimensionality reduction techniques that are widely used in machine learning and statistics!
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable (hard to control or deal with). More information on Wikipedia.
As you may know, in both machine learning and statistics, building an efficient model requires keeping only the features that carry genuinely useful information. This concept is known as dimensionality reduction, which essentially means minimizing the number of random variables we're considering by focusing on a core set of principal variables.
The challenge of high dimensions¶
Let's look in detail at the challenges of data that has many dimensions:
- Computationally, it's expensive to process.
- There's a risk of overfitting the model, meaning it might excel on training data but falter on unseen data.
- In high-dimensional spaces, data points aren't typically distributed randomly and often exhibit misleading correlations.
- As dimensions increase, the difference in distances between the closest and farthest data points tends to diminish, potentially undermining tools that rely on these distances.
Why is reducing dimensionality crucial?¶
It aids in addressing the aforementioned challenges while retaining the essential information needed to develop predictive models. Predictions often rely on numerous factors or features. A vast number of features can complicate data visualization and processing.
Many times, these features overlap in terms of the information they provide, making some of them redundant. This is where techniques to reduce dimensionality become invaluable.
- It conserves both time and storage.
- By eliminating multicollinearity, it enhances the clarity of machine learning model parameters.
- Data visualization becomes more straightforward when dimensions are reduced to simpler forms like 2D or 3D.
- It helps sidestep the pitfalls associated with high dimensionality.
- By filtering out non-essential features, it ensures models aren't misled by irrelevant data, thereby potentially boosting accuracy.
As you can see in the animation below, it is sometimes more straightforward to see clusters when dimensions are reduced or increased 🤓
In short, dimensionality reduction is used to reduce the number of features in a dataset while retaining as much of the important information as possible.
Famous methods of Dimensionality Reduction¶
The most common methods used for dimensionality reduction include:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
Dimensionality reduction may be linear or non-linear, depending on the method used. The prime linear method, Principal Component Analysis (PCA), is discussed below. More information on GeeksforGeeks.
In this article we will focus only on PCA and t-SNE and implement them with 🐍
PCA¶
Principal Component Analysis (PCA) works by identifying the axes (principal components) in the dataset that maximize the variance of the data, i.e. the directions that carry the most information. By projecting the data onto these axes (generally the first two), we can reduce the number of dimensions while retaining as much information as possible.
If you are not very comfortable with maths, I recommend the fantastic video from StatQuest.
Mathematical overview¶
Given a dataset with $n$ samples and $p$ features, the goal of PCA is to find a set of $p′$ orthogonal vectors (where $p′<p$) that can be used as a basis to represent the dataset. These vectors are called principal components.
Standardization¶
For PCA to work properly, like many other ML algorithms, you should first standardize your data so that each feature has a mean of 0 and a variance of 1. We can write:
$$X_{\text{std}} = \frac{X - \mu}{\sigma}$$
where:
- $X$ is the original column data
- $\mu$ is the mean of this column
- $\sigma$ is the standard deviation of this column
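As a quick illustration, here is a minimal sketch of this step in NumPy (the tiny X below is just made-up data):
import numpy as np
# Made-up dataset: 3 samples, 2 features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9]])
# Standardize each column to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # ~0 for each feature
print(X_std.std(axis=0))   # ~1 for each feature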
Compute the Covariance Matrix¶
The covariance matrix captures the relationships between features. For a dataset with $p$ features, the covariance matrix is $p \times p$ and is given by:
$$\Sigma = \frac{1}{n-1} X_{\text{std}}^T X_{\text{std}}$$
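Continuing the small sketch from the standardization step, this formula translates directly into NumPy and can be checked against np.cov:
# Covariance matrix of the standardized data (features as columns)
cov = (X_std.T @ X_std) / (X_std.shape[0] - 1)
# np.cov with rowvar=False computes the same quantity
print(np.allclose(cov, np.cov(X_std, rowvar=False)))  # True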
Compute the Eigenvectors and Eigenvalues¶
The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues define their magnitude. In the context of PCA, the eigenvectors are called principal axes, and the eigenvalue associated with each axis indicates the amount of variance explained by that axis.
$$\Sigma v = \lambda v$$
where:
- $\Sigma$ is the covariance matrix
- $v$ is an eigenvector
- $\lambda$ is the corresponding eigenvalue
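Still on the same sketch, NumPy's eigh routine handles the decomposition of the symmetric covariance matrix (note that it returns eigenvalues in ascending order):
# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)   # variance explained along each principal axis
print(eigenvectors)  # one principal axis per column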
Sort Eigenvectors by decreasing eigenvalues¶
The eigenvector with the highest eigenvalue is the first principal component. The second principal component (which is orthogonal to the first) corresponds to the eigenvector with the second highest eigenvalue, and so on.
Then choose the first $k$ eigenvectors and form a $p \times k$ matrix $W$, where $p$ is the number of original dimensions and $k$ is the number of chosen principal components (eigenvectors).
Finally, transform the original data to the new subspace using the matrix $W$:
$$X_{\text{pca}} = X_{\text{std}} W$$
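Putting the last two steps together on the same toy data, the sorting, selection and projection fit in a few lines (here with $k = 1$):
# Sort principal axes by decreasing eigenvalue and keep the first k of them
k = 1
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:k]]  # shape (p, k)
# Project the standardized data onto the new subspace
X_pca = X_std @ W               # shape (n, k)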
Python implementation¶
Let's begin by coding a simple reduction to 1 dimension with 🐍
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from sklearn import decomposition
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
def pca(X, num_components):
    # 1. Standardize the data
    X_mean = np.mean(X, axis=0)
    X_std = np.std(X, axis=0)
    X_normalized = (X - X_mean) / X_std
    # 2. Compute the covariance matrix
    covariance_matrix = np.cov(X_normalized, rowvar=False)
    # 3. Compute the eigenvectors and eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
    # 4. Sort eigenvectors by decreasing eigenvalues
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors_sorted = eigenvectors[:, sorted_indices]
    eigenvalues_sorted = eigenvalues[sorted_indices]
    # 5. Choose the top 'num_components' eigenvectors
    eigenvectors_subset = eigenvectors_sorted[:, :num_components]
    # 6. Transform the data
    X_reduced = np.dot(X_normalized, eigenvectors_subset)
    return X_reduced
# Sample dataset
data = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9]
])
# Visualize original data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(data[:, 0], data[:, 1], color='blue', alpha=0.5)
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# Apply PCA and reduce data to 1 dimension
data_reduced = pca(data, 1)
# Visualize reduced data
plt.subplot(1, 2, 2)
plt.scatter(data_reduced, [0] * len(data_reduced), color='red', alpha=0.5)
plt.title('Data after PCA')
plt.xlabel('Principal Component 1')
plt.yticks([]) # Hide y-axis as it's not meaningful in this 1D visualization
plt.tight_layout()
plt.show()
So, this is cool, but let's use the appropriate PCA() function from the scikit-learn lib 😎
First let's generate another toy dataset with the function make_blobs()
X1, Y1 = make_blobs(n_features=10,
                    n_samples=100,
                    centers=4,
                    random_state=4,
                    cluster_std=2)
print(X1.shape)
plt.figure(figsize=(16,9))
sns.swarmplot(x=X1[:, 0], y=X1[:, 1], size=10)
(100, 10)
# Define the number of components you want
pca = decomposition.PCA(n_components=4)
pc = pca.fit_transform(X1)
# Save it into a dataframe
pc_df = pd.DataFrame(data=pc,
                     columns=['PC1', 'PC2', 'PC3', 'PC4'])
pc_df['Cluster'] = Y1
pc_df.head()
|   | PC1 | PC2 | PC3 | PC4 | Cluster |
|---|---|---|---|---|---|
| 0 | -8.133443 | -0.302138 | 9.984672 | 0.423878 | 2 |
| 1 | 18.931381 | 0.193838 | 0.045462 | -3.894042 | 1 |
| 2 | -6.571463 | -14.188490 | -3.120115 | 1.278588 | 0 |
| 3 | -7.533948 | 14.439427 | -5.443487 | 3.358252 | 3 |
| 4 | -4.591760 | -11.315284 | -9.130630 | -1.420151 | 0 |
# See the variance ratio per axis
pca.explained_variance_ratio_
array([0.41594854, 0.3391866 , 0.1600729 , 0.02016822])
df = pd.DataFrame({'var': pca.explained_variance_ratio_,
                   'PC': ['PC1', 'PC2', 'PC3', 'PC4']})
sns.barplot(x='PC', y='var', data=df, color='c');
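A common complement to this bar plot (just a quick sketch, the 90% threshold below is an arbitrary choice) is the cumulative explained variance, which helps decide how many components to keep:
# Cumulative explained variance across the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)
# Smallest number of components reaching (an arbitrary) 90% of the variance
n_keep = int(np.argmax(cum_var >= 0.90)) + 1
print(f"Components needed for 90% of the variance: {n_keep}")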
# Plot the data on the first two axes to see the effect of our decomposition
sns.lmplot(x="PC1", y="PC2",
           data=pc_df,
           fit_reg=False,
           hue='Cluster',
           legend=True,
           scatter_kws={"s": 80})
t-SNE¶
t-SNE is a non-linear dimensionality reduction technique particularly suitable for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. It works by minimizing the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding. **It's especially useful for visualizing clusters or groups within data.**
Mathematical Foundations¶
Implementing t-SNE from scratch requires iterative optimization to minimize a certain cost function 😇
For brevity, we'll focus on the general idea and the main steps:
- Compute the pairwise affinities $p_{ij}$ in the high-dimensional space
- Initialize the low-dimensional solution (for example with PCA)
- Use gradient descent to minimize the divergence, updating the low-dimensional representations $y_i$ at each iteration:
  - Compute the pairwise affinities $q_{ij}$ in the low-dimensional space
  - Compute the gradient of the Kullback-Leibler divergence
  - Update the solution using the computed gradient 😎
As you may understand, it is not a simple few-line function! I often recommend this project as a solid foundation 🦾
Computing Pairwise Affinities¶
For high-dimensional data, the similarity of datapoint $x_i$ to datapoint $x_j$ is given by a Gaussian distribution:
$$p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}$$
where:
- $p_{j|i}$ is the conditional probability that $x_i$ would pick $x_j$ as its neighbor
- $\sigma_i$ is the variance of the Gaussian that is centered on datapoint $x_i$
The symmetric pairwise affinities $p_{ij}$ are then computed as:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
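As a simplified sketch (not what scikit-learn does internally), here is how these affinities could be computed in NumPy with a single fixed $\sigma$ for every point; a real t-SNE implementation instead tunes each $\sigma_i$ by binary search so that the conditional distribution reaches a target perplexity:
def pairwise_affinities(X, sigma=1.0):
    # Simplified: one fixed sigma for all points (real t-SNE tunes each sigma_i)
    # Squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Conditional probabilities p_{j|i} with a Gaussian kernel
    P_cond = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)                # a point is not its own neighbor
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # normalize each row
    # Symmetrize to obtain p_{ij}
    return (P_cond + P_cond.T) / (2 * X.shape[0])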
Computing Pairwise Affinities $q_{ij}$ in the Low-Dimensional Map¶
In the low-dimensional space, the pairwise affinities $q_{ij}$ are computed as follows:
$$q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}$$
where $y_i$ and $y_j$ are the low-dimensional representations of datapoints $x_i$ and $x_j$.
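The matching sketch for the low-dimensional map uses a Student-t (Cauchy) kernel and a single normalization over all pairs:
def low_dim_affinities(Y):
    # Squared Euclidean distances between all pairs of low-dimensional points
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)  # Student-t kernel with one degree of freedom
    np.fill_diagonal(inv, 0.0)    # exclude the k = l terms from the sum
    return inv / inv.sum()        # normalize over all pairs to get q_{ij}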
Minimizing the Divergence¶
As you may know, machine learning algorithms are always about minimizing or maximizing stuff 😂 In this case, t-SNE minimizes the divergence between $p_{ij}$ and $q_{ij}$ using the Kullback-Leibler divergence:
$$C = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
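With the two sketch functions above, this cost is a one-liner (a full implementation would also need the gradient of this cost with respect to the $y_i$):
def kl_divergence(P, Q, eps=1e-12):
    # Small eps avoids log(0); diagonal entries of P and Q are zero anyway
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))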
Implementation with sklearn¶
We will now be using the TSNE() function of the scikit-learn package in 🐍
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Visualize original data in the first two dimensions
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X1[:, 0], X1[:, 1], c=Y1, cmap='viridis', alpha=0.5)
plt.title('Original Data (First Two Dimensions)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X1_tsne = tsne.fit_transform(X1)
# Visualize t-SNE results
plt.subplot(1, 2, 2)
plt.scatter(X1_tsne[:, 0], X1_tsne[:, 1], c=Y1, cmap='viridis', alpha=0.5)
plt.title('Data after t-SNE')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.tight_layout()
plt.show()