Dimensionality Reduction


Before jumping into dimensionality reduction, let’s first define what a dimension is. Given a matrix A, the dimension of the matrix is the number of rows by the number of columns.
If A has 3 rows and 5 columns, A is a 3×5 matrix.
A = [1, 2, 3] -> 1 row, 3 columns
The dimension of A is 1×3.
“Dimensionality” simply refers to the number of features/variables in your dataset. If each row of the matrix is one observation and each column is one feature, the dimensionality is the number of columns.
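As a minimal sketch of the definitions above, assuming NumPy is available: the `shape` attribute of an array gives (rows, columns). The array `X` and the names `n_samples` and `n_features` are illustrative assumptions, not part of the original example.

```python
import numpy as np

# The 1x3 matrix from the text: one row, three columns.
A = np.array([[1, 2, 3]])
print(A.shape)  # (1, 3)

# Hypothetical dataset: 3 observations (rows) described by 5 features (columns).
X = np.zeros((3, 5))
n_samples, n_features = X.shape

# The dimensionality of the dataset is its number of features.
print("dimensionality:", n_features)
```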
Curse of dimensionality
It is a phenomenon that occurs in high-dimensional space but hardly ever in lower-dimensional space. As the number of dimensions grows, the data becomes sparse: the same number of data points is spread over a much larger space.
Higher-dimensional space causes problems in clustering (it becomes very difficult to separate the data of one cluster from another), the search space grows, and the complexity of the model increases.
So, what is the curse of dimensionality?
Let us understand this phenomenon through a simple ‘Jay and his dog farm’ example.
As in figure 1: Jay’s only task is to find all dogs (feature 1). This is very easy, as he only has to walk in a straight line to pick up all his dogs. Data density is also very high here, so he will have no difficulty tracing all the dogs.
In the second scenario, we add another feature, so his task is now to find only those dogs (feature 1) that belong to his farm (feature 2). Here he will take a little more time compared to figure 1.
In the third scenario, his task is to find only those dogs (feature 1) that belong to his farm (feature 2) and whose breed is Labrador (feature 3). The number of dimensions has increased to three and data density has decreased, so his task becomes more difficult.
If we add further dimensions, say color, region, diet, or health, it becomes more and more difficult for him to find his dog. In other words, as the number of features increases, data density decreases and complexity increases, and it becomes very difficult for machine learning models to work efficiently.
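The drop in data density can be made concrete with a rough back-of-the-envelope sketch. It assumes a fixed number of data points and that every feature is split into the same number of equal bins; the function name `points_per_cell` and the numbers (1000 points, 10 bins per feature) are illustrative assumptions, not values from Jay’s example.

```python
def points_per_cell(n_points: int, bins_per_feature: int, n_features: int) -> float:
    """Average number of points per grid cell when each feature is cut into bins."""
    n_cells = bins_per_feature ** n_features  # the grid grows exponentially
    return n_points / n_cells


# Same 1000 points, but more and more features describing them.
for n_features in (1, 2, 3, 4):
    density = points_per_cell(1000, 10, n_features)
    print(f"{n_features} feature(s): {density} points per cell on average")
```

With one feature there are 100 points per cell on average; with four features most cells are empty. This is the sparsity that makes the search harder as features are added.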
To overcome the curse of dimensionality, dimensionality reduction comes into the picture. It is the reduction of a high-dimensional space to a lower-dimensional space, which also makes the dataset easier to visualize.
There are various dimensionality reduction methods; we are going to look at some of them:
- Principal component analysis (PCA)
- Reducing multicollinearity with the variance inflation factor (VIF)
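To give a first feel for the two methods listed above, here is a minimal sketch, assuming NumPy is available. It is not a reference implementation: the helper names `pca` and `vif` and the synthetic data are assumptions made for illustration. PCA is computed here through a singular value decomposition of the centered data, and the VIF of feature j is computed as 1 / (1 - R²), where R² comes from regressing feature j on all the other features.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance


def vif(X):
    """Variance inflation factor of every column of X."""
    n, p = X.shape
    factors = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress feature j on the remaining features plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors


# Synthetic data (an assumption): feature 1 is almost a multiple of feature 0,
# feature 2 is independent noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Z, components, explained_variance = pca(X, n_components=2)
print(X.shape, "->", Z.shape)  # 3 features reduced to 2
print("VIF per feature:", vif(X))
```

In this sketch the two nearly collinear features get a very large VIF while the independent one stays close to 1. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of multicollinearity, so one of the two collinear features would be a candidate for removal.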