Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of data.
It compresses the feature space by identifying a lower-dimensional subspace that captures most of the variance (information) in the complete feature matrix, and then projects the original data onto that subspace.
PCA is sensitive to the scale of the features, so it is normally applied to data that has been standardized.
Steps Involved in PCA
Step 1: Standardizing the dataset
Standardization is all about scaling your data in such a way that all the variables and their values lie within a similar range.
For example, suppose the data set has 2 variables, one with values ranging between 10 and 100 and the other with values between 1000 and 5000. If these variables are used as they are, the result is going to be biased, because the variable with the larger range has a much larger variance and will dominate the principal components.
Therefore, standardizing the data into a comparable range is very important. Standardization is carried out by subtracting the mean of a variable from each of its values and dividing the result by the standard deviation of that variable.
After this step, all the variables in the data are on a standard and comparable scale (mean 0 and standard deviation 1).
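As a minimal sketch of this step, the snippet below standardizes a small made-up data set with NumPy. The array X and the helper name standardize are assumptions chosen for illustration; they do not come from the notebook linked at the end of this article.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features on very different scales
# (roughly 10-100 and 1000-5000, as in the example above).
X = np.array([
    [10.0, 1000.0],
    [25.0, 2000.0],
    [50.0, 3000.0],
    [75.0, 4000.0],
    [100.0, 5000.0],
])


def standardize(data):
    """Return z-scores: (value - column mean) / column standard deviation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation of each column
    return (data - mean) / std


X_std = standardize(X)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After this transformation both columns contribute on the same footing, whatever their original units were.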
Step 2: Computing the covariance matrix
This step aims to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are so highly correlated that they contain redundant information. To identify these correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.
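The following sketch computes the covariance matrix by hand and compares it with NumPy's np.cov. The random data, the artificial correlation between two columns, and the function name covariance_matrix are illustrative assumptions, and the sample (n - 1) normalization is one common convention rather than the only possible one.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features, with feature 2 made to
# depend strongly on feature 0 so that a clear covariance shows up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * X[:, 2]

# Standardize first, as in Step 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)


def covariance_matrix(data):
    """Sample covariance matrix (p x p) of data with one sample per row."""
    n = data.shape[0]
    centered = data - data.mean(axis=0)
    return centered.T @ centered / (n - 1)


C = covariance_matrix(X_std)
print(C.shape)  # (3, 3)
print(np.allclose(C, np.cov(X_std, rowvar=False)))  # True
```

The diagonal entries are the variances of the individual variables, and the off-diagonal entries are the covariances between pairs of variables.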
Step 3: Calculating the Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance matrix to determine the principal components of the data set.
But first, let’s understand more about principal components.
What are Principal Components?
Simply put, principal components are a new set of variables obtained from the initial set of variables. Each one is a linear combination of the initial variables.
The principal components are computed in such a manner that the newly obtained variables are uncorrelated with each other and ordered by how much of the variance in the data they capture.
The first few principal components compress most of the useful information that was scattered among the initial variables.
If your data set has 5 dimensions, then 5 principal components are computed, such that the first principal component stores the maximum possible information, the second one stores the maximum of the remaining information, and so on.
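To make the idea of "information per component" concrete, the sketch below takes a made-up 5-dimensional data set, computes the eigenvalues of its covariance matrix, and reports the share of the total variance that each principal component carries. The data and variable names are assumptions for illustration. np.linalg.eigh is used here because the covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reversed before printing.

```python
import numpy as np

# Hypothetical data: 200 samples, 5 features, two of them strongly related.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 2.0 * X[:, 0]

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(X_std, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix.
# Column i of `eigenvectors` belongs to eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Share of the total variance captured by each component, largest first.
explained = eigenvalues[::-1] / eigenvalues.sum()
print(len(eigenvalues))  # 5 components for 5 dimensions
print(explained)         # fractions that add up to 1
```

The eigenvectors are orthogonal to each other, which is why the resulting components are uncorrelated.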
Step 4: Computing the Principal Components
Once we have computed the eigenvectors and eigenvalues, all we have to do is sort them in descending order of eigenvalue. The eigenvector with the highest eigenvalue is the most significant and thus forms the first principal component. The principal components of lesser significance can then be removed to reduce the dimensions of the data.
The final step in computing the principal components is to form a matrix known as the feature vector, whose columns are the eigenvectors that were kept, that is, the directions that possess the maximum information about the data.
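A possible way to express this sorting and selection in code is sketched below. The 3 × 3 covariance matrix is a made-up example, and top_k_components is a hypothetical helper name, not a library function.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix (symmetric), for illustration only.
cov = np.array([
    [2.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 0.5],
])


def top_k_components(cov_matrix, k):
    """Return the k largest eigenvalues and their eigenvectors as columns."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    # Indices that sort the eigenvalues from largest to smallest.
    order = np.argsort(eigenvalues)[::-1]
    kept = order[:k]
    return eigenvalues[kept], eigenvectors[:, kept]


values, feature_vector = top_k_components(cov, k=2)
print(feature_vector.shape)  # (3, 2): one column per kept component
```

Choosing k is a judgment call; a common approach is to keep enough components to cover a chosen fraction of the total variance.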
Step 5: Reducing the dimensions of the data set
The last step in performing PCA is to re-express the original data in terms of the final principal components, which represent the maximum and most significant information of the data set.
To replace the original data axes with the newly formed principal components, you simply multiply the transpose of the feature vector by the transpose of the standardized data set.
Organizing information in principal components this way allows you to reduce dimensionality without losing much information: you discard the components with low information and treat the remaining components as your new variables.
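Putting the five steps together, a compact sketch of the whole procedure might look like the following. The function name pca_project and the random data are assumptions made for illustration. The data is stored with one sample per row, so the projection is written as the standardized data multiplied by the matrix of kept eigenvectors.

```python
import numpy as np


def pca_project(data, k):
    """Project data (one sample per row) onto its top-k principal components."""
    # Step 1: standardize each column.
    data_std = (data - data.mean(axis=0)) / data.std(axis=0)
    # Step 2: covariance matrix of the standardized data.
    cov = np.cov(data_std, rowvar=False)
    # Step 3: eigenvalues and eigenvectors (eigh returns ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    feature_vector = eigenvectors[:, order]
    # Step 5: express the data in the new axes.
    return data_std @ feature_vector, eigenvalues[order]


# Hypothetical data: 150 samples, 4 features, one nearly redundant.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=150)

Z, kept_eigenvalues = pca_project(X, k=2)
print(Z.shape)  # (150, 2)
```

The columns of Z are the new variables. Their covariance matrix is diagonal, with the kept eigenvalues on the diagonal, which confirms that the new variables are uncorrelated.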
So that was the theory behind the entire PCA process.
1] It’s time to get your hands dirty and perform all these steps by using a real data set: https://github.com/gskdhiman/Understanding-PCA/blob/master/PCA_code.ipynb
2] To perform each step mathematically, visit this article: https://www.gatevidyalay.com/tag/principal-component-analysis-numerical-example/