Z-score normalization is a feature scaling technique, so let’s look at feature scaling first. Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute distances between data points are sensitive to scale: without scaling, features with numerically larger values dominate the distance. Tree-based algorithms, by contrast, are fairly insensitive to the scale of the features. Many machine learning and deep learning algorithms also train and converge more quickly when features are scaled. Normalization and standardization are two of the most popular and, at the same time, most frequently confused feature scaling techniques. Here, we discuss z-score normalization, also called standardization.
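To illustrate why unscaled features distort distance calculations, here is a minimal sketch using NumPy. The data (ages and incomes for three people) and the helper name `euclidean` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: each row is a person, columns are age (years) and income (dollars).
X = np.array([
    [25.0, 40_000.0],
    [45.0, 42_000.0],
    [30.0, 90_000.0],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances are driven almost entirely by income, the numerically larger feature.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Standardize each column: subtract its mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print("raw distances:   ", raw_01, raw_02)
print("scaled distances:", scaled_01, scaled_02)
```

In the raw data, the 20-year age gap between the first two people contributes almost nothing to their distance. After scaling, both features contribute on comparable terms.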
What is z-score normalization?
Z-score normalization, also referred to as zero-mean normalization or standardization, rescales values based on the mean and standard deviation of the data. Each value is converted to a common scale on which the mean equals zero and the standard deviation equals one. Technically, a z-score measures how many standard deviations a value lies below or above the mean. Putting features on this common scale makes values measured in different units easier to compare and interpret. The statistical z-score should not be confused with the Altman Z-score, a separate model developed by Edward Altman to estimate the chances of a public company going bankrupt; the “z” here is the conventional symbol for a standardized value, as in the standard normal distribution. Because the transformed features have no predefined range, standardization copes with outliers better than range-based scaling such as min-max normalization, where a single extreme value squeezes the remaining values into a narrow band. Outliers still influence the mean and standard deviation, however, so standardization is not fully immune to them.
We use the following formula to perform z-score normalization on every value in a dataset:

z = (x – μ) / σ

where:
- x: Original value
- μ: Mean of data
- σ: Standard deviation of data
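The definitions above can be sketched in a few lines of NumPy. The function name `zscore_normalize` and the sample data are assumptions made for illustration, and the sketch uses the population standard deviation (dividing by n).

```python
import numpy as np

def zscore_normalize(values):
    """Return (x - mean) / std for every value, using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # ddof=0, i.e. population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return (values - mu) / sigma

# Illustrative data with mean 5 and standard deviation 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = zscore_normalize(data)

print(z)                  # standardized values
print(z.mean(), z.std())  # approximately 0 and 1
```

After the transformation the values have mean 0 and standard deviation 1, which is the defining property of z-score normalization.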
For normally distributed data, approximately:
- 68% of the data points lie within ±1 standard deviation of the mean.
- 95% of the data points lie within ±2 standard deviations of the mean.
- 99.7% of the data points lie within ±3 standard deviations of the mean.
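These percentages can be checked numerically. The sketch below uses only the Python standard library and the identity that, for a normal distribution, the fraction of values within k standard deviations of the mean is erf(k / √2). The function name `coverage_within` is an assumption made for illustration.

```python
import math

def coverage_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {coverage_within(k):.4f}")
# Prints 0.6827, 0.9545 and 0.9973 for k = 1, 2 and 3.
```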
Interpreting Z-scores: The z-score is positive if the value lies above the mean, and negative if it lies below the mean.
Here are some important facts about z-scores:
- A positive z-score says the data point is above average.
- A negative z-score says the data point is below average.
- A z-score close to 0 says the data point is close to average.
- A data point can be considered unusual if its z-score is above 3 or below -3.
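As a rough sketch of the last rule of thumb, the following flags values whose z-score is above 3 or below -3. The function name `flag_unusual`, the default threshold argument, and the sample readings are hypothetical.

```python
import numpy as np

def flag_unusual(values, threshold=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    # Assumes the data are not all identical (non-zero standard deviation).
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Hypothetical readings: nineteen typical values and one extreme value.
readings = [10.0] * 19 + [100.0]
mask = flag_unusual(readings)

print(np.asarray(readings)[mask])  # only the extreme reading is flagged
```

Note that in very small samples no point can reach a z-score of 3, because the extreme value itself inflates the standard deviation, so this rule is most useful on larger datasets.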
Advantages of z-score normalization:
- It allows a data administrator to understand the probability of a score occurring within the normal distribution of the data.
- It enables a data administrator to compare two scores that come from different normal distributions.
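Both advantages can be sketched with the standard library alone. All numbers below (the two subjects, their means and standard deviations, and the scores) are hypothetical, and the helper names `z_score` and `normal_cdf` are assumptions made for illustration. The probability step is only meaningful if the scores are approximately normally distributed.

```python
import math

def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

def normal_cdf(z):
    """Probability that a standard normal value is less than or equal to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical scores from two exams graded on different scales.
math_z = z_score(85, mean=70, std=10)    # 1.5 standard deviations above the mean
physics_z = z_score(75, mean=60, std=5)  # 3.0 standard deviations above the mean

print("math z-score:   ", math_z, "fraction scoring lower:", round(normal_cdf(math_z), 4))
print("physics z-score:", physics_z, "fraction scoring lower:", round(normal_cdf(physics_z), 4))
```

Although the raw math score is higher, the physics score is further above its own average, so it is the stronger result relative to its distribution.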
Example: Suppose the scores for a certain exam are normally distributed with a mean of 80 and a standard deviation of 4. Find the z-score for an exam score of 87.
We can use the following steps to calculate the z-score:
- The mean is μ = 80
- The standard deviation is σ = 4
- The individual value we’re interested in is x = 87
Thus, z = (x – μ) / σ = (87 – 80) / 4 = 1.75.
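The same calculation in plain Python, as a quick check. The variable names are chosen for illustration; the last step reverses the transformation to show that the original score can be recovered from the z-score.

```python
mu = 80.0     # mean of the exam scores
sigma = 4.0   # standard deviation of the exam scores
score = 87.0  # the individual exam score

z = (score - mu) / sigma
print(z)  # 1.75

# Reversing the transformation recovers the original score.
recovered = mu + z * sigma
print(recovered)  # 87.0
```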
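In Python, the SciPy library provides the scipy.stats.zscore function to calculate z-scores:
scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
Here, a is the input array, axis is the axis along which the z-scores are computed, ddof is the degrees-of-freedom correction used when calculating the standard deviation, and nan_policy controls how NaN values in the input are handled.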
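A short usage sketch of scipy.stats.zscore follows. The sample arrays are illustrative assumptions, not data from this article.

```python
import numpy as np
from scipy import stats

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default ddof=0 uses the population standard deviation.
z_population = stats.zscore(data)

# ddof=1 uses the sample standard deviation instead.
z_sample = stats.zscore(data, ddof=1)

# The default result matches the manual formula (x - mean) / std.
manual = (data - data.mean()) / data.std()
print(np.allclose(z_population, manual))

# For a 2-D array, axis=0 standardizes each column independently.
table = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])
z_columns = stats.zscore(table, axis=0)
print(z_columns)
```

Because each column is standardized against its own mean and standard deviation, the two columns of `z_columns` end up identical even though their raw scales differ by a factor of 100.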