
How Normalization Affects Random Forest Algorithm

 

Recently, I was implementing a Random Forest regressor when I ran into the classic question: should I normalize the data?

Before going into the depth of the topic, we will try to understand what normalization is.

Normalization

The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Raw data often contains attributes on very different scales. A feature with a larger range can influence some models more strongly, but that does not make it a more important predictor. So we normalize the data to bring all the variables to the same range. Normalization can also speed up the optimization of the model.

For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.
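To make the two-column example concrete, here is a minimal sketch in plain Python of the two rescalings used later in this post: min-max normalization and standardization. The column values and the helper names `min_max_scale` and `standardize` are made up for illustration; they do not come from the dataset or from any library.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns: one in [0, 1], one in [10,000, 100,000]
small = [0.0, 0.25, 0.5, 1.0]
large = [10_000.0, 32_500.0, 55_000.0, 100_000.0]

small_scaled = min_max_scale(small)
large_scaled = min_max_scale(large)
small_std = standardize(small)
large_std = standardize(large)

print(small_scaled, large_scaled)
print(small_std, large_std)
```

In this sketch the second column is just a stretched and shifted copy of the first, so after either rescaling the two columns end up on the same scale.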

Where do we use normalization?

Algorithms that are distance based, such as KNN, SVM, and clustering algorithms like k-means, require scaling. Models fitted by iterative optimization, such as linear/non-linear regressions, logistic regression, and neural networks, also generally benefit from it, especially when they are regularized or trained with gradient descent.

Algorithms used for matrix factorization, decomposition, or dimensionality reduction, such as PCA, SVD, and Factorization Machines, also require normalization.

Algorithms that do not require normalization/scaling are the ones that rely on rules. They would not be affected by any monotonic transformations of the variables. Scaling is a monotonic transformation – the relative order of smaller to a larger value in a variable is maintained post the scaling. Examples of algorithms in this category are all the tree-based algorithms – CART, Random Forests, Gradient Boosted Decision Trees, etc. These algorithms utilize rules (series of inequalities) and do not require normalization.
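The invariance to monotonic transformations can be sketched with a single decision rule. The example below is a simplified, hypothetical stand-in for one tree node (the helpers `sum_sq` and `best_split` and the data are invented for illustration, and real libraries use more elaborate split search). It picks the threshold split that minimizes squared error and checks which samples land in the left branch.

```python
import math

def sum_sq(ys, idx):
    """Sum of squared deviations from the mean of ys over the given indices."""
    m = sum(ys[i] for i in idx) / len(idx)
    return sum((ys[i] - m) ** 2 for i in idx)

def best_split(xs, ys):
    """Find the threshold split on one feature that minimizes squared error.

    Returns (error, indices of the samples sent to the left branch).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # tied values cannot be separated by a threshold
        left, right = order[:k], order[k:]
        error = sum_sq(ys, left) + sum_sq(ys, right)
        if best is None or error < best[0]:
            best = (error, frozenset(left))
    return best

xs = [3.0, 10.0, 1.0, 7.0, 5.0]  # hypothetical feature values
ys = [1.0, 9.0, 1.5, 8.0, 2.0]   # hypothetical targets

lo, hi = min(xs), max(xs)
xs_minmax = [(x - lo) / (hi - lo) for x in xs]  # linear rescaling
xs_log = [math.log(x) for x in xs]              # non-linear but monotonic

raw_split = best_split(xs, ys)
minmax_split = best_split(xs_minmax, ys)
log_split = best_split(xs_log, ys)
print(raw_split == minmax_split == log_split)
```

Because both transformations preserve the ordering of the feature values, the rule sends the same samples to the same side in all three cases; only the numeric value of the threshold would differ.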

Algorithms that rely on the distributions of the variables, like Naive Bayes, also do not need scaling.

Tree-based algorithms like decision trees and random forests do not measure distances between points in feature space. They work on an “if-then” principle: the algorithm asks the data a series of questions, and each yes or no answer leads either to another question or to a result. Changing the scale of a feature therefore does not improve their performance or accuracy. You also don’t get any analog of a regression coefficient, which measures the relationship between each predictor variable and the response, so you don’t need to consider how to interpret such coefficients, which are affected by variable measurement scales.

So far, we know that decision tree-based algorithms do not, in general, need normalization. But I realized that every time I heard this statement it was about classifiers. So what about regressors? Do they need data normalization?

After digging deeper, I found that data normalization may slightly affect a regressor’s output. Here is my takeaway, followed by a short demonstration in Python.

Takeaway

For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.
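The two aggregation rules can be sketched in a few lines. The per-tree votes and outputs below are invented, and the function names are hypothetical. Note that this shows the textbook hard-vote rule described above; scikit-learn's classifier averages the trees' predicted class probabilities instead, so treat this as a simplification.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the average of the individual tree predictions."""
    return mean(tree_outputs)

# Hypothetical predictions from five trees for a single sample
votes = ["low", "high", "high", "medium", "high"]
outputs = [21.0, 24.5, 23.0, 22.5, 24.0]

print(aggregate_classification(votes))  # high
print(aggregate_regression(outputs))    # 23.0
```

A small change in one tree's output can leave the majority class untouched, while it always shifts the mean a little. That is the intuition behind the difference discussed next.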

Therefore, in the experiment below, data normalization does not change the output of the Random Forest classifier, while it slightly changes the output of the Random Forest regressor. A regressor is also more affected by the high-end values if the data is not transformed, so it will probably fit high values more closely than low values. Transformations such as a log-transform reduce the relative importance of these high values and may therefore help the model generalize better.
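Here is a minimal sketch of the point about high values, assuming the log-transform is applied to the target and that errors are measured as squared differences. The two target values and the uniform 10% overshoot are hypothetical.

```python
import math

def squared_error(y_true, y_pred):
    """Squared difference between a target and its prediction."""
    return (y_true - y_pred) ** 2

targets = [10.0, 200.0]                   # one low and one high target
predictions = [t * 1.1 for t in targets]  # every prediction overshoots by 10%

# On the raw scale, the same relative miss costs far more on the high target.
raw_errors = [squared_error(t, p) for t, p in zip(targets, predictions)]

# On the log scale, the same relative miss costs the same on both targets.
log_errors = [
    squared_error(math.log(t), math.log(p))
    for t, p in zip(targets, predictions)
]

print(raw_errors)
print(log_errors)
```

On the raw scale the high target dominates the total error by a factor of several hundred, whereas on the log scale both targets contribute equally.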

Experiment

Here is a short experiment showing that data normalization changes the output for regressors but leaves it the same for classifiers.

  • Importing the libraries
  • Loading the dataset

The dataset selected is the Boston House-prices dataset, used for regression tasks.

 
Keys: dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Feature names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Data shape: (506, 13)
Target shape: (506,)

To use this dataset also for classification, the target variable has been rounded.

 
Fig 1: Box Plot of the target Variable
 
From Fig 1, we can spot some outliers: the points that fall beyond the whiskers of the box plot.
 

Random Forest classifier

A Random Forest classifier builds many decision trees (DTs), each on a randomly selected subset of the training set. It then collects the votes of the individual trees and makes the final prediction based on those votes.

The process is illustrated in the image below:

[Image: The architecture of the Random Forest classifier]

Here is the implementation of the Random Forest classifier under three conditions: (1) no normalization, (2) min-max normalization, and (3) standardization.

As observed, the data normalization does not affect the accuracy score.

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# X holds the features and y_round the rounded target from the loading step
clf = RandomForestClassifier(random_state=16)

# Classification without normalization
X_train, X_test, y_train, y_test = train_test_split(X, y_round, test_size=0.33, random_state=16)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f'Accuracy score without normalization: {metrics.accuracy_score(y_test, y_pred)}')

# Min-max normalization: fit the scaler on the training split only
sc = MinMaxScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
clf.fit(X_train_norm, y_train)
y_pred = clf.predict(X_test_norm)
print(f'Accuracy score with min-max normalization: {metrics.accuracy_score(y_test, y_pred)}')

# Standardization: zero mean and unit variance per feature
sc = StandardScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
clf.fit(X_train_norm, y_train)
y_pred = clf.predict(X_test_norm)
print(f'Accuracy score with standardization: {metrics.accuracy_score(y_test, y_pred)}')

Accuracy score without normalization: 0.12574850299401197
Accuracy score with min-max normalization: 0.12574850299401197
Accuracy score with standardization: 0.12574850299401197

Random Forest regressor

Each individual decision tree has high variance, because it is fitted closely to its own sample of the data. When we combine many trees trained in parallel, the resulting variance is low, since the output no longer depends on one decision tree but on many. In a classification problem, the final output is obtained by majority voting. In a regression problem, the final result is the mean of all the tree outputs. This step is called aggregation.

The process is illustrated in the image below:

[Image: Random Forest regression (source: Chaya Bakshi, Level Up Coding)]

Here is the implementation of the Random Forest regressor under three conditions: (1) no normalization, (2) min-max normalization, and (3) standardization.

In this case, data normalization slightly changes the mean squared error of the regressor.

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# X holds the features and y the original (unrounded) target from the loading step
reg = RandomForestRegressor(random_state=16)

# Regression without normalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=16)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(f'Mean Squared score without normalization: {metrics.mean_squared_error(y_test, y_pred)}')

# Min-max normalization: fit the scaler on the training split only
sc = MinMaxScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
reg.fit(X_train_norm, y_train)
y_pred = reg.predict(X_test_norm)
print(f'Mean Squared score with min-max normalization: {metrics.mean_squared_error(y_test, y_pred)}')

# Standardization: zero mean and unit variance per feature
sc = StandardScaler()
X_train_norm = sc.fit_transform(X_train)
X_test_norm = sc.transform(X_test)
reg.fit(X_train_norm, y_train)
y_pred = reg.predict(X_test_norm)
print(f'Mean Squared score with standardization: {metrics.mean_squared_error(y_test, y_pred)}')

Mean Squared score without normalization: 13.38962275449102
Mean Squared score with min-max normalization: 13.478456820359284
Mean Squared score with standardization: 13.38586179640719

From the two cases above, we can say that the accuracy score of the classifier is not affected by normalization, whereas the Mean Squared score of the regressor changes only minutely, by an amount that is almost negligible.
