Python Series 3: Feature Scaling for Machine Learning (Normalization Vs Standardization)

by: Robby Alfardo Irfan, Learner by doing

Photo by Elena Mozhvilo on Unsplash

When we want to build our models, our data must be prepared first. Our preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms, and sales volume. It has multiple features spanning varying degrees of magnitude, range, and units. This is a significant obstacle as a few machine learning algorithms are highly sensitive to these features.

Feature scaling is an essential step in preparing data for modeling. It puts the features on a common scale so that they are easier to analyze, and it can significantly improve the performance of some machine learning algorithms.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.

Okay, let’s learn about normalization and standardization in this series!

Why Should We Use Feature Scaling?

Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant to it. Let’s take a look at why some of them are sensitive.

Gradient Descent Based Algorithms

Some machine learning algorithms, such as linear regression, logistic regression, and neural networks, use gradient descent as their optimization technique, and gradient descent works best when the data is scaled.

Here is the formula of gradient descent:

Gradient descent formula

The feature value X appears in that formula, so it affects the step size of gradient descent. Differences in the ranges of the features cause different step sizes for each feature. Therefore, we bring the data to a similar scale before feeding it to the model, so that gradient descent moves smoothly towards the minimum and the steps are updated at the same rate for all the features.
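
For reference, a common textbook way to write the gradient descent update for linear regression is shown below. This is an assumption about the form of the figure above, not a copy of it. The symbols are introduced here for illustration: theta_j is the weight of feature j, alpha is the learning rate, m is the number of samples, and h_theta is the model's prediction.

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```

The factor x_j^{(i)} multiplies every update, so a feature measured in thousands tends to produce much larger steps than a feature measured in fractions.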

Distance-Based Algorithms

Distance-based algorithms such as KNN, K-means, and Support Vector Machines are the most affected by the range of features, because at their core they use the distance between data points to determine similarity.

Here is the Euclidean distance formula that these algorithms commonly use, where x and y are two data points with n features each:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

If the features have different scales, the features with a larger magnitude are likely to be given a higher weight. This hurts the performance of the machine learning algorithm, and we obviously do not want our algorithm to be biased towards one feature.
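
To make this concrete, here is a small sketch with made-up numbers (the ages and incomes are hypothetical and are not taken from the diabetes dataset). It compares Euclidean distances before and after a simple Min-Max rescaling of the three points.

```python
import numpy as np

# Hypothetical data: two features per person, (age in years, income in dollars).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # almost the same age, somewhat different income

def euclidean(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    return np.sqrt(np.sum((p - q) ** 2))

# Unscaled: the income column dominates the distance.
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Min-Max scale each feature using the three points themselves.
points = np.vstack([a, b, c])
mins, maxs = points.min(axis=0), points.max(axis=0)
scaled = (points - mins) / (maxs - mins)

d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])

print(f"raw:    d(a,b)={d_ab_raw:.1f}  d(a,c)={d_ac_raw:.1f}")
print(f"scaled: d(a,b)={d_ab_scaled:.3f}  d(a,c)={d_ac_scaled:.3f}")
```

On the raw values, b looks closer to a than c does, purely because income is measured in much larger numbers than age. After scaling, both features contribute on comparable terms and c becomes the closer point.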

Tree-Based Algorithms

Tree-based algorithms, on the other hand, are not sensitive to the scale of the features in the way the two previous families are. A decision tree splits a node based on a single feature, and that split is not influenced by the other features. Therefore, tree-based algorithms are not affected by differing feature ranges.

Tree-based. Credit:
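
The sketch below illustrates the idea on a single hypothetical feature. It is a simplified threshold search using Gini impurity, written only for illustration, and is not how any particular library implements its trees. The names `gini`, `best_split_mask`, `income`, and `bought` are all made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_mask(feature, labels):
    """Return the boolean 'goes left' mask of the threshold with the lowest weighted Gini."""
    values = np.unique(feature)                     # sorted distinct values
    thresholds = (values[:-1] + values[1:]) / 2     # midpoints between neighbors
    best_mask, best_score = None, np.inf
    for t in thresholds:
        left = feature <= t
        score = (left.sum() * gini(labels[left])
                 + (~left).sum() * gini(labels[~left])) / len(labels)
        if score < best_score:
            best_mask, best_score = left, score
    return best_mask

# Hypothetical feature and labels.
income = np.array([20_000.0, 35_000.0, 50_000.0, 80_000.0, 120_000.0])
bought = np.array([0, 0, 0, 1, 1])
income_scaled = (income - income.min()) / (income.max() - income.min())

same = np.array_equal(best_split_mask(income, bought),
                      best_split_mask(income_scaled, bought))
print(same)
```

Rescaling changes the threshold value but not which samples fall on each side of it, so the split groups the samples identically.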

Introducing Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

There are two primary approaches to feature scaling, which we will cover in the remainder of this article:

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1. It is also known as Min-Max scaling.

The formula for normalization using Min-Max scaling:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Here, Xmin and Xmax are the minimum and the maximum values of the feature respectively.
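
As a quick check of the formula, here is a minimal NumPy sketch. The helper name `min_max_normalize` and the sample ages are made up for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a 1-D array to the range [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

ages = [20, 30, 40, 60]              # hypothetical feature values
ages_norm = min_max_normalize(ages)
print(ages_norm)                     # the smallest value maps to 0, the largest to 1
```

Note that this sketch assumes the feature is not constant. If Xmax equals Xmin, the denominator is zero and the case needs separate handling.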

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is also known as Z-score scaling.

The formula for standardization using Z-score scaling:

$$X' = \frac{X - \mu}{\sigma}$$

Here, μ (mu) is the mean of the feature values and σ (sigma) is their standard deviation. Note that in this case, the values are not restricted to a particular range.
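
The same kind of check works for the Z-score. In this minimal sketch the helper name `standardize` and the sample values are hypothetical. It uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

def standardize(x):
    """Shift a 1-D array to mean 0 and scale it to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()   # x.std() uses ddof=0 (population std)

values = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical feature values: mean 5, std 2
z = standardize(values)
print(z)
```

The result has mean 0 and standard deviation 1, but individual values can fall outside any fixed interval. Here the largest value maps to 2.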

Tips: Which Method To Use

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.

Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution, although this does not necessarily have to be true. Also, unlike normalization, standardization does not have a bounding range, so outliers are not squeezed into a fixed interval along with the rest of the data. Keep in mind, though, that outliers still influence the mean and standard deviation used for the scaling.
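
The effect of a single outlier on both methods can be seen in a small sketch. The numbers are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical feature with one outlier

# Min-Max scaling: the outlier becomes 1.0 and squeezes everything else near 0.
normalized = (values - values.min()) / (values.max() - values.min())

# Z-score scaling: no fixed range, but the outlier still shifts the mean and inflates the std.
standardized = (values - values.mean()) / values.std()

print(np.round(normalized, 3))
print(np.round(standardized, 3))
```

After Min-Max scaling, the four ordinary values end up within roughly 0.03 of each other. Standardization keeps them a little further apart, but the mean it subtracts (22) sits far from all four of them because of the outlier.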

However, which method to use depends on the problem and the machine learning algorithm you are using. You can always start by fitting your model to raw, normalized, and standardized data and comparing the performance to find the best result.

Implementing Feature Scaling

Loading The Dataset

We will use Diabetes data from Kaggle. It was created in 2017. Additionally, variables with NaN values have been handled using Imputation (Mean and Median). Let’s begin by reading our data as a pandas DataFrame:

# Data input and description
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

diabetes_data = pd.read_csv('.../input/diabetes.csv')

The image shows the info of the dataset

A sample of the dataset

There are 768 observations with 8 predictor variables, plus the Outcome column that we will use as the target. We can see from the data type of each column what type of variable it is. All of the variables that we will use are numeric.

Describing the dataset

We can see that there is a huge difference in the range of values across our numerical features. We can easily notice that Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age don’t have the same scale, and this will cause some issues in our machine learning models.

Splitting Our Dataset into Training and Test Set

The training set is the data our model is going to learn from in order to make predictions. Generally, we split the dataset in a 70:30 or 80:20 ratio, which means that 70 (or 80) percent of the data goes into the training set and the remaining 30 (or 20) percent goes into the test set. However, this split can vary according to the shape and size of the dataset. Here we use an 80:20 split.

#Splitting training and testing data
from sklearn.model_selection import train_test_split

diabetes_target = diabetes_data.Outcome
diabetes_predictors = diabetes_data.drop(['Outcome'], axis=1)
# Hold out 20% of the rows as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_predictors, diabetes_target, test_size=0.2, random_state=0)

How Normalization Works

To normalize our data, we import MinMaxScaler from the sklearn library. We fit the scaler on the training data only and then use it to transform both the training and the test data.

#Data Normalization
from sklearn.preprocessing import MinMaxScaler
normalization = MinMaxScaler().fit(X_train)
X_train_norm = pd.DataFrame(normalization.transform(X_train), columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age'])
X_test_norm = pd.DataFrame(normalization.transform(X_test), columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age'])

Let’s see how normalization has affected our dataset:

Training data after normalization

How Standardization Works

#Data Standardization
from sklearn.preprocessing import StandardScaler
standardization = StandardScaler()
# Fit on the training data only, then reuse the same mean and standard deviation for the test data
X_train_std = pd.DataFrame(standardization.fit_transform(X_train),
    columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age'])
X_test_std = pd.DataFrame(standardization.transform(X_test),
    columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age'])

Let’s see how standardization has affected our dataset:

Training data after standardization

Comparing original, normalized, and standardized data

We can understand the distribution more by visualizing our data first. We can see the comparison between our unscaled and scaled data using boxplots.

#Comparing BoxPlot
fig, axes = plt.subplots(3, 1, figsize=(12,12))
sns.boxplot(x="variable", y="value", data=pd.melt(X_train), ax=axes[0]).set(
    title='Original Data', xlabel='')
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_norm), ax=axes[1]).set(
    title='Normalized Data', xlabel='')
sns.boxplot(x="variable", y="value", data=pd.melt(X_train_std), ax=axes[2]).set(
    title='Standardized Data', xlabel='')

Boxplots comparison between scaled and unscaled data

We can see how scaling the features brings everything into perspective. Based on the graph above, our normalized data is spread over the range 0 to 1, and our standardized data has a mean of 0 and a standard deviation of 1. The features are now more comparable and will have a similar effect on our learning models.

Applying Scaling to Machine Learning Algorithms

It’s now time to train some machine learning algorithms on our data to compare the effects of different scaling techniques on the performance of the algorithm. I want to see the effect of scaling on three algorithms in particular: K-Nearest Neighbours, Support Vector Machine, and Decision Tree.

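Note: We measure the RMSE between the predicted and the true labels here. Because the labels are 0 and 1, the RMSE equals the square root of the misclassification rate, so lower is better.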

K-Nearest Neighbours (KNN)

In KNN, K is the number of nearest neighbors, and it is the core deciding factor. Take the simplest case, K = 1. Suppose P1 is the point whose label we need to predict. First, we find the closest point to P1, and then we assign the label of that nearest point to P1.

KNN works from

For K greater than 1, each neighbor votes for its class, and the class with the most votes is taken as the prediction. KNN has the following basic steps:

  1. Calculate distance
  2. Find closest neighbors
  3. Vote for labels
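
The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration and is not how scikit-learn implements KNN. The function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # 1. Calculate distance (Euclidean) from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Find closest neighbors: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # 3. Vote for labels: the most common label among the neighbors wins.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters with labels 0 and 1.
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [2, 2], k=3))
print(knn_predict(X_toy, y_toy, [7, 9], k=3))
```

Because step 1 is a plain Euclidean distance, any feature with a much larger range would dominate the vote, which is why KNN benefits from scaling.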

#KNN model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
knn = KNeighborsClassifier(n_neighbors=6)
rmse_knn = []
trainX = [X_train, X_train_norm, X_train_std]
testX = [X_test, X_test_norm, X_test_std]
# Fit and evaluate the same model on the raw, normalized, and standardized data
for i in range(len(trainX)):
    knn.fit(trainX[i], y_train)
    pred = knn.predict(testX[i])
    rmse_knn.append(np.sqrt(mean_squared_error(y_test, pred)))
df_knn = pd.DataFrame({'RMSE':rmse_knn},index=['Original','Normalized','Standardized'])

RMSE from K-Nearest Neighbor

You can see that scaling the features has brought down the RMSE score of our KNN model. Specifically, the normalized data perform a tad bit better than the standardized data.

Support Vector Machine (SVM)

Support Vector Machines are generally considered a classification approach, but they can be employed in both classification and regression problems. SVM constructs a hyperplane in multidimensional space to separate the different classes. The core idea of SVM is to find a maximum marginal hyperplane (MMH) that best divides the dataset into classes.

SVM works from

SVM searches for the maximum marginal hyperplane in the following steps:

  1. Generate hyperplanes that segregate the classes in the best way. The left-hand side figure shows three hyperplanes: black, blue, and orange. Here, the blue and orange hyperplanes have higher classification errors, while the black one separates the two classes correctly.
  2. Select the hyperplane with the maximum separation from the nearest data points on either side, as shown in the right-hand side figure.
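
#SVM model
from sklearn.svm import SVC
from sklearn.metrics import mean_squared_error
# Named svm_clf so that the variable does not shadow the sklearn.svm module
svm_clf = SVC(kernel='linear')
rmse_svm = []
trainX = [X_train, X_train_norm, X_train_std]
testX = [X_test, X_test_norm, X_test_std]
# Fit and evaluate the same model on the raw, normalized, and standardized data
for i in range(len(trainX)):
    svm_clf.fit(trainX[i], y_train)
    pred = svm_clf.predict(testX[i])
    rmse_svm.append(np.sqrt(mean_squared_error(y_test, pred)))
df_svm = pd.DataFrame({'RMSE':rmse_svm},index=['Original','Normalized','Standardized'])

RMSE from Support Vector Machine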

The standardized data has performed better than the normalized data.

Decision Tree

The decision tree is one of the most popular classification algorithms and one of the easiest to understand and interpret. It can be used for both classification and regression problems.

Decision tree works from

A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. This flowchart-like structure helps you in decision-making.

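#Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
# Fixed seed (an addition here) so that the three fits differ only in how the data is scaled
dt = DecisionTreeClassifier(random_state=0)
rmse_dt = []
trainX = [X_train,X_train_norm,X_train_std]
testX = [X_test,X_test_norm,X_test_std]
# Fit and evaluate the same model on the raw, normalized, and standardized data
for i in range(len(trainX)):
    dt.fit(trainX[i], y_train)
    pred = dt.predict(testX[i])
    rmse_dt.append(np.sqrt(mean_squared_error(y_test, pred)))
df_dt = pd.DataFrame({'RMSE':rmse_dt},index=['Original','Normalized','Standardized'])

RMSE from decision tree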

You can see that the RMSE score has not moved an inch on scaling the features. So rest assured when you are using tree-based algorithms on your data!


I hope you’ve enjoyed this brief tutorial on scaling data with normalization and standardization, and on how they affect different machine learning algorithms in different ways!

Keep in mind that there is no correct answer to when to use normalization over standardization and vice-versa. It all depends on your data and the algorithm you are using.

See ya in the next series!!!

