**Python Series 3: Feature Scaling for Machine Learning (Normalization Vs Standardization)**

*by: Robby Alfardo Irfan, Learner by doing*

When we want to build our models, our data must **be prepared** first. Our data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms, and sales volume, so its features span **varying degrees of magnitude, range, and units**. This is a significant obstacle because some machine learning algorithms are highly sensitive to differences in feature scale.

**Feature scaling** is an essential step in analyzing and preparing data for modeling: we make the **data scale-free for easy analysis**, which can significantly improve the performance of some machine learning algorithms.

Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are **normalization** and **standardization**.

**Okay, let's learn about normalization and standardization in this series!**

# Why Should We Use Feature Scaling?

Some machine learning algorithms **are sensitive** to feature scaling while others are virtually invariant to it. Let's take a look at where that sensitivity comes from.

## Gradient Descent Based Algorithms

Machine learning algorithms that use gradient descent as their optimization technique, such as **linear regression**, **logistic regression**, and **neural networks**, need the data to be scaled.

Here is the formula of gradient descent:
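A standard way to write this update, assuming a linear model fitted with a squared-error cost (the symbols follow common textbook convention and are an assumption here: $\theta_j$ is the weight of feature $j$, $\alpha$ the learning rate, $m$ the number of samples, and $h_\theta$ the model's prediction):

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$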

The step size of the gradient descent will be affected by **the presence of feature value X** in that formula. Differences in the ranges of the features will cause different step sizes for each feature. Therefore, we **bring the data onto a similar scale** before feeding it to the model, to ensure that gradient descent **moves smoothly towards the minima** and that the steps are updated **at the same rate** for all the features.
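To make the effect concrete, here is a minimal sketch that compares the gradient components of a squared-error cost before and after scaling. The toy values and the names `age`, `income`, and `gradient` are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
age = np.array([25.0, 32.0, 47.0, 51.0, 38.0])
income = np.array([30000.0, 52000.0, 81000.0, 45000.0, 67000.0])
X = np.column_stack([age, income])
y = np.array([1.0, 2.0, 3.5, 2.5, 3.0])

def gradient(X, y, theta):
    """Gradient of the cost (1 / 2m) * sum((X @ theta - y) ** 2)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

theta = np.zeros(2)

# Gradient on the raw features
grad_raw = gradient(X, y, theta)

# Gradient after standardizing each feature (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
grad_std = gradient(X_std, y, theta)

# How much larger is the income component than the age component?
ratio_raw = abs(grad_raw[1] / grad_raw[0])
ratio_std = abs(grad_std[1] / grad_std[0])
print(ratio_raw, ratio_std)
```

On this toy data the raw income component is more than a thousand times larger than the age component, while after standardization the two are of similar size, so a single learning rate suits both features.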

## Distance-Based Algorithms

Distance-based algorithms, such as **KNN, K-means, and Support Vector Machine**, are the most affected by the range of features, because at their core they use the **distance** between data points to determine their similarity.

Here is the formula of distance-based using Euclidean distance:
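For two points $\mathbf{x}$ and $\mathbf{y}$ with $n$ features each (the notation is an assumption for illustration), the Euclidean distance is:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$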

If the features have different scales, there is a chance that **higher weightage** is given to features with **higher magnitude**. This will impact the performance of the machine learning algorithm, and obviously we do not want our algorithm **to be biased** towards one feature.
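A small sketch of this bias, using three hypothetical people described by age and income. The points and the feature ranges used for Min-Max scaling are made-up values for illustration only.

```python
import numpy as np

# Hypothetical points: (age in years, income in dollars)
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # similar age, slightly different income

def euclidean(p, q):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((p - q) ** 2))

# On the raw data, income dominates: b looks closer to a than c does
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Min-Max scale with assumed feature ranges (age 18-80, income 20k-120k)
lo = np.array([18.0, 20_000.0])
hi = np.array([80.0, 120_000.0])

def scale(p):
    return (p - lo) / (hi - lo)

# After scaling, both features contribute comparably: c is now closer to a
d_ab_scaled = euclidean(scale(a), scale(b))
d_ac_scaled = euclidean(scale(a), scale(c))
print(d_ab, d_ac, d_ab_scaled, d_ac_scaled)
```

Before scaling, a 35-year age gap counts for less than a 2,000 dollar income gap; after scaling, the ordering of the neighbors flips.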

## Tree-Based Algorithms

On the other hand, tree-based algorithms are **not sensitive** to the different scales of the features in the way the two groups above are, because a decision tree splits a node based on a single feature, and that split is not influenced by the other features. Therefore, tree-based algorithms **are not affected** by the different ranges of the features.

# Introducing Feature Scaling

**Feature scaling** is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

There are two primary feature scaling methods, normalization and standardization, which we will cover in the remainder of this article:

## Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the **range 0 to 1**. It is also known as **Min-Max scaling**.

The formula of normalization using Min-Max scaling:
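Written out, with $X'$ denoting the scaled value (a symbol introduced here for illustration):

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$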

Here, Xmin and Xmax are the minimum and the maximum values of the feature respectively.
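As a quick worked example, here is a minimal NumPy sketch of Min-Max scaling on a made-up feature (the array `x` is hypothetical):

```python
import numpy as np

# Hypothetical feature values
x = np.array([10.0, 20.0, 25.0, 40.0])

# Min-Max scaling: subtract the minimum, divide by the range
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)
```

The smallest value maps to 0, the largest to 1, and 25 lands exactly halfway at 0.5 because it sits in the middle of the 10 to 40 range.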

## Data Standardization

Standardization refers to shifting the distribution of each attribute to have a **mean of zero** and **a standard deviation of one** (unit variance). It is also known as **Z-score scaling**.

The formula of standardization using Z-score scaling:
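Written out, with $X'$ denoting the scaled value (a symbol introduced here for illustration):

$$X' = \frac{X - \mu}{\sigma}$$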

*μ* (mu) is the mean of the feature values and *σ* (sigma) is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.
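A minimal NumPy sketch of Z-score scaling on a made-up feature (the array `x` is hypothetical). It uses the population standard deviation, which is NumPy's default and also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values with mean 5 and population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma = x.std()  # population standard deviation (ddof=0)

# Z-score scaling: subtract the mean, divide by the standard deviation
z = (x - mu) / sigma
print(z)
```

The scaled values have a mean of 0 and a standard deviation of 1, but they are not confined to a fixed interval: the largest value here becomes 2.0.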

# Tips: Which Method To Use

**Normalization** is good to use when you know that the distribution of your data **does not follow a Gaussian distribution**. This can be useful in algorithms that do not assume any distribution of the data like **K-Nearest Neighbors and Neural Networks.**

**Standardization**, on the other hand, can be helpful in cases where the data **follows a Gaussian distribution**. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range, so outliers are not squeezed into a fixed interval together with the rest of the data. Keep in mind, though, that extreme values still influence the mean and standard deviation used for scaling.

However, which method to use depends on the problem and the machine learning algorithm we are using. You can always start by **fitting your model to raw, normalized, and standardized** data and compare **the performance** for the best results.

# Implementing Feature Scaling

## Loading The Dataset

We will use Diabetes data from Kaggle. It was created in 2017. Additionally, variables with NaN values have been handled using Imputation (Mean and Median). Let's begin by reading our data as a `pandas` DataFrame:

    # Data input and description
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings

    sns.set()
    warnings.filterwarnings('ignore')
    # Jupyter magic: render plots inline (notebook only)
    %matplotlib inline

    diabetes_data = pd.read_csv('.../input/diabetes.csv')
    diabetes_data.info()
    diabetes_data.head()
    diabetes_data.describe()

There are **768 observations with 8 variables** used as predictors, plus the `Outcome` column that we want to predict. We can see from the data type of each column what type of variable it is. All of the variables that we will use are numeric variables.

We can see that there is a huge difference in the range of values present in our numerical features. We can easily notice **Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age** don’t have the same scale and this will cause some issues in our machine learning model.

## Splitting Our Dataset into Training and Test Set

Next, we set aside part of the data so that the model learns from one portion and is evaluated on another. Generally, we split the dataset in a **70:30** or **80:20** ratio, meaning that 70 (or 80) percent of the data goes into the training set and the remaining 30 (or 20) percent into the test set. However, this split **can be varied** according to the dataset's **shape and size.**

    # Splitting training and testing data
    from sklearn.model_selection import train_test_split

    diabetes_target = diabetes_data.Outcome
    diabetes_predictors = diabetes_data.drop(['Outcome'], axis=1)

    # Hold out 20% of the rows for testing; random_state fixes the shuffle
    X_train, X_test, y_train, y_test = train_test_split(
        diabetes_predictors, diabetes_target, test_size=0.2, random_state=0)

## How Normalization Works

When we need to normalize our data, we import `MinMaxScaler` from the `sklearn` library.

    # Data normalization
    from sklearn.preprocessing import MinMaxScaler

    # Fit on the training data only, then apply the same scaling to both sets
    normalization = MinMaxScaler().fit(X_train)
    X_train_norm = pd.DataFrame(normalization.transform(X_train), columns=X_train.columns)
    X_test_norm = pd.DataFrame(normalization.transform(X_test), columns=X_test.columns)
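The pattern used above, fitting the scaler on the training split and reusing it on the test split, can be seen more easily on a tiny example. This is only a sketch: the arrays `X_train_toy` and `X_test_toy` are hypothetical and unrelated to the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
X_train_toy = np.array([[1.0], [5.0], [9.0]])
X_test_toy = np.array([[3.0], [11.0]])

# The scaler learns the minimum (1) and maximum (9) from the training data only
scaler = MinMaxScaler().fit(X_train_toy)

train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)
print(train_scaled.ravel(), test_scaled.ravel())
```

Because the test set is scaled with the training minimum and maximum, a test value outside the training range (11 here) ends up outside 0 to 1. That is expected: the test data must not influence the scaling parameters.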

Let’s see how normalization has affected our dataset:

## How Standardization Works

    # Data standardization
    from sklearn.preprocessing import StandardScaler

    standardization = StandardScaler()
    # Learn the mean and standard deviation from the training data
    X_train_std = pd.DataFrame(standardization.fit_transform(X_train), columns=X_train.columns)
    # Reuse the training statistics on the test data (transform, not fit_transform)
    X_test_std = pd.DataFrame(standardization.transform(X_test), columns=X_test.columns)

Let’s see how standardization has affected our dataset:

## Comparing original, normalized, and standardized data

We can understand the distribution more by visualizing our data first. We can see the comparison between our unscaled and scaled data using boxplots.

    # Comparing boxplots
    fig, axes = plt.subplots(3, 1, figsize=(12, 12))
    sns.boxplot(x="variable", y="value", data=pd.melt(X_train), ax=axes[0]).set(
        title='Original Data', xlabel='')
    sns.boxplot(x="variable", y="value", data=pd.melt(X_train_norm), ax=axes[1]).set(
        title='Normalized Data', xlabel='')
    sns.boxplot(x="variable", y="value", data=pd.melt(X_train_std), ax=axes[2]).set(
        title='Standardized Data', xlabel='')
    plt.show()

We can notice how scaling the features brings everything into perspective. Based on the graph above, our normalized data are spread **between 0 and 1**, and our standardized data have **a mean of 0** and **a standard deviation of 1**. The features are now more comparable and will have a similar effect on our learning models.

# Applying Scaling to Machine Learning Algorithms

It’s now time to train some machine learning algorithms on our data to compare the effects of different scaling techniques on the performance of the algorithm. I want to see the effect of scaling on three algorithms in particular: K-Nearest Neighbours, Support Vector Machine, and Decision Tree.

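*Note: We are measuring the RMSE of the predicted class labels here. Because `Outcome` is a 0/1 label, this RMSE is the square root of the misclassification rate, so lower is better.*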
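For 0/1 class labels, every wrong prediction contributes a squared error of exactly 1, so the RMSE reduces to the square root of the misclassification rate. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np

# Hypothetical true and predicted class labels (2 of 8 are wrong)
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Root mean squared error of the labels
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Fraction of misclassified samples
error_rate = np.mean(y_true != y_pred)
print(rmse, error_rate)
```

Here the error rate is 0.25 and the RMSE is 0.5, its square root. A lower RMSE therefore always means a higher accuracy in the comparisons that follow.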

**K-Nearest Neighbours (KNN)**

In KNN, K is the number of nearest neighbors, and it is the core deciding factor. Consider the simplest case, K = 1: suppose P1 is the point whose label needs to be predicted. First, you **find the closest point** to P1, and then the label of that nearest point is assigned to P1.

For K > 1, each of the K nearest neighbors votes for its class, and the class with the most votes is taken as the prediction. KNN has the following basic steps:

- Calculate distance
- Find closest neighbors
- Vote for labels
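The three steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not what scikit-learn does internally; the function `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, point, k=3):
    """Predict the label of `point` by majority vote among its k nearest neighbors."""
    # Step 1: calculate the Euclidean distance to every training point
    distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
    # Step 2: find the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: vote for labels and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_low = knn_predict(X_toy, y_toy, np.array([2.0, 2.0]))
label_high = knn_predict(X_toy, y_toy, np.array([8.0, 9.0]))
print(label_low, label_high)
```

Since step 1 is a plain Euclidean distance, any feature with a much larger range would dominate the neighbor search, which is why KNN benefits from scaling.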

    # KNN model
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import mean_squared_error

    knn = KNeighborsClassifier(n_neighbors=6)
    rmse_knn = []
    trainX = [X_train, X_train_norm, X_train_std]
    testX = [X_test, X_test_norm, X_test_std]

    # Fit and evaluate on the original, normalized, and standardized data
    for i in range(len(trainX)):
        knn.fit(trainX[i], y_train)
        pred = knn.predict(testX[i])
        rmse_knn.append(np.sqrt(mean_squared_error(y_test, pred)))

    # Result
    df_knn = pd.DataFrame({'RMSE': rmse_knn}, index=['Original', 'Normalized', 'Standardized'])
    df_knn

You can see that scaling the features has brought down the RMSE score of our KNN model. Specifically, the normalized data perform a tad bit better than the standardized data.

**Support Vector Machine (SVM)**

Support Vector Machines (SVM) are usually considered a classification approach, but they can be employed in both classification and regression problems. SVM constructs a **hyperplane in multidimensional space to separate different classes.** The core idea of SVM is to find a maximum marginal hyperplane (MMH) that best divides the dataset into classes.

SVM searches for the **maximum marginal hyperplane** in the following steps:

- Generate hyperplanes that segregate the classes in the best way. The left-hand side figure shows three hyperplanes: **black, blue, and orange.** Here, the blue and orange have higher classification errors, but the black separates the two classes correctly.
- Select the right hyperplane with the **maximum segregation** from the nearest data points on either side, as shown in the right-hand side figure.

    # SVM model
    from sklearn.svm import SVC
    from sklearn.metrics import mean_squared_error

    # Named svc so that the sklearn svm module is not shadowed
    svc = SVC(kernel='linear')
    rmse_svm = []
    trainX = [X_train, X_train_norm, X_train_std]
    testX = [X_test, X_test_norm, X_test_std]

    # Fit and evaluate on the original, normalized, and standardized data
    for i in range(len(trainX)):
        svc.fit(trainX[i], y_train)
        pred = svc.predict(testX[i])
        rmse_svm.append(np.sqrt(mean_squared_error(y_test, pred)))

    # Result
    df_svm = pd.DataFrame({'RMSE': rmse_svm}, index=['Original', 'Normalized', 'Standardized'])
    df_svm

The standardized data has performed better than the normalized data.

**Decision Tree**

Decision Tree is one of the most popular classification algorithms and one of the easiest to understand and interpret. It can be utilized for both classification and regression problems.

A decision tree is a **flowchart-like tree structure** where an internal node represents a feature (or attribute), the branch represents a decision rule, and each leaf node represents the outcome. This flowchart-like structure helps you in **decision-making**.

    # Decision tree model
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import mean_squared_error

    # A fixed random_state keeps the three runs comparable
    dt = DecisionTreeClassifier(random_state=0)
    rmse_dt = []
    trainX = [X_train, X_train_norm, X_train_std]
    testX = [X_test, X_test_norm, X_test_std]

    # Fit and evaluate on the original, normalized, and standardized data
    for i in range(len(trainX)):
        dt.fit(trainX[i], y_train)
        pred = dt.predict(testX[i])
        rmse_dt.append(np.sqrt(mean_squared_error(y_test, pred)))

    # Result
    df_dt = pd.DataFrame({'RMSE': rmse_dt}, index=['Original', 'Normalized', 'Standardized'])
    df_dt

You can see that the RMSE score has not moved an inch on scaling the features. So rest assured when you are using tree-based algorithms on your data!
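The same invariance can be checked on a tiny, made-up dataset. This is a sketch under stated assumptions: `X_toy`, `y_toy`, and `X_new` are hypothetical, and a fixed `random_state` is used so that both trees are built the same way.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one small-scale and one large-scale feature.
# Only the first feature separates the two classes.
X_toy = np.array([[1, 1000], [2, 5000], [3, 3000], [4, 2000],
                  [10, 4000], [11, 1500], [12, 2500], [13, 3500]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_new = np.array([[5, 4500], [9, 1200]], dtype=float)

scaler = MinMaxScaler().fit(X_toy)

# One tree on the raw features, one on the Min-Max scaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_toy), y_toy)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(scaler.transform(X_new))
print(pred_raw, pred_scaled)
```

Both trees make the same predictions. Scaling moves the split threshold along with the data, but it does not change which samples fall on which side of it.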

# Summary

I hope you've enjoyed this brief tutorial on scaling data with normalization and standardization, and on how each affects different machine learning algorithms!

Keep in mind that there is no correct answer to when to use normalization over standardization and vice-versa. It all depends on your data and the algorithm you are using.

See ya in the next series!!!