Python Series 2: Handling Missing Values in Python

Photo by Josh Boot on Unsplash
  1. Pairwise deletions, deleting only missing values.
  2. Listwise deletions, deleting the row containing the missing values.
  3. Dropping entire columns, deleting the column containing the missing values.
  1. General imputations, imputing with a constant, mean, median, and mode.
  2. Advanced imputations, using K-Nearest Neighbors (KNN) and Multivariate Imputation by Chained Equations (MICE)

Reading in The Dataset

We will see an example predicting housing prices from the Melbourne Housing data. It was created in September 2017. Additionally, homes with no Price have been removed. Let’s begin by reading our data as a pandas DataFrame:

##Data Inputing
import pandas as pd
housing = pd.read_csv('.../input/melb_data.csv/melb_data.csv')
housing.info()
housing.head()
The Image shows the info of the dataset
The sample of the dataset.
##Target and Predictors
housing_target = housing.Price
housing_predictors = housing.drop(['Price'], axis=1)
#For keeping the example simple, we'll use only numeric predictors
housing_numeric_predictors = housing_predictors.select_dtypes(exclude=['object'])

Detecting Missing Values

We need to create a function to measure the quality of an approach. So, we divide our data into training and test. Also, we have to detect first what variables that contain missing values.

#Splitting our data into Train and Test
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_numeric_predictors, housing_target, test_size=0.2, random_state=0)#Detecting Missing Values of train data
train_missing_val = X_train.isnull().sum()
train_missing_percent = X_train.isnull().sum()*100/len(X_train)
train_missing_table = pd.concat([train_missing_val,train_missing_percent],axis=1)
train_missing_summary = train_missing_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
#Detecting Missing Values of test data
test_missing_val = X_test.isnull().sum()
test_missing_percent = X_test.isnull().sum()*100/len(X_test)
test_missing_table = pd.concat([test_missing_val,test_missing_percent],axis=1)
test_missing_summary = test_missing_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
The summary of missing value in both train and test data

Create Function to Measure Quality of An Approach

We’ve loaded a function mae_score(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values. This function reports the out-of-sample Mean Absolute Error (MAE) score from a Random Forest.

#Function to measure the quality of an approach
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def mae_score(X_train, X_test, y_train, y_test):
model = RandomForestRegressor()
model.fit(X_train, y_train)
predicts = model.predict(X_test)
return mean_absolute_error(y_test, predicts)

DELETIONS

Deletion means deleting the missing values from a dataset. This is however not recommended as it might result in loss of information from the dataset. We should only delete the missing values from a dataset if their proportion is very small. Deletions are further of three types:

1. Pairwise Deletions

Pairwise Deletion is used when values are missing completely at random. During Pairwise deletion, only the missing values are deleted. All operations in pandas like mean, sum, etc intrinsically skip missing values.

2. Listwise Deletion

During Listwise deletion, complete rows(which contain the missing values) are deleted. As a result, it is also called Complete Case deletion. Like Pairwise deletion, listwise deletions are also only used for missing completely at random values.

#Reset index of predictors
X_train_reset = X_train.reset_index()
X_test_reset = X_test.reset_index()
#Get the target without missing value in predictors
listwise_y_train_index = X_train_reset.loc[X_train_reset.isnull().sum(1)>0].index
listwise_y_train = y_train.drop(y_train.index[listwise_y_train_index])
listwise_y_test_index = X_test_reset.loc[X_test_reset.isnull().sum(1)>0].index
listwise_y_test = y_test.drop(y_test.index[listwise_y_test_index])
#Drop rows that containing missing values
listwise_X_train = X_train.dropna()
listwise_X_test = X_test.dropna()
#Get a MAE score
print("Mean Absolute Error from Listwise Deletion:")
print(mae_score(listwise_X_train, listwise_X_test, listwise_y_train, listwise_y_test))
MAE Score of dropping rows with missing values

3. Dropping entire columns

If a column contains a lot of missing values, say more than 80%, and the feature is not significant, you might want to delete that feature. However, again, it is not a good methodology to delete data.

#Dropping entire columns
cols_with_missing = [col for col in X_train.columns
if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
#Get a MAE score
print("Mean Absolute Error from dropping columns with Missing Values:")
print(mae_score(reduced_X_train, reduced_X_test, y_train, y_test))
MAE Score of dropping columns with missing values

IMPUTATIONS

Imputation refers to replacing missing data with substituted values. There are a lot of ways in which the missing values can be imputed depending upon the nature of the problem and data. Depending upon the nature of the problem, imputation techniques can be broadly they can be classified as follows:

1. General Imputations

a. Constant Imputer

##General Imputations
from sklearn.impute import SimpleImputer
#Constant Imputer
constant_imputer = SimpleImputer(strategy='constant')
constant_X_train = constant_imputer.fit_transform(X_train)
constant_X_test = constant_imputer.transform(X_test)
print("Mean Absolute Error from Constant Imputation:")
print(mae_score(constant_X_train, constant_X_test, y_train, y_test))
MAE score of constant imputer
#Mean Imputer
mean_imputer = SimpleImputer(strategy='mean')
mean_X_train = mean_imputer.fit_transform(X_train)
mean_X_test = mean_imputer.transform(X_test)
print("Mean Absolute Error from Mean Imputation:")
print(mae_score(mean_X_train, mean_X_test, y_train, y_test))
MAE score of mean imputer
#Median Imputer
median_imputer = SimpleImputer(strategy='median')
median_X_train = median_imputer.fit_transform(X_train)
median_X_test = median_imputer.transform(X_test)
print("Mean Absolute Error from Median Imputation:")
print(mae_score(median_X_train, median_X_test, y_train, y_test))
MAE score of median imputer
#Mode Imputer
mode_imputer = SimpleImputer(strategy='most_frequent')
mode_X_train = mode_imputer.fit_transform(X_train)
mode_X_test = mode_imputer.transform(X_test)
print("Mean Absolute Error from Mode Imputation:")
print(mae_score(mode_X_train, mode_X_test, y_train, y_test))
MAE score of mode imputer

2. Advanced Imputations

Advanced imputation techniques use machine learning algorithms to impute the missing values in a dataset, unlike the previous techniques where we used other column values to predict the missing values. We shall look at the following two techniques in this notebook:

##Advanced Imputer#KNN
from sklearn.impute import KNNImputer
KNN_imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
KNN_X_train = KNN_imputer.fit_transform(X_train)
KNN_X_test = KNN_imputer.transform(X_test)
print("Mean Absolute Error from KNN Imputation:")
print(mae_score(KNN_X_train, KNN_X_test, y_train, y_test))
MAE score of KNN Imputer
#MICE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
mice_imputer = IterativeImputer()
mice_X_train = mice_imputer.fit_transform(X_train)
mice_X_test = mice_imputer.transform(X_test)
print("Mean Absolute Error from MICE Imputation:")
print(mae_score(mice_X_train, mice_X_test, y_train, y_test))
MAE score of MICE imputer

Conclusion

As is common, imputing missing values allowed us to improve our model compared to dropping those columns. We got an additional boost by tracking what values had been imputed.

Summary of MAE scores from all methods

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store