How to Handle Missing Numerical Values in Data, with Implementation
This article assumes you already understand what missing values are and the various forms they take. If not, please read a short introductory article on the topic first; it will help you get started.
Overview
Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very common problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because it exists and was not collected or because it never existed. For example, different users being surveyed may choose not to share their income, and some may choose not to share their address; in this way, many datasets end up with missing values.
We’ll discuss various methods to handle missing values effectively.
Methods to handle Missing Values
1. Delete Rows or Columns with Missing Values
Missing values can be handled by deleting the rows or columns that contain null values. If a column has more than half of its rows null, the entire column can be dropped. Rows that have one or more null values can also be dropped, as shown below.
dataframe.dropna(inplace=True)
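To make this concrete, here is a minimal sketch on a hypothetical toy DataFrame, showing both row-wise deletion and the "more than half null" column rule via the thresh parameter:

import numpy as np
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({"age": [25, np.nan, 30, np.nan],
                   "income": [50000, 60000, np.nan, 55000]})

df_rows = df.dropna()                             # drop rows containing any null
df_cols = df.dropna(axis=1, thresh=len(df) // 2)  # keep columns with at least half of their values non-null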
Pros:
A model trained after removing all missing values sees only genuine observations, which can make it more robust.
Cons:
1. A huge amount of information is lost.
2. If a large fraction of the values are missing, this method works poorly.
2. Mean/Median/Mode Imputation
NaN values are replaced by the mean, median, or mode of the remaining values in the column. If your data contains outliers, go with the median.
df["column"] = df["column"].fillna(value=df["column"].mean()) df["column"] = df["column"].fillna(value=df["column"].median())
df["column"] = df["column"].fillna(value=df["column"].mode())
Pros:
1. Easy to implement (and the median is robust to outliers)
2. A fast way to obtain a complete dataset
3. Prevents data loss
Cons:
1. Changes or distorts the original variance (see the short demo below)
2. Impacts the correlation with other features
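A minimal sketch of the variance distortion, using a hypothetical toy Series with half of its values missing:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, np.nan])
print(s.var())                   # 1.0, the variance of the observed values
print(s.fillna(s.mean()).var())  # 0.4, the imputed values add no spread, shrinking the variance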
3. Random Sample Imputation
Random sample imputation consists of taking random observations from the observed (non-missing) values of a variable and using them to replace the NaN values.
When should it be used? It assumes that the data are missing completely at random (MCAR).
random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=10)
random_sample.index = df[df[variable].isnull()].index
df.loc[df[variable].isnull(), variable] = random_sample
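The same logic, wrapped in a small reusable helper (a sketch; the function name is my own, and it assumes there are fewer missing values than observed ones, since sampling is done without replacement):

def impute_random_sample(df, variable, random_state=10):
    out = df.copy()
    # Sample as many observed values as there are missing ones
    sample = out[variable].dropna().sample(
        out[variable].isnull().sum(), random_state=random_state
    )
    # Align the sampled values with the index positions of the missing entries
    sample.index = out[out[variable].isnull()].index
    out.loc[out[variable].isnull(), variable] = sample
    return out

df = impute_random_sample(df, "Age")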
Pros:
There is less distortion of the variance, since the imputed values are drawn from the original distribution
Cons:
Randomness won't work in every situation, since the sampled values ignore relationships with other features
4. Capturing NaN Values with a New Feature
We create a new feature that flags missing values with 1 and observed values with 0. This works well if the data are not missing completely at random, because the flag preserves the information that a value was missing.
import numpy as np
df['Age_NAN'] = np.where(df['Age'].isnull(), 1, 0)
Age_NAN is our new feature.
df['Age'] = df['Age'].fillna(df['Age'].median())
Here we fill the NaN values of the Age feature with the median.
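Putting both steps together on a hypothetical toy DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, np.nan, 35, np.nan, 29]})  # hypothetical data
df["Age_NAN"] = np.where(df["Age"].isnull(), 1, 0)        # 1 marks originally-missing rows
df["Age"] = df["Age"].fillna(df["Age"].median())          # the median of 22, 35, 29 is 29
print(df)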
Pros:
Captures the importance of missing values
Cons:
Creating additional features (curse of dimensionality)
5. End of Distribution imputation
If there is suspicion that the values are not missing at random, then capturing that information is important. In this scenario, one would want to replace the missing data with values at the tails of the variable's distribution.
To find the tail we can use three standard deviations from the mean: the mean plus three standard deviations for the right tail, or minus three for the left.
extreme = df.Age.mean() + 3 * df.Age.std()  # value at the right tail of the distribution
df["Age"] = df["Age"].fillna(extreme)
Pros:
1. Easy to implement
2. Captures the importance of “missingness” if there is one
Cons:
1. Distortion of the original variance
2. This technique may mask true outliers in the distribution
6. KNN Imputer
KNN Imputer was first supported by Scikit-Learn in December 2019, when it released version 0.22. This imputer uses the k-Nearest Neighbors method to replace missing values in a dataset with the mean of those values from the 'n_neighbors' nearest neighbors found in the training set. By default, it uses a Euclidean distance metric that can handle missing coordinates (nan_euclidean) to find those neighbors.
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
imputer.fit_transform(X)

Output:
array([[1. , 2. , 4. ],
[3. , 4. , 3. ],
[5.5, 6. , 5. ],
[8. , 8. , 7. ]])
Suppose, you run out of stock of necessary food items in your house, and due to the lockdown none of the nearby stores are open. Therefore, you ask your neighbors for help and you will end up cooking whatever they supply to you. This is an example of imputation from a 1-nearest neighbor (taking the help of your closest neighbor).
Instead, if you identify 3 neighbors from whom you ask for help and choose to combine the items supplied by these 3 nearest neighbors, that is an example of imputation from 3-nearest neighbors. Similarly, missing values in datasets can be imputed using the values of observations from the k-Nearest Neighbors in your dataset. Neighboring points are identified by a distance metric, generally the Euclidean distance.
Another critical point here is that the KNN Imputer is a distance-based imputation method, so it requires us to normalize our data. Otherwise, the different scales of our features will lead the KNN Imputer to generate biased replacements for the missing values. For simplicity, we will use Scikit-Learn's MinMaxScaler, which scales our variables to values between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
scaler = MinMaxScaler()
df[df.columns] = scaler.fit_transform(df)  # scale first, as discussed above
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
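If you need the imputed values back on their original scale, the fitted scaler above can undo the transformation:

df = pd.DataFrame(scaler.inverse_transform(df), columns=df.columns)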
Conclusion
In this article, I have discussed six effective ways to handle missing values in numerical columns. There is no rule of thumb that always performs best; it depends on your understanding of the data, so choose a suitable method, or try all of them and compare their performance. The KNN Imputer maintains the value and variability of your dataset, and it is more precise and efficient than using simple averages.
Thank you for reading; I hope you enjoyed it. Follow me for more.