Impute categorical missing values in scikit-learn


Python Problem Overview


I have a pandas DataFrame in which some columns hold text, and those columns contain NaN values. I want to impute the NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value in its column. The problem is the implementation. Suppose the DataFrame df has 30 columns, 10 of which are categorical. When I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python raises an error: "could not convert string to float: 'run1'", where 'run1' is an ordinary (non-missing) value from the first categorical column.

Any help would be very welcome.

Python Solutions


Solution 1 - Python

To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats; it might make sense to use the median for integer columns instead.

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in the column.

    Columns of other types are imputed with the mean of the column.
    """

    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
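
The median variant mentioned above could be sketched as follows. This is only an illustration, not part of the original answer: the helper name fill_values and its numeric_strategy parameter are made up for this example. Because pandas by default stores an integer column containing NaN as float, the sketch applies the chosen statistic to every numeric column rather than to integer columns specifically.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_values(df, numeric_strategy="median"):
    """Return a {column: fill value} dict (hypothetical helper)."""
    fills = {}
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            # numeric_strategy names a Series method such as "mean" or "median"
            fills[col] = getattr(df[col], numeric_strategy)()
        else:
            # most frequent non-missing value
            fills[col] = df[col].mode().iloc[0]
    return fills


X = pd.DataFrame([
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan],
])

# fillna accepts a dict keyed by column label
xt = X.fillna(fill_values(X))
print(xt)
```

Passing numeric_strategy="mean" gives the same fill values as the class above.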

Solution 2 - Python

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() takes a pandas DataFrame):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub pipelines with sklearn.pipeline.FeatureUnion, for example:

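from sklearn.pipeline import FeatureUnion

# num_pipeline and cat_pipeline are the two sub-pipelines described above
full_pipeline = FeatureUnion(transformer_list=[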
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package is installed with pip install sklearn-pandas, but it is imported as sklearn_pandas.
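
For readers on a scikit-learn version that ships sklearn.compose.ColumnTransformer and sklearn.impute.SimpleImputer, the same split into a numeric branch and a categorical branch might be sketched as below. Note that this swaps in different tools from the ones named in this answer (ColumnTransformer in place of DataFrameSelector plus FeatureUnion, and SimpleImputer in place of Imputer and CategoricalImputer). The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({
    "run": pd.Series(["run1", "run2", "run2", np.nan], dtype=object),
    "score": [1.0, 2.0, np.nan, 6.0],
})

# Each branch receives only the columns listed for it
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["score"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["run"]),
])

# Output columns follow the branch order: score first, then run
out = imputer.fit_transform(df)
print(out)
```

The result is a NumPy array rather than a DataFrame, and its column order follows the order of the branches, not the order of the original columns.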

Solution 3 - Python

The sklearn-pandas package has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> import numpy as np
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

Solution 4 - Python

  • With sklearn.preprocessing.Imputer, strategy='most_frequent' can be used only with quantitative features, not qualitative ones. The custom imputer below can be used for both. Also, the scikit-learn imputer can be applied either to the whole data frame (if all features are quantitative) or in a for loop over a list of columns of the same type (see the example below), whereas the custom imputer can be used with any combination of columns.

          from sklearn.preprocessing import Imputer
          impute = Imputer(strategy='mean')
          for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom imputer:

          import numpy as np
          from sklearn.preprocessing import Imputer
          from sklearn.base import TransformerMixin

          class CustomImputer(TransformerMixin):
              def __init__(self, cols=None, strategy='mean'):
                  self.cols = cols
                  self.strategy = strategy

              def transform(self, df):
                  X = df.copy()
                  impute = Imputer(strategy=self.strategy)
                  # Default to every column without overwriting the constructor argument.
                  cols = list(X.columns) if self.cols is None else self.cols
                  for col in cols:
                      if X[col].dtype == np.dtype('O'):
                          # Qualitative column: fill with the most frequent value.
                          X[col] = X[col].fillna(X[col].value_counts().index[0])
                      else:
                          # Quantitative column: fill using the chosen strategy.
                          X[col] = impute.fit_transform(X[[col]])
                  return X

              def fit(self, *_):
                  # Nothing is learned here; the statistics are computed in transform().
                  return self
    
  • Dataframe:

          import pandas as pd

          X = pd.DataFrame({
              'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
              'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
              'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                                 'somewhat like', 'dislike'],
              'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})

                      city boolean ordinal_column  quantitative_column
          0          tokyo     yes  somewhat like                  1.0
          1            NaN      no           like                 11.0
          2         london     NaN  somewhat like                 -0.5
          3        seattle      no           like                 10.0
          4  san francisco      no  somewhat like                  NaN
          5          tokyo     yes        dislike                 20.0
    
  • Can be used with a list of features of a similar type:

          # the default strategy='mean' is unused here: both columns are qualitative
          cci = CustomImputer(cols=['city', 'boolean'])
          cci.fit_transform(X)

  • Can be used with strategy='median':

          sd = CustomImputer(['quantitative_column'], strategy='median')
          sd.fit_transform(X)

  • Can be used with the whole data frame. Quantitative features use the default mean (which can be changed to median), and qualitative features use the most frequent value:

          call = CustomImputer()
          call.fit_transform(X)

Solution 5 - Python

Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):
    """Impute missing values in a pandas Series.

    If the Series is of dtype object, impute with the most frequent object.
    Otherwise, impute with the mean.
    """

    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

To use it you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series

Solution 6 - Python

Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.

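import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        # Keep the constructor argument under its own name so get_params() works.
        self.filler = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            # A DataFrame has one dtype per column; a Series has a single dtype.
            dtypes = X.dtypes if isinstance(X, pd.DataFrame) else [X.dtype]
            if not all(is_numeric_dtype(dtype) for dtype in dtypes):
                raise ValueError('dtypes mismatch: a numeric dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill_ = X.mean()
        elif self.strategy == 'median':
            self.fill_ = X.median()
        elif self.strategy == 'mode':
            self.fill_ = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if isinstance(self.filler, list) and isinstance(X, pd.DataFrame):
                # One filler per column, matched by position.
                self.fill_ = dict(zip(X.columns, self.filler))
            else:
                self.fill_ = self.filler
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill_)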

Usage:

>> df
     MasVnrArea FireplaceQu
Id
1         196.0         NaN
974       196.0         NaN
21        380.0          Gd
5         350.0          TA
651         NaN          Gd


>> CustomImputer(strategy='mode').fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          Gd
974       196.0          Gd
21        380.0          Gd
5         350.0          TA
651       196.0          Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          NA
974       196.0          NA
21        380.0          Gd
5         350.0          TA
651         0.0          Gd

Solution 7 - Python

This code fills in a series with the most frequent category:

import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Outputs:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object

Solution 8 - Python

Using sklearn.impute.SimpleImputer instead of Imputer resolves this easily, since it can handle categorical variables.

As per the Sklearn documentation: If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

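from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
# The imputer must be fitted before transforming; fit_transform does both.
# The result is a 2-D array, so ravel() flattens it back into a single column.
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()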
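
As a rough sketch of how fitting and transforming can be separated, the example below learns the most frequent value from one frame and reuses it on another. The Outlet_Size values and the names train and new are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data with two missing entries
train = pd.DataFrame({
    "Outlet_Size": pd.Series(
        ["Small", "Medium", np.nan, "Medium", np.nan], dtype=object
    )
})

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train[["Outlet_Size"]])  # learns the most frequent value

# statistics_ holds the learned fill value for each column
print(imputer.statistics_)

# Fill the training column; ravel() turns the (n, 1) result into 1-D
train["Outlet_Size"] = imputer.transform(train[["Outlet_Size"]]).ravel()

# The same fitted imputer can fill previously unseen rows
new = pd.DataFrame({"Outlet_Size": pd.Series([np.nan, "High"], dtype=object)})
new_filled = imputer.transform(new[["Outlet_Size"]]).ravel()
print(list(new_filled))
```

Fitting on the training data only, then calling transform on new data, keeps the fill value consistent between the two.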

Solution 9 - Python

MissForest can be used to impute missing values in a categorical variable using the other categorical features. It works iteratively, similar to IterativeImputer, with a random forest as the base model.

The following code label-encodes the features along with the target variable, fits the model to impute the NaN values, and then decodes the features back:

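import sys

import pandas as pd
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder

# Register the module under the name sklearn.neighbors.base before importing missingpy
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest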

def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders

# list the variable to be imputed together with the features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest 
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))

Solution 10 - Python

A similar approach: subclass Imputer and override it for strategy='most_frequent':

import pandas as pd
from sklearn.preprocessing import Imputer

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)
        
    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)
        
    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills missing values with these. Other strategy values are still handled the same way by Imputer.
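
To make the mode() and fillna() steps concrete, here is a small standalone sketch with invented data. One point worth flagging: when a column has tied modes, DataFrame.mode() returns more than one row, so this sketch takes the first row with .iloc[0] instead of relying on squeeze().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "blue", np.nan],
    "shape": ["disc", "cube", np.nan, np.nan],  # 'cube' and 'disc' tie
})

# One column per input column; extra rows appear when modes are tied
modes = df.mode(axis=0)
print(modes)

# Take the first mode of every column as its fill value
fills = modes.iloc[0]

# fillna matches the Series index against the column labels
filled = df.fillna(fills)
print(filled)
```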

Solution 11 - Python

You could try the following:

# idxmax() returns the most frequent value itself (the index label)
replace = df['<yourcolumn>'].value_counts().idxmax()

df['<yourcolumn>'] = df['<yourcolumn>'].fillna(replace)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author   Original Content on Stackoverflow
Question              night_bat         View Question on Stackoverflow
Solution 1 - Python   sveitser          View Answer on Stackoverflow
Solution 2 - Python   Austin            View Answer on Stackoverflow
Solution 3 - Python   prashanth         View Answer on Stackoverflow
Solution 4 - Python   Piyush            View Answer on Stackoverflow
Solution 5 - Python   user1367204       View Answer on Stackoverflow
Solution 6 - Python   Gautham Kumaran   View Answer on Stackoverflow
Solution 7 - Python   scottlittle       View Answer on Stackoverflow
Solution 8 - Python   Digvijay          View Answer on Stackoverflow
Solution 9 - Python   shivam nijhawan   View Answer on Stackoverflow
Solution 10 - Python  qAp               View Answer on Stackoverflow
Solution 11 - Python  sunnyspain1       View Answer on Stackoverflow