fit_transform() takes 2 positional arguments but 3 were given with LabelBinarizer
Problem Overview
I am totally new to machine learning and have been working with an unsupervised learning technique.
The screenshot shows my sample data (after all cleaning). [Screenshot: Sample Data]
I built these two pipelines to clean the data:
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
print(type(num_attribs))
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer())
])
Then I combined these two pipelines with a FeatureUnion; the code is shown below:
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
Now I am trying to call fit_transform on the data, but it raises an error.
Code for the transformation:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Error message:
> fit_transform() takes 2 positional arguments but 3 were given
Python Solutions
Solution 1 - Python
The Problem:
The pipeline assumes that LabelBinarizer's fit_transform method is defined to take three positional arguments:
def fit_transform(self, x, y):
    ...rest of the code
while it is defined to take only two:
def fit_transform(self, x):
    ...rest of the code
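The mismatch described above can be reproduced without scikit-learn. The following is a minimal sketch in plain Python; ToyLabelBinarizer and ToyPipeline are hypothetical stand-ins invented for illustration, not scikit-learn classes, and they only mimic how the arguments are passed along.

```python
class ToyLabelBinarizer:
    """Hypothetical stand-in: fit_transform accepts a single array, like a label-only transformer."""

    def fit_transform(self, y):
        return sorted(set(y))


class ToyPipeline:
    """Hypothetical stand-in for a pipeline that always forwards both X and y to its step."""

    def __init__(self, step):
        self.step = step

    def fit_transform(self, X, y=None):
        # X and y plus the implicit `self` add up to three positional arguments.
        return self.step.fit_transform(X, y)


def try_fit_transform(pipeline, X):
    """Return the TypeError message if the call fails, otherwise None."""
    try:
        pipeline.fit_transform(X)
    except TypeError as exc:
        return str(exc)
    return None


message = try_fit_transform(ToyPipeline(ToyLabelBinarizer()), ["INLAND", "NEAR BAY"])
print(message)
```

The printed message ends with "takes 2 positional arguments but 3 were given", which is the same complaint the real pipeline raises.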
Possible Solution:
This can be solved by making a custom transformer that can handle 3 positional arguments:
- Import and make a new class:
from sklearn.base import TransformerMixin  # gives fit_transform method for free
from sklearn.preprocessing import LabelBinarizer

class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)

    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self

    def transform(self, x, y=0):
        return self.encoder.transform(x)
- Keep your code the same, but instead of LabelBinarizer() use the class we created: MyLabelBinarizer().
Note: If you want access to LabelBinarizer attributes (e.g. classes_), add the following line to the fit method:
self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_
Solution 2 - Python
I believe your example is from the book Hands-On Machine Learning with Scikit-Learn & TensorFlow. Unfortunately, I ran into this problem as well. A recent change in scikit-learn (0.19.0) changed LabelBinarizer's fit_transform method. Unfortunately, LabelBinarizer was never intended to work how that example uses it. You can see information about the change here and here.
Until they come up with a solution for this, you can install the previous version (0.18.0) as follows:
$ pip install scikit-learn==0.18.0
After running that, your code should run without issue.
In the future, it looks like the correct solution may be to use a CategoricalEncoder class or something similar to that. They have apparently been trying to solve this problem for years. You can see the new class here and further discussion of the problem here.
Solution 3 - Python
I think you are going through the examples from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. I ran into the same problem when going through the example in Chapter 2.
As mentioned by other people, the problem has to do with sklearn's LabelBinarizer. Its fit_transform method takes fewer arguments than those of the other transformers in the pipeline (only y, whereas other transformers normally take both X and y; see here for details). That's why, when we run pipeline.fit_transform, we feed more arguments into this transformer than it accepts.
An easy fix I used is to use OneHotEncoder and set "sparse" to False so that the output is a NumPy array, the same as the num_pipeline output. (This way you don't need to code up your own custom encoder.) Note that in scikit-learn 1.2 and later this parameter is named sparse_output.
Your original cat_pipeline:
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer())
])
You can simply change this part to:
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('one_hot_encoder', OneHotEncoder(sparse=False))
])
You can go from here and everything should work.
Solution 4 - Python
Since LabelBinarizer's fit_transform doesn't accept more than 2 positional arguments, you should create a custom binarizer like this:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # The encoder is created and fitted on every call to transform.
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(new_housing)
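One thing to watch with a transformer that fits its encoder inside transform, as CustomLabelBinarizer does: the categories are re-learned from whatever data is passed in, so a test set that lacks some category can come out with fewer columns than the training set. The sketch below illustrates this in plain Python; one_hot is a hypothetical helper written for this example and the data is made up, so treat it as an illustration of the concern rather than of scikit-learn's behaviour in detail.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 row with one column per category."""
    return [[int(value == category) for category in categories] for value in values]


train = ["INLAND", "NEAR BAY", "ISLAND"]
test = ["INLAND", "ISLAND"]  # "NEAR BAY" never appears here

train_categories = sorted(set(train))

# Re-learning the categories from the test data drops the missing column.
refit_on_test = one_hot(test, sorted(set(test)))

# Reusing the categories learned during fit keeps the column layout stable.
reuse_train = one_hot(test, train_categories)

print(len(refit_on_test[0]), len(reuse_train[0]))  # prints: 2 3
```

Learning the categories in fit and only applying them in transform avoids the mismatch.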
Solution 5 - Python
I ran into the same problem and got it working by applying the workaround specified in the book's GitHub repo.
> Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).
To save you some grepping, here's the workaround; just paste and run it in a previous cell:
# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, do not try to understand it (yet).
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            # Report the offending encoding (the copied code reported handle_unknown here).
            raise ValueError(template % self.encoding)
        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)
        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")
        # The np.object, np.int and np.bool aliases were removed in NumPy 1.24;
        # the builtins object, int and bool are used instead.
        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape
        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))
        self.categories_ = [le.classes_ for le in self._label_encoders_]
        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)
        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])
            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])
        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)
        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)
        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]
        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
Solution 6 - Python
Simply define the following class just before your pipeline:
class NewLabelBinarizer(LabelBinarizer):
    def fit(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X)

    def transform(self, X, y=None):
        return super(NewLabelBinarizer, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X).transform(X)
Then the rest of the code is the same as in the book, with a tiny modification to cat_pipeline before the pipeline concatenation, as follows:
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("label_binarizer", NewLabelBinarizer())])
You're done!
Solution 7 - Python
Forget LabelBinarizer and use OneHotEncoder instead.
If you were using a LabelEncoder before OneHotEncoder to convert categories to integers, you can now use OneHotEncoder directly.
Solution 8 - Python
I faced the same issue. The following link helped me fix it: https://github.com/ageron/handson-ml/issues/75
Summary of the changes to be made:
- Define the following class in your notebook:
class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(SupervisionFriendlyLabelBinarizer, self).fit_transform(X)
- Modify the following piece of code:
cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
                         ('label_binarizer', SupervisionFriendlyLabelBinarizer()),])
- Re-run the notebook. It should run now.
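The subclassing pattern in these steps can be sketched without scikit-learn. In the plain-Python example below, LabelOnlyBinarizer and PipelineFriendlyBinarizer are hypothetical stand-ins written for illustration; the toy encoder is only a rough imitation of one-hot output and is not meant to match LabelBinarizer exactly.

```python
class LabelOnlyBinarizer:
    """Hypothetical stand-in whose fit_transform accepts a single array."""

    def fit_transform(self, y):
        self.classes_ = sorted(set(y))
        return [[int(value == c) for c in self.classes_] for value in y]


class PipelineFriendlyBinarizer(LabelOnlyBinarizer):
    """Accept (X, y) the way a pipeline passes them, and ignore y."""

    def fit_transform(self, X, y=None):
        return super().fit_transform(X)


# The extra second argument no longer raises a TypeError.
encoded = PipelineFriendlyBinarizer().fit_transform(["A", "B", "C", "A"], None)
print(encoded)
```

Only the method signature changes; the encoding itself is still done by the parent class.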
Solution 9 - Python
I had the same issue and resolved it by using DataFrameMapper (you need to install sklearn_pandas):
from sklearn_pandas import DataFrameMapper
cat_pipeline = Pipeline([
    ('label_binarizer', DataFrameMapper([(cat_attribs, LabelBinarizer())])),
])
Solution 10 - Python
You can create one more custom transformer that does the encoding for you.
class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return LabelEncoder().fit_transform(X)
In this example we have used LabelEncoder, but you can use LabelBinarizer as well.
Solution 11 - Python
The LabelBinarizer class is outdated for this example, and unfortunately was never meant to be used in the way that the book uses it.
You'll want to use the OrdinalEncoder class from sklearn.preprocessing, which is designed to
> "Encode categorical features as an integer array." (sklearn documentation).
So, just add:
from sklearn.preprocessing import OrdinalEncoder
then replace all mentions of LabelBinarizer() with OrdinalEncoder() in your code.
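Keep in mind that an ordinal encoding is not the same output as a one-hot encoding: the former gives one integer column per feature, the latter one 0/1 column per category. The plain-Python sketch below shows the difference on made-up data; ordinal_encode and one_hot_encode are hypothetical helpers for illustration, not scikit-learn's implementations.

```python
def ordinal_encode(values):
    """Map each category to its index in the sorted list of categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values]


def one_hot_encode(values):
    """Map each value to a 0/1 row with one column per category."""
    categories = sorted(set(values))
    return [[int(value == category) for category in categories] for value in values]


sample = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]
ordinal = ordinal_encode(sample)
one_hot = one_hot_encode(sample)
print(ordinal)
print(one_hot)
```

The integer codes imply an order between categories, which may or may not suit the model that consumes them.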
Solution 12 - Python
I ended up rolling my own:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LabelBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = self.prep(X)
        unique_vals = []
        for column in X.T:
            unique_vals.append(np.unique(column))
        self.unique_vals = unique_vals
        return self  # return self so that fit(...).transform(...) can be chained

    def transform(self, X, y=None):
        X = self.prep(X)
        unique_vals = self.unique_vals
        new_columns = []
        for i, column in enumerate(X.T):
            num_uniq_vals = len(unique_vals[i])
            encoder_ring = dict(zip(unique_vals[i], range(len(unique_vals[i]))))
            f = lambda val: encoder_ring[val]
            # np.int was removed in NumPy 1.24; the builtin int is used instead.
            f = np.vectorize(f, otypes=[int])
            new_column = np.array([f(column)])
            if num_uniq_vals <= 2:
                # Binary features stay as a single 0/1 column.
                new_columns.append(new_column)
            else:
                one_hots = np.zeros([num_uniq_vals, len(column)], int)
                one_hots[new_column, range(len(column))] = 1
                new_columns.append(one_hots)
        new_columns = np.concatenate(new_columns, axis=0).T
        return new_columns

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)

    @staticmethod
    def prep(X):
        # Reshape a 1-D pandas Series into a single-column 2-D array.
        shape = X.shape
        if len(shape) == 1:
            X = X.values.reshape(shape[0], 1)
        return X
It seems to work:
lbn = LabelBinarizer()
thingy = np.array([['male','male','female', 'male'], ['A', 'B', 'A', 'C']]).T
lbn.fit(thingy)
lbn.transform(thingy)
returns
array([[1, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1]])
Solution 13 - Python
The easiest way is to replace LabelBinarizer() inside your pipeline with OrdinalEncoder().
Solution 14 - Python
I've seen many custom label binarizers, but there is one from this repo that worked for me.
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """This allows us to fit the model based on the X input."""
        # Return the fitted estimator so that fit(...) can be chained.
        return super(LabelBinarizerPipelineFriendly, self).fit(X)

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
Then edit the cat_pipeline to this:
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizerPipelineFriendly()),
])
Have a good one!
Solution 15 - Python
To perform one-hot encoding for multiple categorical features, we can create a new class that implements our own binarizer for multiple categorical features and plug it into the categorical pipeline as follows.
Suppose CAT_FEATURES = ['cat_feature1', 'cat_feature2'] is a list of categorical features. The following script should resolve the issue and produce what we want.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    """Perform one-hot encoding to categorical features."""
    def __init__(self, cat_features):
        self.cat_features = cat_features

    def fit(self, X_cat, y=None):
        return self

    def transform(self, X_cat):
        X_cat_df = pd.DataFrame(X_cat, columns=self.cat_features)
        X_onehot_df = pd.get_dummies(X_cat_df, columns=self.cat_features)
        return X_onehot_df.values

# Pipeline for categorical features.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(CAT_FEATURES)),
    ('onehot_encoder', CustomLabelBinarizer(CAT_FEATURES))
])
Solution 16 - Python
We can just add the attribute sparse_output=False:
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer(sparse_output=False)),
])