sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')


Python Problem Overview


I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have run

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

I tried using

mat[np.isfinite(mat) == True] = 0

to remove the infinite values but this did not work either. What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?

I am using anaconda and python 2.7.9.

Python Solutions


Solution 1 - Python

This might happen inside scikit-learn, depending on what you're doing. I recommend reading the documentation for the functions you're using. You might be calling one that requires, for example, a positive definite matrix, and your matrix may not meet that criterion.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. Right would be:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements is NaN, not whether the return value of the any function is NaN.
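
To make the difference concrete, here is a minimal sketch using a small made-up matrix (the values are an assumption for illustration, not the asker's data). It shows that the original checks report nothing wrong even when the matrix contains both NaN and infinity:

```python
import numpy as np

# Hypothetical 2x2 matrix containing one NaN and one infinity.
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# Wrong: mat.any() / mat.all() collapse the matrix to a single boolean first,
# so isnan / isfinite only ever inspect True or False.
wrong_nan = np.isnan(mat.any())
wrong_finite = np.isfinite(mat.all())

# Right: test every element, then reduce the element-wise result.
has_nan = np.any(np.isnan(mat))
all_finite = np.all(np.isfinite(mat))

print(wrong_nan, wrong_finite)
print(has_nan, all_finite)
```

The first pair of checks looks clean, while the second pair reveals the bad values.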

Solution 2 - Python

I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']

Solution 3 - Python

This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

import pandas as pd
import numpy as np

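def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop rows with missing cells (note: this modifies df in place).
    df.dropna(inplace=True)
    # Keep only rows that contain no NaN or +/-inf in any column.
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)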

Solution 4 - Python

In most cases, getting rid of infinite and null values solves this problem.

First, replace the infinite values with NaN:

df.replace([np.inf, -np.inf], np.nan, inplace=True)

Then fill the null values however you like: with a specific value such as 999, with the mean, or with your own imputation function:

df.fillna(999, inplace=True)
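
As a variation on the steps above, here is a minimal sketch of the "mean" option instead of a sentinel like 999. The DataFrame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one infinity and one missing value.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [np.nan, 2.0, 4.0]})

# Step 1: turn infinities into NaN so they are treated as missing.
df = df.replace([np.inf, -np.inf], np.nan)

# Step 2: fill each missing cell with the mean of its column
# (DataFrame.mean skips NaN by default).
df = df.fillna(df.mean())

print(df)
```

Whether the mean is an appropriate fill value depends on your data; this only shows the mechanics.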

Solution 5 - Python

This is the check on which it fails, in scikit-learn's validation code:

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure your input contains no NaN values, that all of the values are actually floats, and that none of them is Inf.

Solution 6 - Python

The dimensions of my input array were skewed because my input CSV had empty spaces.

Solution 7 - Python

With this version of python 3:

/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)

Looking at the details of the error, I found the lines of code causing the failure:

/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59 
     60 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

From this, I was able to work out the correct way to test what was going on with my data, using the same test that the error message points to: np.isfinite(X)

Then, with a quick and dirty loop, I was able to find that my data does indeed contain NaNs:

print(p[:, 0].shape)
# Print the position and value of every non-finite entry in the first column.
for index, value in enumerate(p[:, 0]):
    if not np.isfinite(value):
        print(index, value)

(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...

Now all I have to do is remove the values at these indexes.
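
For that last step, a boolean mask avoids the explicit loop. This is a minimal sketch with a small made-up array standing in for p (the real one has 367340 rows):

```python
import numpy as np

# Hypothetical stand-in for p: the first column contains a NaN and an infinity.
p = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0],
              [np.inf, 40.0]])

# Positions of the non-finite entries in the first column.
bad_rows = np.flatnonzero(~np.isfinite(p[:, 0]))

# Keep only the rows whose first column is finite.
p_clean = p[np.isfinite(p[:, 0])]

print(bad_rows)
print(p_clean.shape)
```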

Solution 8 - Python

None of the answers here worked for me. This was what worked.

Test_y = np.nan_to_num(Test_y)

It replaces infinite values with very large finite values and NaN values with zero.
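
Here is a small sketch of what that looks like on a made-up array. The second call assumes NumPy 1.17 or newer, where nan_to_num accepts the nan, posinf and neginf keyword arguments; the replacement values 1e6 and -1e6 are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical target vector with NaN and both infinities.
Test_y = np.array([1.0, np.nan, np.inf, -np.inf])

# Default behaviour: NaN -> 0.0, +inf -> largest float64, -inf -> most negative float64.
default = np.nan_to_num(Test_y)

# Explicit replacements (NumPy >= 1.17), if the huge defaults are undesirable.
custom = np.nan_to_num(Test_y, nan=0.0, posinf=1e6, neginf=-1e6)

print(default)
print(custom)
```

Keep in mind that the default replacement for infinity is close to the float64 limit, which may itself cause trouble in later arithmetic.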

Solution 9 - Python

I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:

# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype.
X = X.values.astype(np.float64)
y = y.values.astype(np.float64)

Edit: The originally suggested X.as_matrix() is deprecated.

Solution 10 - Python

I had the error after trying to select a subset of rows:

df = df.reindex(index=my_index)

Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
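
A minimal sketch of this pitfall, with made-up labels (the DataFrame and my_index here are assumptions for illustration). It also shows one possible way to select only the labels that really exist:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# "z" is not in df.index, so reindex creates a new row filled with NaN.
my_index = ["a", "c", "z"]
reindexed = df.reindex(index=my_index)

# One way around it: keep only the labels that exist in df.
valid = df.index.intersection(my_index)
subset = df.loc[valid]

print(reindexed)
print(subset)
```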

Solution 11 - Python

Remove all infinite values, replacing each one with the min or max for its column:

import numpy as np

# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[       inf,        inf,        inf,        inf,        inf],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [      -inf,       -inf,       -inf,       -inf,       -inf],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]

# go through matrix one column at a time and replace  + and -infinity 
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]

>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

Solution 12 - Python

The problem seems to occur in the DecisionTreeClassifier input check. Try:

X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)

Solution 13 - Python

I got the same error. It worked after calling df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.

Solution 14 - Python

I would like to propose a solution for numpy that worked well for me. The lines

import numpy as np
inputArray[inputArray == np.inf] = np.finfo(np.float64).max

substitute all positive infinite values of a numpy array with the maximum float64 number.

Solution 15 - Python

I found that after calling pct_change to create a new column, NaN existed in one of the rows. I removed the NaN row with the following code:

df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
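
To show where the bad values come from, here is a minimal sketch with a made-up price column. The first row has no previous value (NaN), and a change from 0 divides by zero (inf). Note that it uses reset_index(drop=True) so the old index is not kept as an extra column, which is a small deviation from the snippet above:

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the leading 0.0 triggers a division by zero.
df = pd.DataFrame({"price": [0.0, 1.0, 2.0, 3.0]})

# pct_change yields NaN for the first row and inf for the change from 0.0.
df["change"] = df["price"].pct_change()
n_bad = int((~np.isfinite(df["change"])).sum())

# Same cleanup as above: inf -> NaN, drop those rows, renumber the index.
df = df.replace([np.inf, -np.inf], np.nan).dropna().reset_index(drop=True)

print(n_bad)
print(df)
```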

Solution 16 - Python

In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
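
A minimal sketch of that mismatch, under the assumption that a plain numpy array stands in for the output of a scikit-learn transform (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
df = df[df["x"] > 2]  # the index is now [2, 3], not [0, 1]

# Stand-in for a scikit-learn result: a bare numpy array with no index.
scaled = df[["x"]].to_numpy() * 10

# Wrapping it without an index gives it a fresh index [0, 1] ...
misaligned = pd.DataFrame(scaled, columns=["x_scaled"])
# ... so combining with the original aligns on labels and fills gaps with NaN.
bad = pd.concat([df, misaligned], axis=1)

# Reusing the original index keeps the rows aligned.
aligned = pd.DataFrame(scaled, columns=["x_scaled"], index=df.index)
good = pd.concat([df, aligned], axis=1)

print(bad)
print(good)
```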

Solution 17 - Python

dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

This worked for me

Solution 18 - Python

If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.

Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
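
To illustrate how valid inputs can still blow up during training, here is a toy sketch: plain gradient descent on f(w) = w**2. This is not scikit-learn code, and the function name and step count are made up; it only shows that a learning rate of 500 drives an intermediate value to infinity and then NaN:

```python
import math

def gradient_descent(learning_rate, steps=200):
    """Minimise f(w) = w**2 starting from w = 1.0 (toy example)."""
    w = 1.0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # standard update rule
    return w

w_ok = gradient_descent(0.1)     # shrinks towards 0
w_bad = gradient_descent(500.0)  # grows until it overflows

print(w_ok, math.isfinite(w_ok))
print(w_bad, math.isfinite(w_bad))
```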

Solution 19 - Python

I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
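
A minimal sketch of how an empty CSV cell turns into NaN when the file is read with pandas. The CSV text is made up, and filling with 0 is just one possible default:

```python
import io
import pandas as pd

# Hypothetical CSV where the second data row has an empty cell in column b.
csv_text = "a,b,c\n1,2,3\n4,,6\n"

df = pd.read_csv(io.StringIO(csv_text))
n_missing = int(df.isna().sum().sum())  # the empty cell was read as NaN

# Put a default value in place of the missing cell.
filled = df.fillna(0)

print(n_missing)
print(filled)
```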

Solution 20 - Python

Using isneginf may help. http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf

x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with

Solution 21 - Python

Note: This solution only applies if you consciously want to keep NaN entries in your dataset.

This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced the absence of missing data in the input by calling the _assert_all_finite function. This ultimately caused an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Sidenote: _assert_all_finite accepts an allow_nan argument which, if set to True, would avoid the issue. However, the scikit-learn API does not give us control over this argument.

Solution

My solution was to use mock.patch (from unittest.mock) to silence the _assert_all_finite function so that it does not raise ValueError. Here is a snippet:

from unittest import mock
import sklearn

with mock.patch("sklearn.utils.validation._assert_all_finite"):
    # your code that raises ValueError
    pass

This replaces _assert_all_finite with a dummy mock function, so the real check does not get executed.

Please note that patching is not a recommended practice and might result in unpredictable behaviour!


EDIT: This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)

Solution 22 - Python

Puff !! In my case the problem was about NaN values...

You can list your columns that had NaN with this function

your_data.isnull().sum()

and then you can fill these NaN values in your dataset.

Here is the code to "Replace NaN with zero and infinity with large finite numbers":

your_data[:] = np.nan_to_num(your_data)

(The quoted description is from the numpy.nan_to_num documentation.)
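
One caveat worth a sketch: isnull() counts NaN but not infinity, so it can be useful to count both per column. The DataFrame below is made up for illustration, and applying np.isinf this way assumes all columns are numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column a has a NaN, column b has two infinities.
your_data = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                          "b": [np.inf, 2.0, -np.inf]})

nan_counts = your_data.isnull().sum()   # NaN per column (infinities not counted)
inf_counts = np.isinf(your_data).sum()  # +/-inf per column

print(nan_counts)
print(inf_counts)
```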

Solution 23 - Python

After a long time dealing with this problem, I realized that it happens because, in the training and testing splits, some columns have the same value in every row. Some calculations in some algorithms can then produce infinite results. If neighboring rows in your data are likely to be similar, shuffling the data can help. This is a bug in scikit-learn. I'm using version 0.23.2.

Solution 24 - Python

try

mat.sum()

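If the sum of your data is not finite (for float64 the largest value is about 1.797693e+308; 3.402823e+38 is the float32 limit), the fast path in the check below fails, and the slower element-wise test decides whether you get that error.

See the _assert_all_finite function in validation.py from the scikit-learn source code: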

if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))
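
The two conditions in that snippet can be tried by hand. This is a minimal sketch with a made-up array whose elements are all finite but whose sum overflows float64; it mirrors the two tests in the quoted code rather than calling scikit-learn itself:

```python
import numpy as np

# Hypothetical data: each value is finite, but their sum exceeds the float64 range.
X = np.array([1e308, 1e308])

with np.errstate(over="ignore"):  # silence the overflow warning from sum()
    total = X.sum()

sum_is_finite = bool(np.isfinite(total))           # the fast check
elements_are_finite = bool(np.isfinite(X).all())   # the element-wise fallback

print(total, sum_is_finite, elements_are_finite)
print(np.finfo(np.float32).max, np.finfo(np.float64).max)
```

Here the fast check fails while the element-wise test passes, which is the case the fallback in the quoted code is there to handle.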

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ethan Waldie | View Question on Stackoverflow
Solution 1 - Python | Marcus Müller | View Answer on Stackoverflow
Solution 2 - Python | Jun Wang | View Answer on Stackoverflow
Solution 3 - Python | Boern | View Answer on Stackoverflow
Solution 4 - Python | Natheer Alabsi | View Answer on Stackoverflow
Solution 5 - Python | tuxdna | View Answer on Stackoverflow
Solution 6 - Python | Ethan Waldie | View Answer on Stackoverflow
Solution 7 - Python | Raphvanns | View Answer on Stackoverflow
Solution 8 - Python | NetEmmanuel | View Answer on Stackoverflow
Solution 9 - Python | tekumara | View Answer on Stackoverflow
Solution 10 - Python | Elias Strehle | View Answer on Stackoverflow
Solution 11 - Python | Renel Chesak | View Answer on Stackoverflow
Solution 12 - Python | Mayukh Pankaj | View Answer on Stackoverflow
Solution 13 - Python | Cohen | View Answer on Stackoverflow
Solution 14 - Python | Hagbard | View Answer on Stackoverflow
Solution 15 - Python | Golden Lion | View Answer on Stackoverflow
Solution 16 - Python | luca | View Answer on Stackoverflow
Solution 17 - Python | Parthiban | View Answer on Stackoverflow
Solution 18 - Python | Chris Cooper | View Answer on Stackoverflow
Solution 19 - Python | Goel Nimi | View Answer on Stackoverflow
Solution 20 - Python | Joyanta J. Mondal | View Answer on Stackoverflow
Solution 21 - Python | Tomasz Bartkowiak | View Answer on Stackoverflow
Solution 22 - Python | M E S A B O | View Answer on Stackoverflow
Solution 23 - Python | ElhamMotamedi | View Answer on Stackoverflow
Solution 24 - Python | Rick Hill | View Answer on Stackoverflow