Tensorflow NaN bug?


Nan Problem Overview


I'm using TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new image set until suddenly (while still converging, usually at around 92% accuracy) it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding

print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())

as debug code to each loop, yields the following output:

Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values gets very high, the only way a NaN can occur is through a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no explanation other than that this comes from the internal TF code.

I'm clueless about what to do with this. Any suggestions? The algorithm was converging nicely; its accuracy on my validation set was steadily climbing and had just reached 92.5% at iteration 8600.

Nan Solutions


Solution 1 - Nan

Actually, it turned out to be something stupid. I'm posting this in case anyone else runs into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.

Solution 2 - Nan

A bias-free alternative.

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity, not the region near it.

Specific Answer

def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)

But did it work?

x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512,  0.60943794, 0., -0.64332503], dtype=float32)  Yay! No NaN.


General Recipe

Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf-generating function so that no inf can be created. Then use a second tf.where to always select the valid code path, i.e., implement the mathematical condition just as you would in the "naive" implementation.

In Python code, the recipe is:

Instead of this:

tf.where(x_ok, f(x), safe_f(x))

Do this:

safe_x = tf.where(x_ok, x, tf.ones_like(x))  # replace bad inputs with any value at which f is finite
tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you wish to compute:

f(x) = { 1/x, x!=0
       { 0,   x=0

A naive implementation results in NaNs in the gradient, i.e.,

def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))

Does it work?

x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1.,  nan,  -1.], dtype=float32)
#  ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:

def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1.,  0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).

Solution 3 - Nan

Actually, clipping is not a good idea, as it will stop the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))

Solution 4 - Nan

If y_conv is the result of a softmax, say, y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:

y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)

Solution 5 - Nan

You are trying to calculate cross-entropy using the standard formula. Not only is the value undefined when x=0, it is also numerically unstable.

It is better to use tf.nn.softmax_cross_entropy_with_logits, or, if you really want to use a hand-crafted formula, to tf.clip_by_value the zeros to a very small number inside the log.
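
For reference, a minimal sketch of the built-in op, assuming x holds the raw (pre-softmax) logits of the last layer and y_ the one-hot labels (the same names used in the other answers):

# softmax_cross_entropy_with_logits applies a numerically stable log-softmax
# internally, so no clipping or epsilon is needed.
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=x))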

Solution 6 - Nan

Sometimes you use the tf.sqrt() function without adding a small constant like 1e-10 inside it, which induces this NaN problem.
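
For illustration, a minimal sketch (squared_error is a hypothetical non-negative tensor; the gradient of sqrt(x) is 0.5 / sqrt(x), which is infinite at x = 0):

loss = tf.sqrt(squared_error)            # NaN/Inf gradient whenever squared_error hits 0
loss = tf.sqrt(squared_error + 1e-10)    # safe: the derivative stays finite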

Solution 7 - Nan

I used LSTM for long sequences and got NaN gradients. None of these answers helped me, but I came up with three solutions of my own. I hope they will be useful to other people who come here from a Google search.

1. Gradient clipping didn't help me because the gradients turned NaN within a single batch update. In this case, you can replace the NaNs with zeros with lines like these:

    opt = tf.train.AdamOptimizer(args.lr)
    grads = opt.compute_gradients(loss)
    grads2 = [(tf.where(tf.is_nan(grad), tf.zeros(grad.shape), grad), var) for grad, var in grads]
    opt_op = opt.apply_gradients(grads2)
    

If you want to track whether NaNs appeared, you can use this code:

    was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(grad)) for grad, var in grads]))

2. Replace LSTMCell with LayerNormBasicLSTMCell, an LSTM cell with layer normalization (something similar to batch norm between timesteps).

3. If you use regular recurrent state dropout, you can replace it with "Recurrent Dropout without Memory Loss". Code:

    tf.contrib.rnn.LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8)
    

Note that you can also turn on the dropout feature alone without layer normalization:

    tf.contrib.rnn.LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8)

Solution 8 - Nan

Besides all the great answers above, I will add mine. It's a less common scenario to run into, but it does cause NaN: division by zero.

In my network for an NLP task, there is a layer that does average pooling. Each data point is a sequence of tokens. My layer embeds the tokens and then calculates the average of the embedded vectors.

The average calculation is coded as

tf.reduce_sum(embedded) / tf.reduce_sum(tf.cast(tf.not_equal(input, pad), tf.float32))

Here pad is some dummy token I use in batch processing.

Now if some data point contains an empty token list (for whatever reason), its length (the denominator in the code snippet above) is 0. This causes a division-by-zero issue, and the NaN will persist through all the following layers / optimization steps.

In case anyone runs into this issue: I used tf.where to smooth those lengths:

sum_embedding = tf.reduce_sum(embedded, 1)
embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32), axis=1, keep_dims=True)
embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0), embedding_length, tf.ones(tf.shape(embedding_length)))
avg_embedding = sum_embedding / embedding_length_smoothed

Essentially this treats all data points with a 0-length token list as having length 1, which avoids the NaN issue.

Solution 9 - Nan

Here is the implementation of the binary (sigmoid) and categorical (softmax) cross-entropy losses in TensorFlow 1.1:

As one can see, in the binary case they handle some special cases to achieve numerical stability:

# The logistic loss formula from above is
#   x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
#   -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
#   max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(relu_logits - logits * labels,
                    math_ops.log1p(math_ops.exp(neg_abs_logits)),
                    name=name)
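
For reference, a usage sketch of the binary op, assuming x holds the raw logits and z the labels, matching the names in the commented formula above:

# The op consumes raw logits and applies the stable max/log1p formula shown above.
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=z, logits=x))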

Solution 10 - Nan

2.0 Compatible Answer: code to migrate @user1111929's answer from TensorFlow 1.x to TensorFlow 2.x is shown below.

TensorFlow 1.x:

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

TensorFlow 2.x:

cross_entropy = -tf.compat.v2.reduce_sum(y_*tf.math.log(tf.compat.v2.clip_by_value(y_conv,1e-10,1.0)))

or

cross_entropy = -tf.compat.v2.math.reduce_sum(y_*tf.math.log(tf.compat.v1.clip_by_value(y_conv,1e-10,1.0)))

Solution 11 - Nan

I was sometimes getting NaNs and sometimes not while working on a standard feed-forward network. I had previously used similar TensorFlow code and it worked fine.

It turned out that I had accidentally imported the variable names (the header row) as data. So, as soon as the first row (the variable names) was selected in a batch, the NaN losses started. Maybe keep an eye out for that?

Solution 12 - Nan

I will add here one of my previous problems with NaNs. I was using the sigmoid function as the activation of the last layer of my network. However, the sigmoid activation function is computed using the exponential function, and some really big numbers were entering the sigmoid.

This resulted in infinite gradients, and some NaNs started to appear.
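
A minimal sketch of that failure mode (illustrative values; the exact printed output depends on the dtype and TensorFlow version):

import tensorflow as tf

x = tf.constant([-1000., 0., 1000.])   # hypothetical pre-activation values
y = tf.sigmoid(x)                      # saturates to exactly 0. and 1. in float32
log_y = tf.log(y)                      # log(0) = -inf at the saturated end
# ==> roughly [-inf, -0.693, 0.]; the -inf then turns the gradients into NaN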

Solution 13 - Nan

I've been using TensorFlow Estimators, which I believe account for division by zero and other numerical stability issues, and I occasionally get this error (ERROR:tensorflow:Model diverged with loss = NaN during training). Most of the time when I get it, it's because my inputs include NaNs. So: be sure that your input DataFrames (or whatever you use) don't have NaN values hidden somewhere in them.
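
A quick sanity check along those lines (a sketch; df is a hypothetical pandas DataFrame standing in for your real input data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [1.0, 2.0, np.nan]})   # stand-in for the real inputs

# Fail fast if any input value is NaN before it ever reaches the graph.
if df.isnull().values.any():
    raise ValueError("input contains NaN values")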

Solution 14 - Nan

Another option is to use the tf.math.xlogy function. The function's description says "Returns 0 if x == 0, and x * log(y) otherwise, elementwise." You can find the documentation here: https://www.tensorflow.org/api_docs/python/tf/math/xlogy
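
Applied to the loss from the question, that would look roughly like this (same y_ / y_conv names as above):

# xlogy(y_, y_conv) is defined to be 0 wherever y_ == 0, so 0 * log(0) never occurs.
cross_entropy = -tf.reduce_sum(tf.math.xlogy(y_, y_conv))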

Solution 15 - Nan

If y_conv is the output of a sigmoid activation function, there is a better way to calculate tf.log(y_conv).

Let y_conv = sigmoid(x). Then,

log(y_conv) = log(sigmoid(x))
            = log(1 / (1 + exp(-x)))
            = log(1 / (1 + exp(-x))) - x + x
            = -log(1 + exp(-x)) - log(exp(x)) + x
            = -log(1 + exp(x)) + x
            = x - softplus(x)
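
In code, that amounts to something like the following sketch (x being the pre-sigmoid logits; tf.math.log_sigmoid is available in recent TensorFlow versions):

log_y_conv = x - tf.nn.softplus(x)     # log(sigmoid(x)) without ever forming sigmoid(x)
# or, equivalently, as a single op:
log_y_conv = tf.math.log_sigmoid(x)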

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author     | Original Content on Stackoverflow
Question           | user1111929         | View Question on Stackoverflow
Solution 1 - Nan   | user1111929         | View Answer on Stackoverflow
Solution 2 - Nan   | jvdillon            | View Answer on Stackoverflow
Solution 3 - Nan   | Young Geng          | View Answer on Stackoverflow
Solution 4 - Nan   | mathguyjohn         | View Answer on Stackoverflow
Solution 5 - Nan   | Salvador Dali       | View Answer on Stackoverflow
Solution 6 - Nan   | jmir                | View Answer on Stackoverflow
Solution 7 - Nan   | alyaxey             | View Answer on Stackoverflow
Solution 8 - Nan   | Camuslu             | View Answer on Stackoverflow
Solution 9 - Nan   | Lenar Hoyt          | View Answer on Stackoverflow
Solution 10 - Nan  | Tensorflow Support  | View Answer on Stackoverflow
Solution 11 - Nan  | tf.nn.michael       | View Answer on Stackoverflow
Solution 12 - Nan  | Joseph Budin        | View Answer on Stackoverflow
Solution 13 - Nan  | rodrigo-silveira    | View Answer on Stackoverflow
Solution 14 - Nan  | mirkhosro           | View Answer on Stackoverflow
Solution 15 - Nan  | toliveira           | View Answer on Stackoverflow