TensorFlow NaN bug?
Problem Overview
I'm using TensorFlow and I modified the tutorial example to take my RGB images.
The algorithm works flawlessly out of the box on the new image set until, while still converging (usually at around 92% accuracy), it suddenly crashes with an error saying that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, for no apparent reason, the error is thrown. Adding
print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())
as debug code to each loop, yields the following output:
Step 8600
max W vales: 0.759422 0.295087 0.344725 0.583884
max b vales: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W vales: 0.75947 0.295084 0.344723 0.583893
max b vales: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W vales: 0.759521 0.295101 0.34472 0.5839
max b vales: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W vales: 3.40282e+38 3.40282e+38 3.40282e+38 3.40282e+38
max b vales: 3.40282e+38 3.40282e+38 3.40282e+38 3.40282e+38
Since none of my values are very large, the only way a NaN can arise is from a badly handled 0/0. But since this tutorial code doesn't do any divisions or similar operations, I see no explanation other than that this comes from internal TF code.
I'm at a loss as to what to do with this. Any suggestions? The algorithm is converging nicely; its accuracy on my validation set was steadily climbing and had just reached 92.5% at iteration 8600.
Solutions
Solution 1
Actually, it turned out to be something stupid. I'm posting this in case anyone else would run into a similar error.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
is actually a horrible way of computing the cross-entropy. In some samples, certain classes can be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved all my problems.
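For anyone who wants to see the failure in isolation, here is a minimal plain-Python sketch (not TensorFlow) of why a single 0*log(0) term poisons the whole sum, and how clamping the probability avoids it. The helper names ieee_log, naive_cross_entropy and clipped_cross_entropy are made up for this illustration; ieee_log only imitates the IEEE-754 result log(0) = -inf that array libraries produce.

```python
import math

def ieee_log(p):
    # math.log(0.0) raises ValueError, so imitate the IEEE-754 result (-inf)
    # that array libraries return for log(0). Hypothetical helper.
    return math.log(p) if p > 0.0 else float("-inf")

def naive_cross_entropy(labels, probs):
    # -sum(y * log(p)); a term with y == 0 and p == 0 becomes 0 * -inf = NaN.
    return -sum(y * ieee_log(p) for y, p in zip(labels, probs))

def clipped_cross_entropy(labels, probs, eps=1e-10):
    # Clamp each probability into [eps, 1.0] before taking the log.
    return -sum(y * math.log(min(max(p, eps), 1.0))
                for y, p in zip(labels, probs))

labels = [0.0, 1.0, 0.0]
probs = [0.0, 0.9, 0.1]   # the first class has been ruled out with certainty

naive = naive_cross_entropy(labels, probs)
clipped = clipped_cross_entropy(labels, probs)
print(math.isnan(naive))  # True
print(clipped)            # finite, equal to -log(0.9)
```

Note that the NaN comes from a class whose label is 0, that is, from a term that should not contribute to the loss at all.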
Solution 2
A bias-free alternative.
Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity, not the region near it.
Specific Answer
def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)
But did it work?
x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([1.30258512, 0.60943794, 0., -0.64332503], dtype=float32) Yay! No NaN.
(Note: deleted duplicate cross-post.)
General Recipe
Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf-generating function such that no inf can be created.
Then use a second tf.where to always select the valid code path. That is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.
In Python code, the recipe is:
Instead of this:
tf.where(x_ok, f(x), safe_f(x))
Do this:
safe_x = tf.where(x_ok, x, safe_x)
tf.where(x_ok, f(safe_x), safe_f(x))
Example
Suppose you wish to compute:
f(x) = { 1/x, x!=0
{ 0, x=0
A naive implementation results in NaNs in the gradient, i.e.,
def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))
Does it work?
x = tf.constant([1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1., nan, -1.], dtype=float32)
# ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.
The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:
def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))
But did it work?
x = tf.constant([1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1., 0., -1.], dtype=float32)
# ...yay! The double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).
Solution 3
Actually, clipping is not a good idea, as it stops the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
Solution 4
If y_conv is the result of a softmax, say, y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:
y = tf.nn.log_softmax(x)
cross_entropy = -tf.reduce_sum(y_*y)
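To see why working in log space helps, here is a minimal plain-Python sketch of a log-softmax computed with the usual log-sum-exp shift. It is an illustration only, not TensorFlow's implementation; the function name and the sample logits are assumptions for the example. The log is never applied to a probability that has already underflowed to zero, so the loss stays finite even for extreme logits.

```python
import math

def log_softmax(logits):
    # Subtract the max so the largest exponent is exp(0) = 1 (no overflow),
    # then log_softmax(x)_i = x_i - max - log(sum_j exp(x_j - max)).
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_sum for x in logits]

logits = [1000.0, 0.0, -1000.0]
labels = [0.0, 0.0, 1.0]   # the "impossible" class happens to be the true one

log_probs = log_softmax(logits)
cross_entropy = -sum(y * lp for y, lp in zip(labels, log_probs))
print(log_probs)       # [0.0, -1000.0, -2000.0]
print(cross_entropy)   # 2000.0
```

With a plain softmax followed by a log, the probability of the last class would underflow to 0 and its log would be -inf; here the loss is large but finite.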
Solution 5
You are trying to calculate cross-entropy using the standard formula. Not only is the value undefined when x=0, it is also numerically unstable.
It is better to use tf.nn.softmax_cross_entropy_with_logits or, if you really want to use a handcrafted formula, to tf.clip_by_value the zeros to a very small number inside the log.
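Solution 6
Sometimes you use the tf.sqrt() function without adding a small constant such as 1e-10 inside it, which induces this nan problem. The gradient of sqrt is infinite at 0, so a zero input can turn into a NaN during backpropagation.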
Solution 7
I used an LSTM for long sequences and got nan gradients. None of these answers helped me, but I came up with three solutions of my own. I hope they will be useful for other people who come here from a Google search.

1. Gradient clipping didn't help me because the gradients turned nan within a single batch update. In this case, you can replace nans with zeros with lines like these:
opt = tf.train.AdamOptimizer(args.lr)
grads = opt.compute_gradients(loss)
grads2 = [(tf.where(tf.is_nan(grad), tf.zeros(grad.shape), grad), var) for grad, var in grads]
opt_op = opt.apply_gradients(grads2)
If you want to track whether nans appeared, you can use this code:
was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(g)) for g in grads]))

2. Replace LSTMCell with LayerNormBasicLSTMCell, an LSTM cell with layer norm (something similar to batch norm between timesteps).

3. If you use regular recurrent state dropout, you can replace it with "Recurrent Dropout without Memory Loss". Code:
LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8)
Note that you can also turn on the dropout feature alone, without layer normalization:
LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8)
Solution 8
Besides all the great answers above, I will add mine. It's a less common scenario, but it does cause NaN: division by zero.
In my network for an NLP task, there is a layer that does average pooling. Each data item is a sequence of tokens; my layer embeds the tokens and then calculates the average of the embedded vectors.
The average calculation is coded as
tf.reduce_sum(embedded)/tf.reduce_sum(tf.not_equal(input, pad))
Here pad is a dummy token I use in batch processing.
Now if some data item contains an empty token list (for whatever reason), its length (the denominator in the code snippet above) is 0. This causes a division by zero, and the NaN then persists through all the following layers and optimization steps.
In case anyone runs into this issue, I used tf.where to smooth those lengths:
sum_embedding = tf.reduce_sum(embedded, 1)
embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32), axis=1, keep_dims=True)
embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0), embedding_length, tf.ones(tf.shape(embedding_length)))
avg_embedding = sum_embedding / embedding_length_smoothed
Essentially this treats every data item with a zero-length token list as having length 1, which avoids the NaN issue.
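The same guard can be sketched for a single sequence in plain Python. This is only an illustration of the idea, not the TensorFlow code above: the function name masked_average, the choice of pad=0 and the sample vectors are all assumptions.

```python
def masked_average(embedded, tokens, pad=0):
    """Average the embedding vectors of the non-pad tokens of one sequence."""
    dim = len(embedded[0]) if embedded else 0
    total = [0.0] * dim
    length = 0
    for vector, token in zip(embedded, tokens):
        if token != pad:
            total = [t + v for t, v in zip(total, vector)]
            length += 1
    # Treat an all-pad (empty) sequence as having length 1, so the result is
    # a zero vector instead of 0/0.
    safe_length = length if length > 0 else 1
    return [t / safe_length for t in total]

print(masked_average([[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]], [7, 8, 0]))  # [2.0, 4.0]
print(masked_average([[9.0, 9.0], [9.0, 9.0]], [0, 0]))                  # [0.0, 0.0]
```

An empty sequence then yields a zero vector, which is a modelling choice rather than a mathematical necessity; masking such examples out of the loss would be another option.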
Solution 9
Here is the implementation of the binary (sigmoid) and categorical (softmax) cross-entropy losses in TensorFlow 1.1:
 https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_impl.py#L159
 https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/python/ops/nn_ops.py#L1609
As one can see in the binary case they consider some special cases to achieve numerical stability:
# The logistic loss formula from above is
#   x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
#   -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
#   max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(relu_logits - logits * labels,
                    math_ops.log1p(math_ops.exp(neg_abs_logits)),
                    name=name)
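The combined form, max(x, 0) - x * z + log(1 + exp(-abs(x))), can be compared against the direct formula x - x * z + log(1 + exp(-x)) in a few lines of plain Python. This is a rough sketch with made-up function names, not the TensorFlow code. Note that Python's math.exp raises OverflowError where float32 tensor arithmetic would silently produce inf.

```python
import math

def naive_logistic_loss(x, z):
    # Direct transcription of x - x*z + log(1 + exp(-x)).
    return x - x * z + math.log(1.0 + math.exp(-x))

def stable_logistic_loss(x, z):
    # max(x, 0) - x*z + log(1 + exp(-|x|)); the exponent is never positive.
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

# The two forms agree for moderate logits...
print(naive_logistic_loss(2.0, 1.0), stable_logistic_loss(2.0, 1.0))

# ...but only the stable form survives a large negative logit.
try:
    naive_logistic_loss(-1000.0, 1.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed)                     # True
print(stable_logistic_loss(-1000.0, 1.0))   # 1000.0
```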
Solution 10
2.0 Compatible Answer: Code to migrate @user1111929's answer from TensorFlow 1.x to TensorFlow 2.x is shown below:
TensorFlow 1.x:
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
TensorFlow 2.x:
cross_entropy = -tf.compat.v2.reduce_sum(y_*tf.math.log(tf.compat.v2.clip_by_value(y_conv,1e-10,1.0)))
or
cross_entropy = -tf.compat.v2.math.reduce_sum(y_*tf.math.log(tf.compat.v1.clip_by_value(y_conv,1e-10,1.0)))
Solution 11
I was getting nans sometimes and not other times while working on a standard feed-forward network. I had previously used similar TensorFlow code and it worked fine.
It turns out that I had imported the variable names (the first row of the file) as data by accident. So, as soon as that first row was selected in a batch, the nan losses started. Maybe keep an eye out for that?
Solution 12
I will add here one of my previous problems with NaNs. I was using the sigmoid function as the activation of the last layer of my network. However, computing the sigmoid involves the exponential function, and some really large numbers were entering the sigmoid.
This resulted in infinite gradients, and NaNs started to appear.
Solution 13
I've been using the TensorFlow Estimator API, which I believe accounts for division by zero and other numerical stability issues, and I occasionally get this error (ERROR:tensorflow:Model diverged with loss = NaN during training). Most of the time, I get it because my inputs include nans. So: be sure that your input dataframes (or whatever you use) don't have NaN values hidden somewhere in them.
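A quick way to rule this out before training is to scan the inputs for NaN. The sketch below does it for a plain list of rows; the function name find_nan_cells and the sample data are invented for illustration, and with a dataframe or array library you would normally use its own NaN check instead.

```python
import math

def find_nan_cells(rows):
    """Return (row_index, column_index) for every NaN in a list of numeric rows."""
    return [(i, j)
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if isinstance(value, float) and math.isnan(value)]

data = [[0.5, 1.2], [float("nan"), 3.0], [4.0, float("nan")]]
print(find_nan_cells(data))   # [(1, 0), (2, 1)]
```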
Solution 14
Another option is to use the tf.math.xlogy function. The function description says
"Returns 0 if x == 0, and x * log(y) otherwise, elementwise."
You can find the documentation here: https://www.tensorflow.org/api_docs/python/tf/math/xlogy
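As a rough plain-Python imitation of that contract (a scalar stand-in written for this example, not the TensorFlow op), the cross-entropy from the question could be written like this:

```python
import math

def xlogy(x, y):
    # Mirrors the documented contract: 0 if x == 0, else x * log(y).
    # Scalar stand-in for illustration only.
    return 0.0 if x == 0 else x * math.log(y)

labels = [0.0, 1.0, 0.0]
probs = [0.0, 0.9, 0.1]   # a zero probability paired with a zero label

cross_entropy = -sum(xlogy(y, p) for y, p in zip(labels, probs))
print(cross_entropy)      # finite, equal to -log(0.9)
```

Unlike clipping or adding a constant, this leaves every non-zero term of the sum untouched.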
Solution 15
In tf.log(y_conv), if y_conv is the output of a sigmoid activation function, there is a better way to calculate tf.log(y_conv).
Let y_conv = sigmoid(x). Then,
log(y_conv) = log(sigmoid(x))
= log(1 / (1 + exp(-x)))
= log(1 / (1 + exp(-x))) - x + x
= -log(1 + exp(-x)) - log(exp(x)) + x
= -log(1 + exp(x)) + x
= x - softplus(x)
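The final identity from the derivation above, log(sigmoid(x)) = x - softplus(x), can be checked with a small plain-Python sketch. The softplus here is written in an overflow-safe form, max(x, 0) + log(1 + exp(-|x|)), which is an assumption of this example rather than something stated in the answer.

```python
import math

def softplus(x):
    # log(1 + exp(x)), written so that exp never sees a positive argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_sigmoid(x):
    # log(sigmoid(x)) = x - softplus(x)
    return x - softplus(x)

print(log_sigmoid(0.0))      # approximately -0.6931, i.e. log(0.5)
print(log_sigmoid(-800.0))   # -800.0
print(log_sigmoid(800.0))    # 0.0
```

Computing log(1 / (1 + exp(-x))) directly at x = -800 would overflow in exp(800), whereas this form returns the correct value.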