What's the difference between sparse_softmax_cross_entropy_with_logits and softmax_cross_entropy_with_logits?

Neural Network · Tensorflow · Softmax · Cross Entropy

Neural Network Problem Overview


I recently came across tf.nn.sparse_softmax_cross_entropy_with_logits and I cannot figure out what the difference is compared to tf.nn.softmax_cross_entropy_with_logits.

Is the only difference that training vectors y have to be one-hot encoded when using sparse_softmax_cross_entropy_with_logits?

Reading the API, I was unable to find any other difference compared to softmax_cross_entropy_with_logits. But why do we need the extra function then?

Shouldn't softmax_cross_entropy_with_logits produce the same results as sparse_softmax_cross_entropy_with_logits, if it is supplied with one-hot encoded training data/vectors?

Neural Network Solutions


Solution 1 - Neural Network

Having two different functions is a convenience, as they produce the same result.

The difference is simple:

  • For sparse_softmax_cross_entropy_with_logits, labels must have the shape [batch_size] and the dtype int32 or int64. Each label is an int in range [0, num_classes-1].
  • For softmax_cross_entropy_with_logits, labels must have the shape [batch_size, num_classes] and dtype float32 or float64.

Labels used in softmax_cross_entropy_with_logits are the one-hot version of the labels used in sparse_softmax_cross_entropy_with_logits.

Another tiny difference is that with sparse_softmax_cross_entropy_with_logits, you can give -1 as a label to get a loss of 0 for that label.
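To make the two label formats concrete, here is a minimal sketch (using made-up logits and the TF 1.x-style API shown elsewhere on this page) that feeds the same targets to both ops, once as integer class indices and once as their one-hot expansion:

import tensorflow as tf

# Made-up logits for a batch of 2 examples and 3 classes.
logits = tf.constant([[2.0, 0.5, 1.0],
                      [0.1, 3.0, 0.2]])

# sparse_* expects integer class indices, shape [batch_size], dtype int32/int64.
sparse_labels = tf.constant([0, 1], dtype=tf.int64)

# softmax_* expects one row of class probabilities per example,
# shape [batch_size, num_classes], dtype float32/float64.
dense_labels = tf.one_hot(sparse_labels, depth=3)

loss_sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)
loss_dense = tf.nn.softmax_cross_entropy_with_logits(
    labels=dense_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([loss_sparse, loss_dense]))  # two equal length-2 loss vectors

Both calls return a per-example loss vector of shape [batch_size]; in practice you would usually pass it through tf.reduce_mean before minimizing.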

Solution 2 - Neural Network

I would just like to add two things to the accepted answer that you can also find in the TF documentation.

First:

> tf.nn.softmax_cross_entropy_with_logits
>
> NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

Second:

> tf.nn.sparse_softmax_cross_entropy_with_logits
>
> NOTE: For this operation, the probability of a given label is considered exclusive. That is, soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits (each minibatch entry).
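As an illustration of that distinction, a soft target distribution is perfectly legal for softmax_cross_entropy_with_logits but has no equivalent single index for the sparse variant; a minimal sketch with made-up numbers:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])

# A "soft" target: a valid probability distribution over the 3 classes
# that cannot be expressed as one integer index, so it is accepted by
# softmax_cross_entropy_with_logits but not by the sparse variant.
soft_labels = tf.constant([[0.7, 0.2, 0.1]])

loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))  # cross entropy against the soft distribution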

Solution 3 - Neural Network

Both functions compute the same result; sparse_softmax_cross_entropy_with_logits simply computes the cross entropy directly on the sparse labels instead of requiring them to be converted to one-hot encoding first.

You can verify this by running the following program:

import tensorflow as tf
from random import randint

dims = 8
pos  = randint(0, dims - 1)

# Random logits vector and the matching one-hot label for class `pos`.
logits = tf.random_uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)

# Dense (one-hot) labels vs. a single integer class index.
res1 = tf.nn.softmax_cross_entropy_with_logits(       logits=logits, labels=labels)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=tf.constant(pos))

with tf.Session() as sess:
    a, b = sess.run([res1, res2])
    print(a, b)
    print(a == b)

Here I create a random logits vector of length dims and generate one-hot encoded labels (where the element at index pos is 1 and the others are 0).

After that I calculate the softmax and sparse softmax cross entropies and compare their outputs. Try rerunning it a few times to verify that both always produce the same output.
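For intuition about why the two results match: with a one-hot target, the cross entropy sum collapses to the negative log-probability of the true class, which is exactly what the sparse op computes from the integer index. A small NumPy sketch of that reduction (made-up values):

import numpy as np

logits = np.array([2.0, 0.5, 1.0])
pos = 0  # index of the true class

# Softmax followed by cross entropy with a one-hot target:
# -sum_i y_i * log(p_i) keeps only the true-class term -log(p_pos).
probs = np.exp(logits) / np.sum(np.exp(logits))
print(-np.log(probs[pos]))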

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type                | Original Author  | Original Content on Stackoverflow
Question                    | daniel451        | View Question on Stackoverflow
Solution 1 - Neural Network | Olivier Moindrot | View Answer on Stackoverflow
Solution 2 - Neural Network | Drag0            | View Answer on Stackoverflow
Solution 3 - Neural Network | Salvador Dali    | View Answer on Stackoverflow