Why the 6 in relu6?

Tensorflow Problem Overview


I've hacked a deep feed-forward NN from scratch in R, and it seems more stable with "hard sigmoid" activations - max(0,min(1,x)) - than with ReLU. While trying to port it to TensorFlow, I noticed that this activation function isn't built in; there is only relu6, which uses an upper cutoff at 6. Is there a reason for this? (I realize that you could do relu6(x*6)/6, but if the TF guys put the 6 there for a good reason, I'd like to know.) I'd also like to know whether others have explosion problems with ReLU in feed-forward nets (I'm aware of the RNN issues).
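For reference, here is a minimal sketch of the rescaling workaround mentioned above, assuming TensorFlow 2.x; the function names are mine, and only tf.clip_by_value and tf.nn.relu6 are used:

import tensorflow as tf

def hard_sigmoid(x):
    # Direct clamp to [0, 1]: max(0, min(1, x))
    return tf.clip_by_value(x, 0.0, 1.0)

def hard_sigmoid_via_relu6(x):
    # The same function expressed through relu6: relu6(6x) / 6
    return tf.nn.relu6(x * 6.0) / 6.0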

Tensorflow Solutions


Solution 1 - Tensorflow

From this reddit thread:

> This is useful in making the networks ready for fixed-point inference. If you unbound the upper limit, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 will let them take a max of 3 bits (upto 8) leaving 4/5 bits for .f

It seems, then, that 6 is just an arbitrary value chosen according to the number of bits you want to be able to compress your network's trained parameters into. As for why only the version with the value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which is probably the most common use case.
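As a toy illustration of the fixed-point argument (not TensorFlow's actual quantization scheme), an unsigned Q3.5 format spends 3 bits on the integer part and 5 on the fraction, which comfortably covers the [0, 6] output range of ReLU6:

def quantize_q3_5(x):
    # Unsigned Q3.5: 3 integer bits + 5 fractional bits in one byte.
    # Representable range is [0, 255/32] ~ [0, 7.97], with a step of 1/32.
    scale = 2 ** 5
    return min(max(round(x * scale), 0), 255) / scale

print(quantize_q3_5(5.9))   # 5.90625
print(quantize_q3_5(6.0))   # 6.0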

Solution 2 - Tensorflow

TensorFlow's documentation (https://www.tensorflow.org/api_docs/python/tf/nn/relu6) points to the following paper:

> ... First, we cap the units at 6, so our ReLU activation function is y = min(max(x, 0), 6). In our tests, this encourages the model to learn sparse features earlier. In the formulation of [8], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.

http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf

Since the value originates from that paper, I suspect the authors tested different values of n and got the best results on their test set with n = 6.
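A general ReLU-n in the sense of the paper is straightforward to write yourself; here is a sketch assuming TensorFlow (the helper name relu_n is mine, not a library function):

import tensorflow as tf

def relu_n(x, n=6.0):
    # ReLU-n as defined in the paper: y = min(max(x, 0), n)
    return tf.minimum(tf.maximum(x, 0.0), n)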

Solution 3 - Tensorflow

If you want a different cap, for instance a ReLU1() because you are using hardcoded weights with binary data, it can be implemented as follows (in PyTorch):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLU1(nn.Module):
    def forward(self, x):
        # relu6 caps at 6; scaling by 6 before and after moves the cap to 1
        return F.relu6(x * 6.0) / 6.0

class ReLUX(nn.Module):
    def __init__(self, max_value: float = 1.0):
        super().__init__()
        self.max_value = float(max_value)
        self.scale = 6.0 / self.max_value

    def forward(self, x):
        # relu6 clips x * scale to [0, 6]; dividing by scale maps that back to [0, max_value]
        return F.relu6(x * self.scale) / self.scale
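A quick usage check of the two modules above, just to show the clipping ranges:

act = ReLU1()
print(act(torch.tensor([-1.0, 0.5, 2.0])))      # tensor([0.0000, 0.5000, 1.0000])

scaled = ReLUX(max_value=3.0)
print(scaled(torch.tensor([-1.0, 2.0, 5.0])))   # tensor([0., 2., 3.])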

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionFaultyBagnoseView Question on Stackoverflow
Solution 1 - TensorflowGPhiloView Answer on Stackoverflow
Solution 2 - TensorflowRickView Answer on Stackoverflow
Solution 3 - TensorflowJames McGuiganView Answer on Stackoverflow