How to do gradient clipping in pytorch?

Python, Machine Learning, Deep Learning, PyTorch, Gradient Descent

Python Problem Overview


What is the correct way to perform gradient clipping in pytorch?

I have an exploding gradients problem.

Python Solutions


Solution 1 - Python

A more complete example from here:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()

# Clip after backward() and before step(); args.clip is the maximum allowed gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()

Solution 2 - Python

clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent convention of a trailing _ when in-place modification is performed) clips the norm of the overall gradient, computed by concatenating the gradients of all parameters passed to the function, as can be seen from the documentation:

> The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
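
For illustration, a minimal sketch (the two-layer model below is made up, not part of the original answer) showing that all gradients are treated as one concatenated vector and rescaled by a common factor:

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 1))
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# clip_grad_norm_ returns the total norm of all gradients viewed as a single vector
# (measured before clipping); afterwards that combined norm is at most max_norm,
# and every gradient has been scaled by the same factor.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)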

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
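
As a sketch of where it fits (the model, loss_fn, inputs, targets, and optimizer names below are placeholders, not from the original answer), value clipping goes in the same place as norm clipping, between backward() and step():

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()

# Clamp each gradient element independently to [-1.0, 1.0]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
optimizer.step()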

Another option is to register a backward hook. This takes the current gradient as an input and may return a tensor which will be used in place of the previous gradient, i.e. modifying it. The hook is called each time after a gradient has been computed, so there is no need for manual clipping once the hook has been registered:

for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
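
A sketch of the resulting training loop (dataloader, loss_fn and optimizer are placeholder names): once the hooks are registered there is no explicit clipping call, since each gradient is clamped during backward() as soon as it is computed:

clip_value = 1.0
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()  # hooks clamp each gradient here
    optimizer.step()                            # no separate clipping call needed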

Solution 3 - Python

Reading through the forum discussion gave this:

clipping_value = 1  # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than this code snippet alone.

Solution 4 - Python

And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping
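
For context, the snippet above assumes a GradScaler created once and a forward pass run under autocast. A rough sketch of the full loop (the dataloader, loss function, and hyperparameters here are illustrative placeholders, not part of the original answer):

scaler = torch.cuda.amp.GradScaler()      # created once, before the training loop

for data, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward pass in mixed precision
        loss = loss_fn(model(data), targets)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.unscale_(optimizer)            # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                # skipped internally if grads contain infs/NaNs
    scaler.update()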

Solution 5 - Python

Well, I ran into the same error. I tried norm clipping, but it didn't work.

I didn't want to change the network or add regularizers, so I switched the optimizer to Adam, and it worked.

Then I used the model pretrained with Adam to initialize the training and used SGD + momentum for fine-tuning. It is now working.
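
A rough sketch of that two-stage recipe (the learning rates, epoch counts, and the train() helper are illustrative placeholders, not from the original answer):

# Stage 1: pretrain with Adam, which avoided the exploding gradients in this case
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train(model, optimizer, epochs=10)   # placeholder training routine

# Stage 2: fine-tune the same (now pretrained) model with SGD + momentum
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
train(model, optimizer, epochs=5)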

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author | Original Content on Stackoverflow
Question            | Gulzar          | View Question on Stackoverflow
Solution 1 - Python | Rahul           | View Answer on Stackoverflow
Solution 2 - Python | a_guest         | View Answer on Stackoverflow
Solution 3 - Python | Gulzar          | View Answer on Stackoverflow
Solution 4 - Python | hkchengrex      | View Answer on Stackoverflow
Solution 5 - Python | Charles Xu      | View Answer on Stackoverflow