How to do gradient clipping in pytorch?

Python, Machine Learning, Deep Learning, PyTorch, Gradient Descent

Python Problem Overview


What is the correct way to perform gradient clipping in pytorch?

I have an exploding gradients problem.

Python Solutions


Solution 1 - Python

A more complete example from here:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()

# Clip after backward() and before step(); args.clip is the maximum allowed gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()

Solution 2 - Python

clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent convention of a trailing _ when in-place modification is performed) clips the norm of the overall gradient, computed by concatenating the gradients of all parameters passed to the function, as can be seen from the documentation:

> The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
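
For illustration, a minimal sketch (the two-layer model below is made up, not part of the original answer) showing that all gradients are treated as one concatenated vector and rescaled by a common factor:

import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 1))
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# clip_grad_norm_ returns the total norm of all gradients viewed as a single vector
# (measured before clipping); afterwards that combined norm is at most max_norm,
# and every gradient has been scaled by the same factor.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)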

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
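
As a sketch of where it fits (the model, loss_fn, inputs, targets, and optimizer names below are placeholders, not from the original answer), value clipping goes in the same place as norm clipping, between backward() and step():

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()

# Clamp each gradient element independently to [-1.0, 1.0]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
optimizer.step()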

Another option is to register a backward hook. This takes the current gradient as an input and may return a tensor which will be used in place of the previous gradient, i.e. modifying it. The hook is called each time after a gradient has been computed, so there is no need for manual clipping once the hook has been registered:

for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
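
A sketch of the resulting training loop (dataloader, loss_fn and optimizer are placeholder names): once the hooks are registered there is no explicit clipping call, since each gradient is clamped during backward() as soon as it is computed:

clip_value = 1.0
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()  # hooks clamp each gradient here
    optimizer.step()                            # no separate clipping call needed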

Solution 3 - Python

Reading through the forum discussion gave this:

clipping_value = 1  # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than this code snippet alone.

Solution 4 - Python

And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping
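
For context, the snippet above assumes a GradScaler created once and a forward pass run under autocast. A rough sketch of the full loop (the dataloader, loss function, and hyperparameters here are illustrative placeholders, not part of the original answer):

scaler = torch.cuda.amp.GradScaler()      # created once, before the training loop

for data, targets in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward pass in mixed precision
        loss = loss_fn(model(data), targets)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.unscale_(optimizer)            # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                # skipped internally if grads contain infs/NaNs
    scaler.update()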

Solution 5 - Python

Well, I ran into the same error. I tried norm clipping, but it didn't work.

I didn't want to change the network or add regularizers, so I switched the optimizer to Adam, and it worked.

Then I used the model pretrained with Adam to initialize the training and used SGD + momentum for fine-tuning. It is now working.
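
A rough sketch of that two-stage recipe (the learning rates, epoch counts, and the train() helper are illustrative placeholders, not from the original answer):

# Stage 1: pretrain with Adam, which avoided the exploding gradients in this case
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train(model, optimizer, epochs=10)   # placeholder training routine

# Stage 2: fine-tune the same (now pretrained) model with SGD + momentum
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
train(model, optimizer, epochs=5)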

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author | Original Content on Stackoverflow
Question            | Gulzar          | View Question on Stackoverflow
Solution 1 - Python | Rahul           | View Answer on Stackoverflow
Solution 2 - Python | a_guest         | View Answer on Stackoverflow
Solution 3 - Python | Gulzar          | View Answer on Stackoverflow
Solution 4 - Python | hkchengrex      | View Answer on Stackoverflow
Solution 5 - Python | Charles Xu      | View Answer on Stackoverflow