How to fix this strange error: "RuntimeError: CUDA error: out of memory"
Problem Overview
I successfully trained the network but got this error during validation:
> RuntimeError: CUDA error: out of memory
Python Solutions
Solution 1 - Python
The error occurs because you ran out of memory on your GPU.
One way to solve it is to reduce the batch size until your code runs without this error.
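As a rough sketch of that idea, the loop below halves the batch size until one step runs without an out-of-memory error. The names find_working_batch_size, run_step and fake_step are hypothetical and not part of PyTorch; run_step stands for whatever function runs one training or validation step at a given batch size.

```python
def find_working_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= min_batch_size fits in memory")


# Fake step for illustration: pretends anything above 16 samples does not fit.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA error: out of memory")


print(find_working_batch_size(fake_step, start_batch_size=128))
```

In real code you would probably also call torch.cuda.empty_cache() between attempts, and then pass the batch size that worked to your DataLoader.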
Solution 2 - Python
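1. When you are only running validation, not training, you don't need gradients,
so there is no reason to record the autograd graph during the forward pass.
In that situation, you can put your validation code under torch.no_grad():
with torch.no_grad():
    ...
    net=Net()
    pred_for_validation=net(input)
    ...
The code above still uses GPU memory for the model and its outputs, but it does not store the intermediate results needed for backpropagation, so it typically needs far less memory.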
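A small self-contained check of this behaviour, assuming a stand-in nn.Linear model and random input (both hypothetical, and run on CPU so it works without a GPU):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the trained network.
net = nn.Linear(4, 2)
net.eval()  # switch layers such as dropout/batch norm to inference behaviour

# Hypothetical validation batch; real code would move it to the model's device.
inputs = torch.randn(8, 4)

# With autograd enabled, the output is attached to a graph.
tracked = net(inputs)

# Under no_grad, no graph is recorded for the forward pass.
with torch.no_grad():
    pred_for_validation = net(inputs)

print(tracked.requires_grad)              # True
print(pred_for_validation.requires_grad)  # False
```

Note that net.eval() and torch.no_grad() do different things: the first changes layer behaviour, the second turns off graph recording, so validation code usually wants both.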
2. If you accumulate the loss with the += operator,
each addition of a loss tensor keeps that iteration's autograd graph alive, so memory use grows with every iteration.
In that case, convert the loss to a plain number with float(), as described in the PyTorch FAQ:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Although the docs recommend float(), in my case item() also worked:
entire_loss=0.0
for i in range(100):
    one_loss=loss_function(prediction,label)
    entire_loss+=one_loss.item()
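Here is a runnable version of that loop as a sketch. The nn.Linear model, the MSE loss and the random data are placeholders I made up for illustration; only the use of item() comes from the answer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)          # hypothetical model
loss_function = nn.MSELoss()

entire_loss = 0.0                # plain Python float, not a tensor
for _ in range(5):
    prediction = model(torch.randn(4, 3))
    label = torch.zeros(4, 1)
    one_loss = loss_function(prediction, label)
    # .item() copies the scalar value out of the tensor, so the running
    # total holds no reference to this iteration's autograd graph.
    entire_loss += one_loss.item()

print(type(entire_loss))
```

Writing entire_loss += one_loss instead would turn entire_loss into a tensor that still references every iteration's graph.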
3. If you use a for loop in your training code,
variables assigned inside it stay alive until they are reassigned or the loop ends.
So, in that case, you can explicitly delete intermediate variables after calling optimizer.step():
for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1,intermediate_variable2,...
Solution 3 - Python
The best way is to find the process that is occupying GPU memory and kill it.
Find the PID of the Python process with:
nvidia-smi
Copy the PID and kill the process with:
sudo kill -9 pid
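If you would rather look up the PIDs from a script, nvidia-smi can print the compute processes as CSV. The sketch below assumes nvidia-smi is on your PATH; the helper names parse_compute_apps and list_gpu_processes are my own, and the sample text is made-up output used only to show the parsing.

```python
import subprocess

# Query flags documented under `nvidia-smi --help-query-compute-apps`.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into (pid, process_name, used_memory) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        pid, used = parts[0], parts[-1]
        name = ",".join(parts[1:-1])
        # used_memory is kept as text because some drivers report "[N/A]".
        rows.append((int(pid), name, used))
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the parsed process list (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


# Made-up sample output, so the parser can be tried without a GPU.
sample = "12345, python, 1500\n67890, python3, 320\n"
print(parse_compute_apps(sample))
```

Check that a PID really belongs to a stale job of yours before killing it, since kill -9 gives the process no chance to clean up.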
Solution 4 - Python
I had the same issue and this code worked for me:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
Solution 5 - Python
There are a number of possible reasons, which I have tried to collect in the following list:
- Module parameters: check the number of dimensions of your modules. A Linear layer that transforms a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) requires a weight matrix of size (1000, 1000).
- RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a big number of steps. Usually, you fix a given number of decoding steps that is reasonable for your dataset.
- Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
- Batch size: incrementally increase your batch size until you go out of memory. It's a common trick that even famous libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
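To put a number on the Linear layer example from the list above, here is a back-of-the-envelope estimate. It assumes float32 parameters (4 bytes each) and a bias term; linear_layer_bytes is a hypothetical helper, not a PyTorch function.

```python
def linear_layer_bytes(in_features, out_features, bytes_per_param=4, bias=True):
    """Estimate the memory taken by the parameters of one Linear layer."""
    n_params = in_features * out_features + (out_features if bias else 0)
    return n_params * bytes_per_param


# 1000 x 1000 weights plus 1000 biases, 4 bytes each.
print(linear_layer_bytes(1000, 1000))  # 4004000 bytes, about 4 MB
```

During training the gradients take the same amount again, and optimizers such as Adam keep extra per-parameter state, so the real footprint is several times this figure.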
Solution 6 - Python
I am a PyTorch user. In my case, this error message was not actually caused by GPU memory, but by a version mismatch between PyTorch and CUDA.
Check whether the cause is really your GPU memory with the code below.
import torch
foo = torch.tensor([1,2,3])
foo = foo.to('cuda')
If an error still occurs for the above code, it is better to re-install PyTorch according to your CUDA version. (In my case, this solved the problem.) See the PyTorch install page for the matching command.
A similar situation can also occur with TensorFlow/Keras.
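Before re-installing, it can help to print which versions you actually have. This is only a sketch: cuda_diagnostics is a hypothetical helper, and it reports what the installed PyTorch build says about itself, which you then compare by hand with the driver version shown by nvidia-smi.

```python
def cuda_diagnostics():
    """Collect basic facts about the installed PyTorch build and its CUDA support."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,   # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


info = cuda_diagnostics()
for key, value in info.items():
    print(f"{key}: {value}")
```

If built_with_cuda is None, or cuda_available is False on a machine that has a GPU, the installed build is a likely suspect.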
Solution 7 - Python
If you are getting this error in Google Colab, use this code:
import torch
torch.cuda.empty_cache()
Solution 8 - Python
Problem solved by the following code, which makes only GPUs 2 and 3 visible to the process (it has to run before CUDA is initialized):
import os
os.environ['CUDA_VISIBLE_DEVICES']='2, 3'
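A slightly more explicit sketch of the same idea. restrict_visible_gpus is a hypothetical helper; the important part is that the variable is set before anything initializes CUDA, which in practice means before importing torch. Presumably this helps when the other GPUs are already full.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given physical GPU indices to this process."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before CUDA is initialized.
restrict_visible_gpus([2, 3])

# import torch  # import and use CUDA only after the variable is set
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process the selected GPUs are renumbered from zero, so physical GPUs 2 and 3 show up as cuda:0 and cuda:1.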
Solution 9 - Python
If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N, where N is the size of the batch.
My dedicated GPU is limited to 2GB of memory; using bs=8 in the following example worked in my situation:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
Solution 10 - Python
I faced the same issue on my computer. All you have to do is customize your cfg file to suit your computer. It turns out my computer can only handle image sizes below 600 x 600, and when I set that in the config file, the program ran smoothly.