How to fix this strange error: "RuntimeError: CUDA error: out of memory"

Python, PyTorch

Python Problem Overview


I successfully trained the network but got this error during validation:

> RuntimeError: CUDA error: out of memory

Python Solutions


Solution 1 - Python

The error occurs because you ran out of memory on your GPU.

One way to solve it is to reduce the batch size until your code runs without this error.
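For example, with a standard PyTorch DataLoader the batch size is the batch_size argument. A minimal sketch (the toy dataset is a stand-in for your own data):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real one.
dataset = TensorDataset(torch.randn(100, 3, 224, 224))

# If e.g. batch_size=64 runs out of memory, halve it (64 -> 32 -> 16 ...)
# until validation fits in GPU memory.
val_loader = DataLoader(dataset, batch_size=16, shuffle=False)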

Solution 2 - Python

1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under

with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)  # no autograd graph is built inside no_grad()
    ...

The code above doesn't build the autograd graph, so the forward pass consumes far less GPU memory.
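A slightly fuller validation pattern, as a sketch assuming a trained net on the GPU and an iterable val_loader of input batches, also switches the model to eval mode:

net.eval()              # disable dropout and use running batch-norm statistics
with torch.no_grad():   # skip building the autograd graph
    for batch in val_loader:
        pred_for_validation = net(batch.to('cuda'))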

2. If you use the += operator on a tensor that requires gradients, the computation graph keeps growing and its history accumulates in memory. In that case, convert the value with float(), as described in the official FAQ:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory

The docs suggest float(), but in my case item() also worked:

entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()  # .item() returns a plain Python float detached from the graph

3. If you use a for loop in your training code, variables created inside it can stay alive until the entire loop ends. In that case, you can explicitly delete intermediate variables after optimizer.step():

for one_epoch in range(100):
    ...
    optimizer.step()
    # drop references so the tensors can be garbage-collected
    del intermediate_variable1, intermediate_variable2, ...

Solution 3 - Python

The best way is to find the process that is occupying GPU memory and kill it:

Find the PID of the Python process with:

nvidia-smi

Copy the PID and kill it with:

sudo kill -9 pid

Solution 4 - Python

I had the same issue and this code worked for me:

import gc
import torch

gc.collect()                # drop unreachable Python objects holding tensors
torch.cuda.empty_cache()    # release cached GPU memory back to the driver
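Note that empty_cache() only returns blocks held by PyTorch's caching allocator to the driver; tensors that are still referenced keep their memory. You can check the effect with the standard torch.cuda counters:

import torch

print(torch.cuda.memory_allocated())  # bytes occupied by live tensors
print(torch.cuda.memory_reserved())   # bytes held by the caching allocator
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())   # should shrink if cached blocks were freed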

Solution 5 - Python

It might happen for a number of reasons, which I try to report in the following list:

  1. Modules parameters: check the dimensions of your modules. A linear layer that transforms a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) requires a weight matrix of size (1000, 1000).
  2. RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. Usually, you fix a number of decoding steps that is reasonable for your dataset.
  3. Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
  4. Batch size: incrementally increase your batch size until you go out of memory, to find the largest one that fits. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP); a sketch of this search follows this list.
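A minimal sketch of that batch-size search, assuming your own model and a hypothetical make_batch(n) helper that builds an input batch of size n:

import torch

def find_max_batch_size(model, make_batch, start=1, limit=1024):
    # Double the batch size until CUDA reports an out-of-memory error.
    best, bs = None, start
    while bs <= limit:
        try:
            with torch.no_grad():
                model(make_batch(bs))
            best = bs
            bs *= 2
        except RuntimeError as e:
            if 'out of memory' in str(e):
                torch.cuda.empty_cache()  # release the failed allocation's cached blocks
                break
            raise  # re-raise anything that isn't an OOM error
    return best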

In addition, I would recommend you have a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html

Solution 6 - Python

I am a PyTorch user. In my case, the cause of this error message was actually not GPU memory, but a version mismatch between PyTorch and CUDA.

Check whether the cause really is your GPU memory with the code below.

import torch
foo = torch.tensor([1, 2, 3])
foo = foo.to('cuda')  # fails here if the PyTorch build and CUDA version mismatch

If an error still occurs for the code above, it is better to re-install PyTorch to match your CUDA version (in my case, this solved the problem): https://pytorch.org/get-started/locally/
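A quick way to inspect the versions involved (these are standard torch attributes):

import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version this build was compiled against
print(torch.cuda.is_available())  # False often indicates a mismatch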

A similar issue can also occur with TensorFlow/Keras.

Solution 7 - Python

If you are getting this error in Google Colab, use this code:

import torch
torch.cuda.empty_cache()

Solution 8 - Python

The problem was solved by the following code:

import os

# Restrict this process to specific GPUs; set this before CUDA is initialised
# (ideally before importing torch).
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

Solution 9 - Python

If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N, where N is the batch size.

My dedicated GPU is limited to 2 GB of memory; using bs=8 in the following example worked in my situation:

from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

Solution 10 - Python

I faced the same issue with my computer. All you have to do is customize your cfg file to suit your machine. It turned out my computer could only handle image sizes below 600 x 600, and once I adjusted the input size in the config file accordingly, the program ran smoothly.

[Picture describing the cfg file]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         | Original Author    | Original Content on Stackoverflow
Question             | xiaoding chen      | View Question on Stackoverflow
Solution 1 - Python  | K. Khanda          | View Answer on Stackoverflow
Solution 2 - Python  | YoungMin Park      | View Answer on Stackoverflow
Solution 3 - Python  | Milad shiri        | View Answer on Stackoverflow
Solution 4 - Python  | behnaz.sheikhi     | View Answer on Stackoverflow
Solution 5 - Python  | Alessandro Suglia  | View Answer on Stackoverflow
Solution 6 - Python  | Toru Kikuchi       | View Answer on Stackoverflow
Solution 7 - Python  | Themba Tman        | View Answer on Stackoverflow
Solution 8 - Python  | ah bon             | View Answer on Stackoverflow
Solution 9 - Python  | dgellow            | View Answer on Stackoverflow
Solution 10 - Python | nimish nahar       | View Answer on Stackoverflow