What is the canonical way to check for errors using the CUDA runtime API?

Cuda Problem Overview

Looking through the answers and comments on CUDA questions, and in the CUDA tag wiki, I see it is often suggested that the return status of every API call should checked for errors. The API documentation contains functions like cudaGetLastError, cudaPeekAtLastError, and cudaGetErrorString, but what is the best way to put these together to reliably catch and report errors without requiring lots of extra code?

Cuda Solutions

Solution 1 - Cuda

Probably the best way to check for errors in runtime API code is to define an assert style handler function and wrapper macro like this:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

You can then wrap each API call with the gpuErrchk macro, which will process the return status of the API call it wraps, for example:

gpuErrchk( cudaMalloc(&a_d, size*sizeof(int)) );

If there is an error in a call, a textual message describing the error and the file and line in your code where the error occurred will be emitted to stderr and the application will exit. You could conceivably modify gpuAssert to raise an exception rather than call exit() in a more sophisticated application if it were required.

A second related question is how to check for errors in kernel launches, which can't be directly wrapped in a macro call like standard runtime API calls. For kernels, something like this:

kernel<<<1,1>>>(a);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );

will firstly check for invalid launch argument, then force the host to wait until the kernel stops and checks for an execution error. The synchronisation can be eliminated if you have a subsequent blocking API call like this:

kernel<<<1,1>>>(a_d);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaMemcpy(a_h, a_d, size * sizeof(int), cudaMemcpyDeviceToHost) );

in which case the cudaMemcpy call can return either errors which occurred during the kernel execution or those from the memory copy itself. This can be confusing for the beginner, and I would recommend using explicit synchronisation after a kernel launch during debugging to make it easier to understand where problems might be arising.

Note that when using CUDA Dynamic Parallelism, a very similar methodology can and should be applied to any usage of the CUDA runtime API in device kernels, as well as after any device kernel launches:

#include <assert.h>
#define cdpErrchk(ans) { cdpAssert((ans), __FILE__, __LINE__); }
__device__ void cdpAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      printf("GPU kernel assert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) assert(0);
   }
}

Solution 2 - Cuda

talonmies' answer above is a fine way to abort an application in an assert-style manner.

Occasionally we may wish to report and recover from an error condition in a C++ context as part of a larger application.

Here's a reasonably terse way to do that by throwing a C++ exception derived from std::runtime_error using thrust::system_error:

#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>
#include <sstream>

void throw_on_cuda_error(cudaError_t code, const char *file, int line)
{
  if(code != cudaSuccess)
  {
    std::stringstream ss;
    ss << file << "(" << line << ")";
    std::string file_and_line;
    ss >> file_and_line;
    throw thrust::system_error(code, thrust::cuda_category(), file_and_line);
  }
}

This will incorporate the filename, line number, and an English language description of the cudaError_t into the thrown exception's .what() member:

#include <iostream>

int main()
{
  try
  {
    // do something crazy
    throw_on_cuda_error(cudaSetDevice(-1), __FILE__, __LINE__);
  }
  catch(thrust::system_error &e)
  {
    std::cerr << "CUDA error after cudaSetDevice: " << e.what() << std::endl;

    // oops, recover
    cudaSetDevice(0);
  }

  return 0;
}

The output:

$ nvcc exception.cu -run
CUDA error after cudaSetDevice: exception.cu(23): invalid device ordinal

A client of some_function can distinguish CUDA errors from other kinds of errors if desired:

try
{
  // call some_function which may throw something
  some_function();
}
catch(thrust::system_error &e)
{
  std::cerr << "CUDA error during some_function: " << e.what() << std::endl;
}
catch(std::bad_alloc &e)
{
  std::cerr << "Bad memory allocation during some_function: " << e.what() << std::endl;
}
catch(std::runtime_error &e)
{
  std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}
catch(...)
{
  std::cerr << "Some other kind of error during some_function" << std::endl;

  // no idea what to do, so just rethrow the exception
  throw;
}

Because thrust::system_error is a std::runtime_error, we can alternatively handle it in the same manner of a broad class of errors if we don't require the precision of the previous example:

try
{
  // call some_function which may throw something
  some_function();
}
catch(std::runtime_error &e)
{
  std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}

Solution 3 - Cuda

The C++-canonical way: Don't check for errors...use the C++ bindings which throw exceptions.

I used to be irked by this problem; and I used to have a macro-cum-wrapper-function solution just like in Talonmies and Jared's answers, but, honestly? It makes using the CUDA Runtime API even more ugly and C-like.

So I've approached this in a different and more fundamental way. For a sample of the result, here's part of the CUDA vectorAdd sample - with complete error checking of every runtime API call:

// (... prepare host-side buffers here ...)

auto current_device = cuda::device::current::get();
auto d_A = cuda::memory::device::make_unique<float[]>(current_device, numElements);
auto d_B = cuda::memory::device::make_unique<float[]>(current_device, numElements);
auto d_C = cuda::memory::device::make_unique<float[]>(current_device, numElements);

cuda::memory::copy(d_A.get(), h_A.get(), size);
cuda::memory::copy(d_B.get(), h_B.get(), size);

auto launch_config = cuda::launch_config_builder()
	.overall_size(numElements)
	.block_size(256)
	.build();

cuda::launch(vectorAdd, launch_config,
    d_A.get(), d_B.get(), d_C.get(), numElements
);    
cuda::memory::copy(h_C.get(), d_C.get(), size);

// (... verify results here...)

Again - all potential errors are checked , and an exception if an error occurred (caveat: If the kernel caused some error after launch, it will be caught after the attempt to copy the result, not before; to ensure the kernel was successful you would need to check for error between the launch and the copy with a cuda::outstanding_error::ensure_none() command).

The code above uses my

Thin Modern-C++ wrappers for the CUDA Runtime API library (Github)

Note that the exceptions carry both a string explanation and the CUDA runtime API status code after the failing call.

A few links to how CUDA errors are automagically checked with these wrappers:

Solution 4 - Cuda

The solution discussed here worked well for me. This solution uses built-in cuda functions and is very simple to implement.

The relevant code is copied below:

#include <stdio.h>
#include <stdlib.h>

__global__ void foo(int *ptr)
{
  *ptr = 7;
}

int main(void)
{
  foo<<<1,1>>>(0);

  // make the host block until the device is finished with foo
  cudaDeviceSynchronize();

  // check for error
  cudaError_t error = cudaGetLastError();
  if(error != cudaSuccess)
  {
    // print the CUDA error message and exit
    printf("CUDA error: %s\n", cudaGetErrorString(error));
    exit(-1);
  }

  return 0;
}

Content Type	Original Author	Original Content on Stackoverflow
Question	talonmies	View Question on Stackoverflow
Solution 1 - Cuda	talonmies	View Answer on Stackoverflow
Solution 2 - Cuda	Jared Hoberock	View Answer on Stackoverflow
Solution 3 - Cuda	einpoklum	View Answer on Stackoverflow
Solution 4 - Cuda	jthomas	View Answer on Stackoverflow