<Cuda> templates

This document gives a brief introduction into memory-related CUDA features and discusses benefits provided by the CUDA template classes.

1. Memory types

NVIDIA's GPUs support various types of memory, each optimized for a particular access pattern. The CUDA toolkit provides functions for allocating memory of each type and for copying data between the different memory types (including host/device data transfer). Only issues relevant to the CUDA templates are discussed here, please see the CUDA Programming Guide for more information.

Host memory (CPU)

Heap memory is regular pageable host memory allocated by malloc(). CUDA supports copying data between heap memory and any type of device memory, though performance is not optimal.
Page-locked host memory is mapped into the address space of the GPU and can be accessed directly, thus increasing bandwidth. Moreover, some devices can also perform copies between page-locked host memory and device memory concurrently with kernel execution. Page-locked host memory is therefore best used as a data exchange area for data which need to be updated frequently.

Device memory (GPU)

The dynamic memory on the graphics card is available to the GPU via its linear device memory interface. It is not cached, but highly optimized for stream processing, therefore irregular access patterns significantly degrade performance. A relatively small amount of static shared memory is available to overcome the restrictions of linear memory. However, since shared memory cannot be allocated dynamically, and the host cannot access the GPU's shared memory, there is no CUDA templates representation of it.
Performance depends on well aligned memory addresses. To ensure proper alignment of more-dimensional data, several pitched device memory functions are available. These automatically insert sufficient bytes of padding (e.g., at the end of each row in an image such that the next row starts at a well aligned memory address).
A CUDA array is used to access the GPU's texture units in CUDA. Texture units support various optimizations useful for accessing images such as linear interpolation and caching.
A separate region of constant memory is available for program parameters and similar purposes. It is cached and therefore well suited for frequently accessed constant data without consuming register space.

OpenGL interoperability

CUDA supports direct data exchange between OpenGL and CUDA applications. This typically involves access to an OpenGL buffer object, which can be mapped into the address space of a CUDA application. Buffer objects can be used in various ways in OpenGL, e.g., as a data source in texture creation commands. For this purpose, an OpenGL texture class is included in the CUDA templates.

2. Consistent interface

Access to all above-mentioned memory types is implemented in the CUDA toolkit. However, function signatures largely differ depending on the particular memory type involved. Moreover, the size of the data to be processed is given in bytes for some functions and in terms of elements (e.g., floats) for other functions. Error conditions must be checked explicitly, otherwise the program continues in an undefined state.

The main goal of the CUDA templates is to provide a clean and consistent interface to the underlying functions of the CUDA toolkit. Each of the different memory types is represented as a class template parameterized by the element data type and the dimension (most of them with specializations for one, two, and three dimensions). Since the data type (i.e., the type of memory to be accessed) is known at compile time, all the details about different data access methods are left to the compiler, therefore a single template function copy() can handle all possible memory transactions. Other benefits of the object-oriented approach are the mapping of CUDA errors to corresponding exceptions and the automatic deallocation of resources in the class destructors.

3. Image libraries

To simplify integration of CUDA with existing applications, the CUDA templates include compatibility classes for the following image libraries:

Insight Toolkit by Kitware, Inc.
Generic Image Library distributed with boost
OpenCV

Image data created by one of these libraries can be used with the CUDA templates in the same way as data natively allocated by some CUDA template class. The CUDA templates do not define their own image I/O methods, but instead allow the programmer to use his favourite library for this purpose (or easily add integration for other libraries if none of the above matches).