Combining CUDA with Thrust: Simplifying Parallel Development
Introduction
GPU computing has revolutionized numerous fields where massive parallelism can significantly accelerate computations—from physics simulations, to image processing, to deep learning. NVIDIA’s CUDA (Compute Unified Device Architecture) provides a flexible and powerful model for programming GPUs. While CUDA’s flexibility is a major advantage, the lower-level APIs can be daunting for newcomers. Fortunately, NVIDIA’s Thrust library simplifies many GPU operations by offering a high-level interface for common parallel algorithms and data structures.
This blog post walks you step-by-step through combining CUDA with Thrust, starting from the basics and moving on to advanced applications. Whether you are just getting started or expanding an existing project to incorporate professional-level parallel solutions, this article offers guidance on effective strategies, memory management, and performance tuning.
Table of Contents
- CUDA Fundamentals
- Introducing Thrust
- Setting up Your Environment
- Basic Thrust Operations
- Memory Management in Thrust
- Sorting with Thrust
- Transformations and Custom Functors
- Combining Thrust with Custom CUDA Kernels
- Performance Considerations
- Advanced Topics
- Conclusion
CUDA Fundamentals
What is CUDA?
CUDA is a parallel computing platform created by NVIDIA. It grants developers direct access to the GPU’s virtual instruction set and memory, allowing the creation of massively parallel algorithms. With CUDA, you write “kernels,” which are functions executed on the GPU in parallel by thousands—or even millions—of lightweight threads. These kernels are compiled by NVCC (NVIDIA’s compiler), enabling them to run on compatible NVIDIA GPUs.
Core Concepts
- Kernels: Functions executed by multiple threads in parallel.
- Threads: The smallest unit of parallelism on the GPU.
- Blocks: Groups of threads arranged in 1D, 2D, or 3D.
- Grid: A collection of blocks that execute a kernel.
- Memory Hierarchy: Registers, shared memory, global memory, constant memory, and texture memory.
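To make these concepts concrete, here is a minimal sketch of a kernel and its launch configuration (the kernel name and sizes are illustrative, not from any particular codebase): each thread handles one element, threads are grouped into blocks, and the grid contains enough blocks to cover the whole array.

__global__ void scaleKernel(float* data, float factor, int n) {
    // Global index: block offset plus thread offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= factor;
    }
}

// Launch: enough 256-thread blocks to cover n elements
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);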
For smaller parallel tasks, you can often rely on libraries like Thrust to handle the “under the hood” complexities. But understanding how CUDA is structured will help you diagnose performance bottlenecks and write custom kernels if required.
Why CUDA?
• Massive Parallelism: GPUs excel at tasks that can be parallelized over many threads.
• Mature Ecosystem: The CUDA ecosystem has grown to include an abundance of libraries, debugging tools, and documentation.
• Flexibility: Even if you use high-level abstractions like Thrust, you can always write custom kernels for maximum control.
Introducing Thrust
What is Thrust?
Thrust is a C++ template library for CUDA-based parallel programming. Modeled after the Standard Template Library (STL), Thrust offers a familiar interface for algorithms like sort, reduce, transform, scan, and more. By providing high-level functions that map to efficient CUDA kernels, Thrust drastically reduces the amount of boilerplate code you need to write, letting you focus on your specific problem rather than low-level details.
Key Features
- STL-Like Interface: Use thrust::host_vector, thrust::device_vector, thrust::sort, etc. in a manner similar to standard C++ containers and algorithms.
- Host/Device Vectors: Manage memory automatically on the host (CPU) or device (GPU).
- Portability: Thrust backends include CUDA, OpenMP, and TBB (Threading Building Blocks). You can switch between them by modifying compiler flags.
- Integration with CUDA: You can interleave Thrust calls with custom CUDA kernels.
Benefits over Pure CUDA
• Less Boilerplate: Thrust reduces the overhead of writing your own allocation, copying, and synchronization code.
• Concise Code: STL-like syntax keeps your code maintainable and easier to read.
• Performance: Thrust’s algorithms are already optimized for the GPU, so common routines are highly efficient without extra tuning.
By mastering Thrust, you can achieve a good balance of productivity and performance—often without needing to write custom CUDA kernels. Still, when specialized operations are necessary, you can freely mix in your custom kernels.
Setting up Your Environment
Before diving into examples, ensure you have a proper development environment:
- NVIDIA GPU: A CUDA-compatible graphics card with supported drivers.
- CUDA Toolkit: Includes the NVCC compiler and the necessary libraries. Make sure to install the appropriate version for your system.
- C++ Compiler: Required to compile host code and call NVCC for device code. Common choices include GCC or Clang on Linux, and MSVC on Windows.
- IDE or Build System: While you can compile from the command line, an IDE like Visual Studio or CLion can simplify the development process. Alternatively, use a build system like CMake.
Sample Build Command
If you have a file named thrust_example.cu, you might compile it with a command like:
nvcc thrust_example.cu -o thrust_example
To enable different backends or advanced features, refer to the Thrust documentation or pass additional flags, such as -Xcompiler -fopenmp with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP (for the OpenMP backend) or -std=c++14 for modern C++ features.
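If you build with CMake instead, a minimal sketch might look like the following (the project and file names are illustrative; CMake 3.18+ has first-class CUDA support):

cmake_minimum_required(VERSION 3.18)
project(thrust_example LANGUAGES CXX CUDA)

# Thrust ships with the CUDA Toolkit, so no extra dependency is needed
add_executable(thrust_example thrust_example.cu)
set_target_properties(thrust_example PROPERTIES CUDA_STANDARD 14)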
Basic Thrust Operations
One of Thrust’s most significant advantages is its STL-like interface. Let’s look at some core operations that will get you up and running quickly.
The Thrust Vector
Thrust provides both host and device vector types:
thrust::host_vector<T>
thrust::device_vector<T>
They behave similarly to std::vector<T>, but are specialized for host and device memory, respectively.
Simple Vector Example
Below is a very simple example demonstrating the creation of a device_vector, initializing it, and printing its contents. For printing, we usually copy the data back to the host (or use an alternative approach such as host iterators).
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
    // Create a device_vector of size 5
    thrust::device_vector<int> d_vec(5);

    // Fill the vector with values 0, 1, 2, 3, 4
    for (int i = 0; i < 5; i++) {
        d_vec[i] = i;
    }

    // Transfer data to the host
    thrust::host_vector<int> h_vec = d_vec;

    // Print on the host
    for (int i = 0; i < 5; i++) {
        std::cout << "Element " << i << ": " << h_vec[i] << std::endl;
    }

    return 0;
}
Host-to-Device Transfers
Notice that vectors automatically manage memory allocation and transfers for you. When you assign a device_vector to a host_vector, Thrust handles copying data from the GPU to the CPU behind the scenes.
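If you prefer to make the transfer explicit rather than relying on assignment, thrust::copy performs the same device-to-host copy. A minimal sketch reusing h_vec and d_vec from the example above (note that h_vec must already be sized to hold the data):

#include <thrust/copy.h>

// Explicit device-to-host copy, equivalent to the assignment h_vec = d_vec
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());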
Memory Management in Thrust
Thrust offers a simplified approach to memory management, but understanding the underlying concepts is important, especially when mixing in custom CUDA kernels.
Host and Device Vectors
As mentioned, Thrust provides:
- thrust::host_vector<T>: Stored in (and accessible from) CPU memory.
- thrust::device_vector<T>: Stored in GPU device memory.
Raw Pointers and Iterators
When advanced control is needed, Thrust vectors can give you raw pointers (device pointers) via the data() function:
thrust::device_vector<float> d_vec(100);

// Get a raw pointer to the data
float* raw_ptr = thrust::raw_pointer_cast(d_vec.data());

// Pass raw_ptr to custom CUDA kernels, if necessary
This pointer can be used in a kernel to operate on the data. However, you need to ensure synchronization and that the pointer remains valid (i.e., the device_vector persists throughout the kernel operation).
When to Use Thrust vs. Raw CUDA Memory
In many cases, Thrust is sufficient. However, you might prefer raw CUDA memory when you:
- Need advanced memory allocation patterns, or want to use pinned or unified memory.
- Require maximum control for performance-tuning.
- Have custom memory pooling logic.
Still, for most use cases, device_vector is an excellent, straightforward choice.
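Conversely, memory you allocate yourself with cudaMalloc can still be handed to Thrust algorithms by wrapping it in a thrust::device_ptr. A minimal sketch:

#include <thrust/device_ptr.h>
#include <thrust/fill.h>

int n = 100;
int* raw;
cudaMalloc((void**)&raw, n * sizeof(int));

// Wrap the raw allocation so Thrust algorithms can operate on it
thrust::device_ptr<int> dev_ptr(raw);
thrust::fill(dev_ptr, dev_ptr + n, 0);

cudaFree(raw);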
Sorting with Thrust
Sorting is one of the most common operations in data processing. Thrust’s sort function provides a powerful GPU-accelerated sort without writing your own kernel.
Example: Thrust Sort
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main() {
    // Create and fill a device vector
    thrust::device_vector<int> d_vec(5);
    d_vec[0] = 30;
    d_vec[1] = 10;
    d_vec[2] = 50;
    d_vec[3] = 20;
    d_vec[4] = 40;

    // Sort in ascending order
    thrust::sort(d_vec.begin(), d_vec.end());

    // Copy back to host and print
    thrust::host_vector<int> h_vec = d_vec;
    for (size_t i = 0; i < h_vec.size(); i++) {
        std::cout << h_vec[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
The code above:
- Declares a device_vector of integers.
- Assigns some unsorted values.
- Calls thrust::sort.
- Copies the sorted data to a host vector for printing.
Sorting in Descending Order
To sort in descending order, you can provide a custom comparator:
thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<int>());
Thrust also includes useful comparators like thrust::less<T> and thrust::greater<T>. Or, you can write your own for custom ordering logic.
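For example, a hand-written comparator can sort by magnitude, largest first. A minimal sketch (the struct name is ours):

struct abs_greater
{
    __host__ __device__
    bool operator()(int a, int b) const {
        // Compare absolute values, largest first
        return (a < 0 ? -a : a) > (b < 0 ? -b : b);
    }
};

// thrust::sort(d_vec.begin(), d_vec.end(), abs_greater());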
Sorting Pairs (Key-Value)
If you need to sort keys with associated values, Thrust provides sort_by_key. For instance:
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    thrust::device_vector<int> keys(5);
    thrust::device_vector<float> values(5);

    keys[0] = 4;  values[0] = 1.2f;
    keys[1] = 2;  values[1] = 3.4f;
    keys[2] = 5;  values[2] = 0.1f;
    keys[3] = 1;  values[3] = 2.9f;
    keys[4] = 3;  values[4] = 4.5f;

    // Sort by key in ascending order
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

    // ...
}
After this sort, the entries in values will have been reordered to match the new ordering in keys.
Transformations and Custom Functors
Overview of Transform
A common parallel operation is applying a function to each element of a vector, generating a corresponding output in either the same or a different data set. In Thrust, use thrust::transform for this:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    thrust::device_vector<int> d_input(5);
    thrust::device_vector<int> d_output(5);

    // Initialize input
    for (int i = 0; i < 5; i++) {
        d_input[i] = i;
    }

    // Square each element: output[i] = input[i]^2
    thrust::transform(d_input.begin(), d_input.end(),
                      d_output.begin(),
                      thrust::square<int>()); // built-in functor from <thrust/functional.h>

    // Copy to host for printing
    thrust::host_vector<int> h_output = d_output;
    for (int i = 0; i < 5; i++) {
        std::cout << h_output[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
Thrust provides a set of predefined functors in <thrust/functional.h> (like plus<T>, multiplies<T>, etc.). You can also define your own.
Writing a Custom Functor
Out-of-the-box functors might not cover everything. Here’s how to define one:
struct multiply_by_constant
{
    const int c;

    multiply_by_constant(int _c) : c(_c) {}

    __host__ __device__
    int operator()(const int& x) const {
        return x * c;
    }
};
- Mark the functor with both __host__ and __device__ to ensure it’s callable from GPU code.
- In the constructor, save any constants you need.
- Define operator() to implement your transformation logic.
You can then call thrust::transform with an instance of your functor:
thrust::transform(d_input.begin(), d_input.end(), d_output.begin(), multiply_by_constant(5));
Combining Thrust with Custom CUDA Kernels
Even with Thrust, there may be times you want a custom kernel. Thrust allows you to interleave your own GPU kernel code with Thrust operations. A typical workflow might be:
- Create or update a device_vector using Thrust operations.
- Extract a raw pointer with thrust::raw_pointer_cast.
- Launch a custom kernel.
- Further manipulate or read results with Thrust.
Example Workflow
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <iostream>

// Custom kernel: each thread increments one element
__global__ void addOneKernel(int* d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_data[idx] += 1;
    }
}

int main() {
    // Step 1: Create device_vector
    thrust::device_vector<int> d_vec(5, 0);

    // Step 2: Set initial values using Thrust
    thrust::sequence(d_vec.begin(), d_vec.end(), 0);

    // Step 3: Launch custom kernel
    int* raw_ptr = thrust::raw_pointer_cast(d_vec.data());
    addOneKernel<<<1, 5>>>(raw_ptr, d_vec.size());
    cudaDeviceSynchronize();

    // Step 4: Print results
    thrust::host_vector<int> h_vec = d_vec;
    for (size_t i = 0; i < h_vec.size(); i++) {
        std::cout << h_vec[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
In this example:
- We create d_vec of size 5, initialized to zero.
- Assign values using thrust::sequence(), which populates the vector with consecutive values: 0, 1, 2, 3, 4.
- Extract the raw pointer and run a custom kernel that increments each element by 1.
- Finally, we transfer the data back to the host and print.
- Finally, we transfer the data back to the host and print.
Performance Considerations
While Thrust usually yields strong performance out of the box, you can improve efficiency in several ways.
Common Bottlenecks
- Memory Transfers: Moving data between host and device is expensive. Aim to minimize frequent copying.
- Kernel Launch Overheads: Too many small kernel launches can reduce performance; consider batching work or using Thrust algorithms that operate on large chunks.
- Non-Coalesced Memory Access: Ensure data is aligned and accessed in a GPU-friendly way.
Potential Optimizations
- Use Unified Memory (if applicable) for simpler data movement.
- Use Streams to overlap memory transfers with computations.
- Reuse Vectors: Instead of creating/destroying vectors often, resize them as needed.
- Benchmark: Evaluate each step to understand where time is spent.
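For the benchmarking advice above, CUDA events measure GPU-side time without host timers. A minimal sketch timing a sort (the vector size is arbitrary, and the vector is reused rather than reallocated):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main() {
    thrust::device_vector<int> d_vec(1 << 20);
    // ... fill d_vec with real data ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::sort(d_vec.begin(), d_vec.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::cout << "sort took " << ms << " ms" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}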
Thrust Algorithm Complexity
The table below shows complexities for some common Thrust operations:
| Operation | Complexity |
|---|---|
| sort | O(N log N) |
| reduce | O(N) |
| scan | O(N) |
| transform | O(N) |
| gather/scatter | O(N) |
These align with typical complexities known from CPU algorithms, but the GPU-based implementations are optimized for parallel performance.
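For instance, a full O(N) reduction is a single call. A minimal sketch (assuming d_vec is an existing thrust::device_vector<int>):

#include <thrust/reduce.h>
#include <thrust/functional.h>

// Sum all elements of d_vec on the GPU
int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());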
Advanced Topics
Thrust supports many advanced features and patterns. Below are a few that you might explore as your expertise grows.
Scatter and Gather
- Scatter: Writes elements from an input range into a new range at indices specified by a scatter map.
- Gather: The reverse operation of scatter, collecting data from different indices.
Example gather call:
#include <thrust/device_vector.h>
#include <thrust/gather.h>

thrust::device_vector<int> source(5);
thrust::device_vector<int> map(5);
thrust::device_vector<int> result(5);

// Set data
// ...

// result[i] = source[map[i]]
thrust::gather(map.begin(), map.end(), source.begin(), result.begin());
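The scatter direction is symmetric, writing source[i] to result[map[i]]. A sketch using the same vectors:

#include <thrust/scatter.h>

// result[map[i]] = source[i]
thrust::scatter(source.begin(), source.end(), map.begin(), result.begin());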
Segmented Reductions
For more complex operations (e.g., summing values within segments of a vector), Thrust offers segmented reduction patterns such as thrust::reduce_by_key. You can also combine transform_iterator with reduction algorithms for efficient solutions to these kinds of partitioned problems.
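A minimal reduce_by_key sketch, assuming the keys are already grouped:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    int h_keys[]   = {0, 0, 1, 1, 1};
    int h_values[] = {1, 2, 3, 4, 5};
    thrust::device_vector<int> keys(h_keys, h_keys + 5);
    thrust::device_vector<int> values(h_values, h_values + 5);

    thrust::device_vector<int> keys_out(5);
    thrust::device_vector<int> sums(5);

    // Sums each run of equal consecutive keys:
    // keys_out = {0, 1}, sums = {3, 12}
    thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(),
                          keys_out.begin(), sums.begin());
    return 0;
}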
Custom Iterators
Thrust’s flexible iterator framework includes:
- Permutation Iterators: Access elements in a permuted order without copying data.
- Zip Iterators: Combine multiple ranges into one, effectively creating tuples of elements.
- Transform Iterators: Apply a transformation on-the-fly as you traverse the data.
For instance, a zip_iterator can help you operate on two vectors in a single pass.
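A minimal sketch of that pattern (the functor name is ours):

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

struct add_components
{
    __host__ __device__
    float operator()(const thrust::tuple<float, float>& t) const {
        return thrust::get<0>(t) + thrust::get<1>(t);
    }
};

// c[i] = a[i] + b[i] in a single pass:
// thrust::transform(
//     thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin())),
//     thrust::make_zip_iterator(thrust::make_tuple(a.end(),   b.end())),
//     c.begin(),
//     add_components());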
Conclusion
Combining CUDA with Thrust allows you to focus on algorithmic logic rather than re-implementing common parallel patterns. Thrust significantly reduces the complexities of memory management, kernel launches, and performance tuning thanks to its high-level abstractions and optimized backends. With an STL-like interface, you can quickly implement GPU-accelerated algorithms that remain readable, maintainable, and efficient.
Yet, Thrust doesn’t lock you out of advanced control when it’s necessary. Custom kernels can be integrated after a quick extraction of raw pointers, providing a seamless workflow that blends the convenience of Thrust with the power of traditional CUDA.
As you move forward, you can dive further into advanced topics such as custom allocators, specialized iterators, or segmented algorithms. Combined with knowledge of CUDA’s intricacies and memory models, Thrust offers a robust framework for developing high-performance, parallel applications that harness the full potential of modern GPUs. Your journey might start with simple vector operations, but it can scale to sorting millions of elements, performing complex transformations, and handling domain-specific operations—all with concise and powerful code built on CUDA and Thrust.
Happy coding, and welcome to the world of accelerated computing!