Combining CUDA with Thrust: Simplifying Parallel Development
Introduction
GPU computing has revolutionized numerous fields where massive parallelism can significantly accelerate computations—from physics simulations, to image processing, to deep learning. NVIDIA’s CUDA (Compute Unified Device Architecture) provides a flexible and powerful model for programming GPUs. While CUDA’s flexibility is a major advantage, the lower-level APIs can be daunting for newcomers. Fortunately, NVIDIA’s Thrust library simplifies many GPU operations by offering a high-level interface for common parallel algorithms and data structures.
This blog post walks you step-by-step through combining CUDA with Thrust, starting from the basics and moving on to advanced applications. Whether you are just getting started or expanding an existing project to incorporate professional-level parallel solutions, this article offers guidance on effective strategies, memory management, and performance tuning.
Table of Contents
- CUDA Fundamentals
- Introducing Thrust
- Setting up Your Environment
- Basic Thrust Operations
- Memory Management in Thrust
- Sorting with Thrust
- Transformations and Custom Functors
- Combining Thrust with Custom CUDA Kernels
- Performance Considerations
- Advanced Topics
- Conclusion
CUDA Fundamentals
What is CUDA?
CUDA is a parallel computing platform created by NVIDIA. It grants developers direct access to the GPU’s virtual instruction set and memory, allowing the creation of massively parallel algorithms. With CUDA, you write “kernels,” which are functions executed on the GPU in parallel by thousands—or even millions—of lightweight threads. These kernels are compiled by NVCC (NVIDIA’s compiler), enabling them to run on compatible NVIDIA GPUs.
Core Concepts
- Kernels: Functions executed by multiple threads in parallel.
- Threads: The smallest unit of parallelism on the GPU.
- Blocks: Groups of threads arranged in 1D, 2D, or 3D.
- Grid: A collection of blocks that execute a kernel.
- Memory Hierarchy: Registers, shared memory, global memory, constant memory, and texture memory.
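To make these concepts concrete, here is a minimal sketch of a kernel and its launch configuration (the kernel name and sizes are illustrative, not from any particular codebase): each thread handles one element, threads are grouped into blocks, and the grid contains enough blocks to cover the whole array.

__global__ void scaleKernel(float* data, float factor, int n) {
    // Global index: block offset plus thread offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= factor;
    }
}

// Launch: enough 256-thread blocks to cover n elements
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);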
For smaller parallel tasks, you can often rely on libraries like Thrust to handle the “under the hood” complexities. But understanding how CUDA is structured will help you diagnose performance bottlenecks and write custom kernels if required.
Why CUDA?
• Massive Parallelism: GPUs excel at tasks that can be parallelized over many threads.
• Mature Ecosystem: The CUDA ecosystem has grown to include an abundance of libraries, debugging tools, and documentation.
• Flexibility: Even if you use high-level abstractions like Thrust, you can always write custom kernels for maximum control.
Introducing Thrust
What is Thrust?
Thrust is a C++ template library for CUDA-based parallel programming. Modeled after the Standard Template Library (STL), Thrust offers a familiar interface for algorithms like sort, reduce, transform, scan, and more. By providing high-level functions that map to efficient CUDA kernels, Thrust drastically reduces the amount of boilerplate code you need to write, letting you focus on your specific problem rather than low-level details.
Key Features
- STL-Like Interface: Use thrust::host_vector, thrust::device_vector, thrust::sort, etc. in a manner similar to standard C++ containers and algorithms.
- Host/Device Vectors: Manage memory automatically on the host (CPU) or device (GPU).
- Portability: Thrust backends include CUDA, OpenMP, and TBB (Threading Building Blocks). You can switch between them by modifying compiler flags.
- Integration with CUDA: You can interleave Thrust calls with custom CUDA kernels.
Benefits over Pure CUDA
• Less Boilerplate: Thrust reduces the overhead of writing your own allocation, copying, and synchronization code.
• Concise Code: STL-like syntax keeps your code maintainable and easier to read.
• Performance: Thrust’s algorithms are already optimized for the GPU, so common routines are highly efficient without extra tuning.
By mastering Thrust, you can achieve a good balance of productivity and performance—often without needing to write custom CUDA kernels. Still, when specialized operations are necessary, you can freely mix in your custom kernels.
Setting up Your Environment
Before diving into examples, ensure you have a proper development environment:
- NVIDIA GPU: A CUDA-compatible graphics card with supported drivers.
- CUDA Toolkit: Includes the NVCC compiler and the necessary libraries. Make sure to install the appropriate version for your system.
- C++ Compiler: Required to compile host code and call NVCC for device code. Common choices include GCC or Clang on Linux, and MSVC on Windows.
- IDE or Build System: While you can compile from the command line, an IDE like Visual Studio or CLion can simplify the development process. Alternatively, use a build system like CMake.
Sample Build Command
If you have a file named thrust_example.cu, you might compile it with a command like:
nvcc thrust_example.cu -o thrust_example
To enable different backends or advanced features, refer to the Thrust documentation or pass additional flags, such as -Xcompiler -fopenmp with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP (for the OpenMP backend) or -std=c++14 for modern C++ features.
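If you build with CMake instead, a minimal sketch might look like the following (the project and file names are illustrative; CMake 3.18+ has first-class CUDA support):

cmake_minimum_required(VERSION 3.18)
project(thrust_example LANGUAGES CXX CUDA)

# Thrust ships with the CUDA Toolkit, so no extra dependency is needed
add_executable(thrust_example thrust_example.cu)
set_target_properties(thrust_example PROPERTIES CUDA_STANDARD 14)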
Basic Thrust Operations
One of Thrust’s most significant advantages is its STL-like interface. Let’s look at some core operations that will get you up and running quickly.
The Thrust Vector
Thrust provides both host and device vector types:
thrust::host_vector<T>
thrust::device_vector<T>
They behave similarly to std::vector<T>, but are specialized for host and device memory, respectively.
Simple Vector Example
Below is a very simple example demonstrating the creation of a device_vector, initializing it, and printing its contents. For printing, we usually copy the data back to the host (or use an alternative approach such as host iterators).
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
    // Create a device_vector of size 5
    thrust::device_vector<int> d_vec(5);

    // Fill the vector with values 0, 1, 2, 3, 4
    for (int i = 0; i < 5; i++) {
        d_vec[i] = i;
    }

    // Transfer data to the host
    thrust::host_vector<int> h_vec = d_vec;

    // Print on the host
    for (int i = 0; i < 5; i++) {
        std::cout << "Element " << i << ": " << h_vec[i] << std::endl;
    }

    return 0;
}
Host-to-Device Transfers
Notice that vectors automatically manage memory allocation and transfers for you. When you assign a device_vector to a host_vector, Thrust handles copying data from the GPU to the CPU behind the scenes.
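If you prefer to make the transfer explicit rather than relying on assignment, thrust::copy performs the same device-to-host copy. A minimal sketch reusing h_vec and d_vec from the example above (note that h_vec must already be sized to hold the data):

#include <thrust/copy.h>

// Explicit device-to-host copy, equivalent to the assignment h_vec = d_vec
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());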
Memory Management in Thrust
Thrust offers a simplified approach to memory management, but understanding the underlying concepts is important, especially when mixing in custom CUDA kernels.
Host and Device Vectors
As mentioned, Thrust provides:
- thrust::host_vector<T>: Stored in (and accessible from) CPU memory.
- thrust::device_vector<T>: Stored in GPU device memory.
Raw Pointers and Iterators
When advanced control is needed, Thrust vectors can give you raw pointers (device pointers) via the data() function:
thrust::device_vector<float> d_vec(100);

// Get a raw pointer to the data
float* raw_ptr = thrust::raw_pointer_cast(d_vec.data());

// Pass raw_ptr to custom CUDA kernels, if necessary
This pointer can be used in a kernel to operate on the data. However, you need to ensure synchronization and that the pointer remains valid (i.e., the device_vector persists throughout the kernel operation).
When to Use Thrust vs. Raw CUDA Memory
In many cases, Thrust is sufficient. However, you might prefer raw CUDA memory when you:
- Need advanced memory allocation patterns, or want to use pinned or unified memory.
- Require maximum control for performance-tuning.
- Have custom memory pooling logic.
Still, for most use cases, device_vector is an excellent, straightforward choice.
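Conversely, memory you allocate yourself with cudaMalloc can still be handed to Thrust algorithms by wrapping it in a thrust::device_ptr. A minimal sketch:

#include <thrust/device_ptr.h>
#include <thrust/fill.h>

int n = 100;
int* raw;
cudaMalloc((void**)&raw, n * sizeof(int));

// Wrap the raw allocation so Thrust algorithms can operate on it
thrust::device_ptr<int> dev_ptr(raw);
thrust::fill(dev_ptr, dev_ptr + n, 0);

cudaFree(raw);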
Sorting with Thrust
Sorting is one of the most common operations in data processing. Thrust’s sort function provides a powerful GPU-accelerated sort without writing your own kernel.
Example: Thrust Sort
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main() {
    // Create and fill a device vector
    thrust::device_vector<int> d_vec(5);
    d_vec[0] = 30;
    d_vec[1] = 10;
    d_vec[2] = 50;
    d_vec[3] = 20;
    d_vec[4] = 40;

    // Sort in ascending order
    thrust::sort(d_vec.begin(), d_vec.end());

    // Copy back to host and print
    thrust::host_vector<int> h_vec = d_vec;
    for (size_t i = 0; i < h_vec.size(); i++) {
        std::cout << h_vec[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
The code above:
- Declares a device_vector of integers.
- Assigns some unsorted values.
- Calls thrust::sort.
- Copies the sorted data to a host vector for printing.
Sorting in Descending Order
To sort in descending order, you can provide a custom comparator:
thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<int>());
Thrust also includes useful comparators like thrust::less<T> and thrust::greater<T>. Or, you can write your own for custom ordering logic.
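For example, a hand-written comparator can sort by magnitude, largest first. A minimal sketch (the struct name is ours):

struct abs_greater
{
    __host__ __device__
    bool operator()(int a, int b) const {
        // Compare absolute values, largest first
        return (a < 0 ? -a : a) > (b < 0 ? -b : b);
    }
};

// thrust::sort(d_vec.begin(), d_vec.end(), abs_greater());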
Sorting Pairs (Key-Value)
If you need to sort keys with associated values, Thrust provides sort_by_key. For instance:
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main() {
    thrust::device_vector<int> keys(5);
    thrust::device_vector<float> values(5);

    keys[0] = 4;  values[0] = 1.2f;
    keys[1] = 2;  values[1] = 3.4f;
    keys[2] = 5;  values[2] = 0.1f;
    keys[3] = 1;  values[3] = 2.9f;
    keys[4] = 3;  values[4] = 4.5f;

    // Sort by key in ascending order
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

    // ...
}
After this sort, the entries in values will have been reordered to match the new ordering in keys.
Transformations and Custom Functors
Overview of Transform
A common parallel operation is applying a function to each element of a vector, generating a corresponding output in either the same or a different data set. In Thrust, use thrust::transform for this:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    thrust::device_vector<int> d_input(5);
    thrust::device_vector<int> d_output(5);

    // Initialize input
    for (int i = 0; i < 5; i++) {
        d_input[i] = i;
    }

    // Square each element: output[i] = input[i]^2
    thrust::transform(d_input.begin(), d_input.end(),
                      d_output.begin(),
                      thrust::square<int>()); // built-in functor from <thrust/functional.h>

    // Copy to host for printing
    thrust::host_vector<int> h_output = d_output;
    for (int i = 0; i < 5; i++) {
        std::cout << h_output[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
Thrust provides a set of predefined functors in <thrust/functional.h> (like plus<T>, multiplies<T>, etc.). You can also define your own.
Writing a Custom Functor
Out-of-the-box functors might not cover everything. Here’s how to define one:
struct multiply_by_constant
{
    const int c;

    multiply_by_constant(int _c) : c(_c) {}

    __host__ __device__
    int operator()(const int& x) const {
        return x * c;
    }
};
- Mark the functor with both __host__ and __device__ to ensure it’s callable from GPU code.
- In the constructor, save any constants you need.
- Define operator() to implement your transformation logic.
You can then call thrust::transform with an instance of your functor:
thrust::transform(d_input.begin(), d_input.end(), d_output.begin(), multiply_by_constant(5));
Combining Thrust with Custom CUDA Kernels
Even with Thrust, there may be times you want a custom kernel. Thrust allows you to interleave your own GPU kernel code with Thrust operations. A typical workflow might be:
- Create or update a device_vector using Thrust operations.
- Extract a raw pointer with thrust::raw_pointer_cast.
- Launch a custom kernel.
- Further manipulate or read results with Thrust.
Example Workflow
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <iostream>

// Custom kernel: each thread increments one element
__global__ void addOneKernel(int* d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_data[idx] += 1;
    }
}

int main() {
    // Step 1: Create device_vector
    thrust::device_vector<int> d_vec(5, 0);

    // Step 2: Set initial values using Thrust
    thrust::sequence(d_vec.begin(), d_vec.end(), 0);

    // Step 3: Launch custom kernel
    int* raw_ptr = thrust::raw_pointer_cast(d_vec.data());
    addOneKernel<<<1, 5>>>(raw_ptr, d_vec.size());
    cudaDeviceSynchronize();

    // Step 4: Print results
    thrust::host_vector<int> h_vec = d_vec;
    for (size_t i = 0; i < h_vec.size(); i++) {
        std::cout << h_vec[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
In this example:
- We create d_vec of size 5, initialized to zero.
- Assign values using thrust::sequence(), which populates the vector with consecutive values: 0, 1, 2, 3, 4.
- Extract the raw pointer and run a custom kernel that increments each element by 1.
- Finally, we transfer the data back to the host and print.
- Finally, we transfer the data back to the host and print.
Performance Considerations
While Thrust usually yields strong performance out of the box, you can improve efficiency in several ways.
Common Bottlenecks
- Memory Transfers: Moving data between host and device is expensive. Aim to minimize frequent copying.
- Kernel Launch Overheads: Too many small kernel launches can reduce performance; consider batching work or using Thrust algorithms that operate on large chunks.
- Non-Coalesced Memory Access: Ensure data is aligned and accessed in a GPU-friendly way.
Potential Optimizations
- Use Unified Memory (if applicable) for simpler data movement.
- Use Streams to overlap memory transfers with computations.
- Reuse Vectors: Instead of creating/destroying vectors often, resize them as needed.
- Benchmark: Evaluate each step to understand where time is spent.
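For the benchmarking advice above, CUDA events measure GPU-side time without host timers. A minimal sketch timing a sort (the vector size is arbitrary, and the vector is reused rather than reallocated):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <iostream>

int main() {
    thrust::device_vector<int> d_vec(1 << 20);
    // ... fill d_vec with real data ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::sort(d_vec.begin(), d_vec.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::cout << "sort took " << ms << " ms" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}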
Thrust Algorithm Complexity
The table below shows complexities for some common Thrust operations:
| Operation | Complexity |
|---|---|
| sort | O(N log N) |
| reduce | O(N) |
| scan | O(N) |
| transform | O(N) |
| gather/scatter | O(N) |
These align with typical complexities known from CPU algorithms, but the GPU-based implementations are optimized for parallel performance.
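For instance, a full O(N) reduction is a single call. A minimal sketch (assuming d_vec is an existing thrust::device_vector<int>):

#include <thrust/reduce.h>
#include <thrust/functional.h>

// Sum all elements of d_vec on the GPU
int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());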
Advanced Topics
Thrust supports many advanced features and patterns. Below are a few that you might explore as your expertise grows.
Scatter and Gather
- Scatter: Writes elements from an input range into a new range at indices specified by a scatter map.
- Gather: The reverse operation of scatter, collecting data from different indices.
Example gather call:
#include <thrust/device_vector.h>
#include <thrust/gather.h>

thrust::device_vector<int> source(5);
thrust::device_vector<int> map(5);
thrust::device_vector<int> result(5);

// Set data
// ...

// result[i] = source[map[i]]
thrust::gather(map.begin(), map.end(), source.begin(), result.begin());
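The scatter direction is symmetric, writing source[i] to result[map[i]]. A sketch using the same vectors:

#include <thrust/scatter.h>

// result[map[i]] = source[i]
thrust::scatter(source.begin(), source.end(), map.begin(), result.begin());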
Segmented Reductions
For more complex operations (e.g., summing values within segments of a vector), Thrust offers segmented reduction patterns such as thrust::reduce_by_key. You can also combine transform_iterator with reduction algorithms for efficient solutions to these kinds of partitioned problems.
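A minimal reduce_by_key sketch, assuming the keys are already grouped:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    int h_keys[]   = {0, 0, 1, 1, 1};
    int h_values[] = {1, 2, 3, 4, 5};
    thrust::device_vector<int> keys(h_keys, h_keys + 5);
    thrust::device_vector<int> values(h_values, h_values + 5);

    thrust::device_vector<int> keys_out(5);
    thrust::device_vector<int> sums(5);

    // Sums each run of equal consecutive keys:
    // keys_out = {0, 1}, sums = {3, 12}
    thrust::reduce_by_key(keys.begin(), keys.end(), values.begin(),
                          keys_out.begin(), sums.begin());
    return 0;
}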
Custom Iterators
Thrust’s flexible iterator framework includes:
- Permutation Iterators: Access elements in a permuted order without copying data.
- Zip Iterators: Combine multiple ranges into one, effectively creating tuples of elements.
- Transform Iterators: Apply a transformation on-the-fly as you traverse the data.
For instance, a zip_iterator can help you operate on two vectors in a single pass.
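A minimal sketch of that pattern (the functor name is ours):

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

struct add_components
{
    __host__ __device__
    float operator()(const thrust::tuple<float, float>& t) const {
        return thrust::get<0>(t) + thrust::get<1>(t);
    }
};

// c[i] = a[i] + b[i] in a single pass:
// thrust::transform(
//     thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin())),
//     thrust::make_zip_iterator(thrust::make_tuple(a.end(),   b.end())),
//     c.begin(),
//     add_components());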
Conclusion
Combining CUDA with Thrust allows you to focus on algorithmic logic rather than re-implementing common parallel patterns. Thrust significantly reduces the complexities of memory management, kernel launches, and performance tuning thanks to its high-level abstractions and optimized backends. With an STL-like interface, you can quickly implement GPU-accelerated algorithms that remain readable, maintainable, and efficient.
Yet, Thrust doesn’t lock you out of advanced control when it’s necessary. Custom kernels can be integrated after a quick extraction of raw pointers, providing a seamless workflow that blends the convenience of Thrust with the power of traditional CUDA.
As you move forward, you can dive further into advanced topics such as custom allocators, specialized iterators, or segmented algorithms. Combined with knowledge of CUDA’s intricacies and memory models, Thrust offers a robust framework for developing high-performance, parallel applications that harness the full potential of modern GPUs. Your journey might start with simple vector operations, but it can scale to sorting millions of elements, performing complex transformations, and handling domain-specific operations—all with concise and powerful code built on CUDA and Thrust.
Happy coding, and welcome to the world of accelerated computing!