From Threads to Blocks: Fundamental CUDA Concepts Explained
Welcome to this comprehensive guide on understanding CUDA’s core concepts—from the smallest unit of computation (the thread) all the way to large-scale GPU grid structures. This blog post aims to walk you step by step through the basics of GPU programming, demystify essential terminologies, and provide practical code snippets. Whether you’re new to CUDA or looking to refine your understanding, this is the place to start.
Table of Contents
- Why GPU Computing?
- A First Look at CUDA
- Threads and Warps
- Blocks: Grouping Threads for Parallel Execution
- Grids: Organizing the Execution Space
- Memory Hierarchy
- Launching a Kernel
- A Practical Example: Vector Addition
- Shared Memory for Fast Communication
- Synchronization and Atomic Operations
- Streams and Concurrency
- Texture and Constant Memory
- Advanced Concepts: Dynamic Parallelism and Unified Memory
- Performance Optimization and Profiling
- Conclusion and Further Reading
Why GPU Computing?
Traditionally, computations have run on CPUs. Modern CPUs outperform their predecessors by increasing clock speed and adding multiple cores. However, GPUs (Graphics Processing Units) have taken a different route: massive parallelism. Instead of a few powerful cores, GPUs include hundreds or even thousands of simpler cores capable of handling a large number of concurrent threads.
This design is especially well-suited for tasks that can be broken down into parallel workloads—such as graphics rendering, matrix multiplication, and many scientific simulations. By offloading compute-intensive tasks to a GPU, developers often achieve speed-ups measured in multiples (or even orders of magnitude) compared to running on a CPU alone.
A First Look at CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that exposes GPU functionality for general-purpose computing. It extends C/C++ (and other languages) with keywords and constructs dedicated to parallel execution.
Here are a few fundamental ideas in CUDA:
- Host vs. Device: The CPU is commonly referred to as the “host,” while the GPU is called the “device.”
- Kernels: Special functions, qualified with the `__global__` keyword, that run on the GPU. When you launch a kernel, you spawn many parallel threads on the device.
- Thread Hierarchies: You define how many threads to create, how they are grouped into blocks, and how those blocks form a grid.
By understanding threads, blocks, and grids, you can effectively harness the computational power of modern GPUs.
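For a first taste, here is a minimal sketch of a kernel definition and launch. The kernel name and launch configuration are purely illustrative, and device-side `printf` assumes a reasonably modern GPU:

```cpp
#include <cstdio>

// A trivial kernel: every thread reports its own coordinates.
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each (8 threads in total)
    helloKernel<<<2, 4>>>();
    // Kernel launches are asynchronous; wait for the GPU to finish
    cudaDeviceSynchronize();
    return 0;
}
```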
Threads and Warps
Threads: The Smallest Unit
A thread is the smallest unit of execution on a GPU. Each thread runs a particular instance of a kernel. Compared to a CPU thread, GPU threads are far more lightweight, and a single kernel launch can create thousands or even millions of them.
Warps: A Hardware Concept
When you request a certain number of threads, the GPU hardware will schedule them in groups called warps (typically of size 32 threads on NVIDIA GPUs). All threads in a warp execute the same instruction simultaneously (SIMT, or Single Instruction, Multiple Threads). Divergence within a warp (e.g., divergent `if` statements) can reduce efficiency.
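As a hedged illustration (the kernel and branch condition are chosen purely for demonstration), the following kernel forces the two halves of each warp down different branches, so the hardware executes both paths one after the other with inactive threads masked off:

```cpp
__global__ void divergentKernel(int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32; // position within the warp

    // Threads of the same warp take different branches, so the warp
    // runs both paths serially instead of fully in parallel.
    if (lane < 16) {
        out[idx] = idx * 2;
    } else {
        out[idx] = idx * 3;
    }
}
```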
Blocks: Grouping Threads for Parallel Execution
Why Blocks Matter
Threads are grouped into blocks. A block is an array (1D, 2D, or 3D) of threads, and it provides:
- A rich set of thread indexing capabilities.
- Shared memory for better data sharing among threads in the same block.
- Synchronization mechanisms such as `__syncthreads()`.
Thread Indexing Within a Block
Each thread within a block has an ID accessible via CUDA built-in variables like:
- `threadIdx.x`, `threadIdx.y`, `threadIdx.z` (the thread’s coordinate within the block).
- `blockDim.x`, `blockDim.y`, `blockDim.z` (the block’s size along each dimension).
Typically, you compute a global index when accessing data in memory:
```cpp
__global__ void myKernel(float *data) {
    // Compute the global thread index for a 1D grid
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[globalThreadId];
    // ... do something with value
}
```
In this example, `globalThreadId` uniquely identifies each thread across the entire grid. For multi-dimensional scenarios, you extend this logic with `blockIdx.y`, `threadIdx.y`, and so on.
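For instance, a 2D variant might look like the sketch below (the row-major layout, the kernel name, and the bounds check are assumptions added for illustration):

```cpp
// Sketch of 2D indexing over a row-major width x height array.
__global__ void myKernel2D(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int globalThreadId = row * width + col; // row-major linear index
        float value = data[globalThreadId];
        // ... do something with value
    }
}
```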
Block Size Considerations
Selecting the right block size is crucial for performance. Key tips:
- Typically, you want each block to have a number of threads that is a multiple of the warp size (32).
- Common block sizes include 128, 256, or 512 threads per block.
- The maximum number of threads per block is GPU-dependent (up to 1024 on many modern GPUs).
Grids: Organizing the Execution Space
While a block represents a cluster of threads, a grid consists of one or more blocks. Like blocks, a grid can be 1D, 2D, or 3D.
Grid Dimensions
- `gridDim.x`, `gridDim.y`, and `gridDim.z` store the number of blocks along each dimension.
- `blockIdx.x`, `blockIdx.y`, and `blockIdx.z` identify the block’s position within the grid.
Conceptually:
- You define how many blocks you want in your grid.
- You define how many threads go in each block.
For instance, suppose you have 1024 elements to process and decide to run 256 threads per block. That means you need 4 blocks in total for a 1D arrangement:
```cpp
dim3 blocks(4);
dim3 threads(256);
myKernel<<<blocks, threads>>>(deviceData);
```
Here, `gridDim.x = 4`, `blockDim.x = 256`, and the total number of threads is 4 * 256 = 1024.
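The same rounding idea extends to 2D. For an image-like domain (the dimensions below are hypothetical, and `myKernel2D` is the indexing sketch from earlier), you round each grid dimension up so the blocks cover the whole domain:

```cpp
// Cover a 1920 x 1080 domain with 16 x 16 thread blocks.
int width = 1920, height = 1080;
dim3 threads2D(16, 16);
dim3 blocks2D((width + threads2D.x - 1) / threads2D.x,
              (height + threads2D.y - 1) / threads2D.y);
myKernel2D<<<blocks2D, threads2D>>>(deviceData, width, height);
```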
Memory Hierarchy
CUDA exposes several memory spaces with different performance characteristics. Understanding these is vital to writing efficient code.
| Memory Space | Scope | Access Time | Typical Usage |
|---|---|---|---|
| Global Memory | All threads in the grid | High latency | Largest space; data typically resides here |
| Shared Memory | Threads within a block | Low latency | Shared data reuse within the same block |
| Local Memory | Individual threads | High latency | Private storage for register spills |
| Registers | Individual threads | Very low latency | Very fast but limited capacity |
| Constant Memory | Read-only for GPU threads | Faster than global (cached) | Small read-only data |
| Texture Memory | Read-only, specialized | Cached | Often used for 2D/3D data with interpolation |
Global Memory
The largest and slowest memory space. Kernel arguments and large arrays often reside here.
Shared Memory
A fast, on-chip memory shared by threads within the same block. Proper usage can significantly improve performance, but it’s limited in size (commonly tens of kilobytes per block).
Registers
Each thread has access to a set of registers. They are extremely fast but limited. Overusing registers might spill data into local memory, which is stored in global memory.
Launching a Kernel
A kernel launch in CUDA uses a special syntax:
```cpp
myKernel<<<gridDim, blockDim>>>(args...);
```
- `gridDim` specifies how many blocks to launch.
- `blockDim` specifies how many threads per block.
After the triple angle brackets, you pass the actual arguments for the kernel function. Let’s break down an example:
```cpp
#include <iostream>

__global__ void exampleKernel(int *array, int value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = value + idx;
}

int main() {
    int n = 1024;
    size_t size = n * sizeof(int);

    // Allocate host memory
    int *h_array = (int*)malloc(size);

    // Allocate device memory
    int *d_array;
    cudaMalloc((void**)&d_array, size);

    // Define grid and block dimensions
    dim3 blocks(4);
    dim3 threads(256);

    // Launch the kernel
    exampleKernel<<<blocks, threads>>>(d_array, 10);

    // Copy data back to host
    cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);

    // Check results
    for (int i = 0; i < 10; ++i) {
        std::cout << "h_array[" << i << "] = " << h_array[i] << std::endl;
    }

    // Cleanup
    free(h_array);
    cudaFree(d_array);
    return 0;
}
```
This code:
- Allocates memory on both the host and the device.
- Launches a kernel with 4 blocks of 256 threads to fill an integer array of length 1024 with the pattern `value + idx`.
- Copies the results back and verifies them.
Remember to handle any CUDA errors (e.g., using `cudaGetLastError()` or custom error-checking macros).
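One common pattern (the `CUDA_CHECK` name is our own convention, not an official CUDA API) wraps each runtime call and explicitly checks kernel launches:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking macro; prints the error string, file, and line, then exits.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&d_array, size));
//   exampleKernel<<<blocks, threads>>>(d_array, 10);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors from the kernel itself
```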
A Practical Example: Vector Addition
Vector addition is the “Hello World” of parallel programming. Let’s illustrate it in CUDA to show how threads, blocks, and grids come together in a real application.
The Kernel
```cpp
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
- We calculate `idx` based on the block and thread indices.
- We check `if (idx < n)` to avoid out-of-bounds memory access.
The Host Code
```cpp
#include <iostream>
#include <cuda.h>

__global__ void addVectors(const float *a, const float *b, float *c, int n);

int main() {
    int n = 1 << 20; // roughly 1 million elements
    size_t size = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);

    // Copy data to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Define block and grid dimensions
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;

    // Launch kernel
    addVectors<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify
    for (int i = 0; i < 10; i++) {
        std::cout << h_c[i] << " ";
    }
    std::cout << std::endl;

    // Cleanup
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
Explanation:
- We allocate and initialize two large vectors, `h_a` and `h_b`.
- We move them to GPU memory (`d_a`, `d_b`).
- We launch `addVectors` with enough blocks to cover the entire vector (`gridSize = (n + blockSize - 1) / blockSize`).
- We retrieve the output vector (`h_c`) from the device and validate a few elements.
Shared Memory for Fast Communication
One of CUDA’s powerful features is shared memory: a low-latency memory space accessible by all threads in a block. This can drastically reduce global memory accesses, improving performance.
Declaring Shared Memory
Within a `__global__` or `__device__` function, you can declare a shared memory array:
```cpp
__global__ void kernelWithShared(float *data) {
    __shared__ float tile[256]; // This is allocated per block
    int idx = threadIdx.x;
    tile[idx] = data[idx];
    __syncthreads();

    // Now all threads in this block can read tile[]
    float val = tile[(idx + 1) % 256];
    // ...
}
```
`__syncthreads()` is crucial to ensure all writes to shared memory are visible to all threads in the block.
When to Use Shared Memory
- The data must be reused multiple times within a block (see the reduction sketch after this list).
- The size of data is within the hardware’s shared memory limit (often 48KB or 96KB per multiprocessor, depending on configuration and GPU generation).
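A classic example is a block-level reduction, where each loaded value is read several times as partial sums are combined. The sketch below assumes a launch with 256 threads per block (a power of two); the kernel and names are illustrative:

```cpp
__global__ void blockSum(const float *in, float *blockResults, int n) {
    __shared__ float partial[256]; // must match the block size used at launch

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load one element per thread (0.0f for out-of-range threads)
    partial[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: each step halves the active threads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's sum
    if (tid == 0) {
        blockResults[blockIdx.x] = partial[0];
    }
}
```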
Synchronization and Atomic Operations
Thread Synchronization
CUDA provides multiple mechanisms for synchronization:
- `__syncthreads()` ensures all threads in a block reach this point before continuing.
- `__syncwarp()` synchronizes the threads in a warp (on GPUs with compute capability >= 7.0, you can specify a mask).
Going beyond block-level synchronization typically requires splitting the operations into multiple kernels or using more advanced concurrency features.
Atomic Operations
When multiple threads need to update shared data concurrently, you can use atomic operations:
```cpp
__global__ void atomicAddKernel(int *array) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&array[0], idx);
}
```
CUDA’s `atomicAdd`, `atomicSub`, `atomicMax`, etc., ensure data integrity, but may reduce performance if contention is high.
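One common way to reduce that contention (a sketch under the assumption of a small, fixed bin count; the kernel and names are illustrative) is to accumulate into shared memory first and merge into global memory only once per block:

```cpp
#define NUM_BINS 64 // illustrative bin count

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int localBins[NUM_BINS];

    // Zero the per-block histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        localBins[i] = 0;
    }
    __syncthreads();

    // Each thread updates the cheaper shared-memory bins
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(&localBins[data[idx] % NUM_BINS], 1u);
    }
    __syncthreads();

    // Merge this block's bins into the global histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        atomicAdd(&bins[i], localBins[i]);
    }
}
```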
Streams and Concurrency
What is a Stream?
A stream is a sequence of operations (kernels, memory copies, etc.) that execute in order on the GPU. By default, work is submitted to the default stream (stream 0); kernel launches are asynchronous with respect to the host, but under the legacy default-stream behavior, work in stream 0 does not overlap with work in other streams.
Overlapping Operations
Using multiple streams allows operations to overlap, for example:
- Kernel execution in one stream,
- A memory copy in another stream,
- Or simply different kernels executing in parallel if resources permit.
```cpp
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Launch kernels in different streams
kernelA<<<grid, block, 0, s1>>>(...);
kernelB<<<grid, block, 0, s2>>>(...);

// Non-blocking if events or synchronization are not used
cudaMemcpyAsync(..., s1);
cudaMemcpyAsync(..., s2);

// Cleanup
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```
Effective use of streams can drive better GPU occupancy and overall throughput.
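As a more concrete sketch, overlapping transfers with compute requires pinned host memory and `cudaMemcpyAsync`; the chunk sizes, array names, and the `processChunk` kernel below are all hypothetical:

```cpp
// Process data in chunks, overlapping copies and compute across two streams.
const int numChunks = 2;
const int chunkElems = 1 << 20;
size_t chunkBytes = chunkElems * sizeof(float);

float *h_data, *d_data;
cudaMallocHost((void**)&h_data, numChunks * chunkBytes); // pinned memory enables async copies
cudaMalloc((void**)&d_data, numChunks * chunkBytes);

cudaStream_t streams[numChunks];
for (int i = 0; i < numChunks; ++i) {
    cudaStreamCreate(&streams[i]);
}

int blockSize = 256;
int gridSize = (chunkElems + blockSize - 1) / blockSize;

for (int i = 0; i < numChunks; ++i) {
    float *hChunk = h_data + i * chunkElems;
    float *dChunk = d_data + i * chunkElems;

    // Copy chunk i in, process it, and copy it back, all in stream i
    cudaMemcpyAsync(dChunk, hChunk, chunkBytes, cudaMemcpyHostToDevice, streams[i]);
    processChunk<<<gridSize, blockSize, 0, streams[i]>>>(dChunk, chunkElems); // hypothetical kernel
    cudaMemcpyAsync(hChunk, dChunk, chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

cudaDeviceSynchronize(); // wait for all streams to finish

for (int i = 0; i < numChunks; ++i) {
    cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);
cudaFree(d_data);
```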
Texture and Constant Memory
Constant Memory
Constant memory is cached read-only memory. If many threads read the same value from constant memory, the caching mechanism can reduce global memory bandwidth usage. You declare it like:
```cpp
__constant__ float constData[256];
```
And copy from host to device with:
```cpp
cudaMemcpyToSymbol(constData, hostData, size);
```
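Kernels then read `constData` like any other array; the kernel below is only a sketch:

```cpp
__global__ void scaleByConst(float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // All threads in the block read the same element, which is the
        // broadcast-friendly access pattern the constant cache is built for.
        out[idx] *= constData[blockIdx.x % 256];
    }
}
```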
Texture Memory
Texture memory is specialized and also cached, often used for 2D and 3D data. It provides built-in filtering and addressing modes. While it’s historically associated with graphics, it can also boost performance for certain data access patterns in GPGPU workloads.
Advanced Concepts: Dynamic Parallelism and Unified Memory
Dynamic Parallelism
With dynamic parallelism, kernels can launch other kernels directly from the GPU. For example:
```cpp
__global__ void childKernel() {
    // ...
}

__global__ void parentKernel() {
    // Launch child kernel from within the GPU
    childKernel<<<1, 32>>>();
}

int main() {
    // Launch the parent kernel
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
This feature can simplify complex workflows where parallel work spawns more parallel work. However, it can also introduce overhead and complicate resource management.
Unified Memory
Unified memory, available since CUDA 6, automatically manages data migration between the CPU and GPU. It simplifies memory handling:
```cpp
float *unifiedData;
cudaMallocManaged(&unifiedData, n * sizeof(float));
// Access from both host and device without explicit cudaMemcpy
```
But for performance-critical applications, manual memory management may yield better results.
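For completeness, a minimal self-contained sketch of the managed-memory workflow might look like this (the `fill` kernel is illustrative, and the prefetch hint assumes device 0 on a GPU that supports it):

```cpp
#include <iostream>
#include <cuda_runtime.h>

__global__ void fill(float *data, int n, float value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = value;
    }
}

int main() {
    int n = 1 << 20;
    float *unifiedData;
    cudaMallocManaged(&unifiedData, n * sizeof(float));

    // Optional hint: migrate the data to device 0 before the kernel runs
    cudaMemPrefetchAsync(unifiedData, n * sizeof(float), 0);

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    fill<<<gridSize, blockSize>>>(unifiedData, n, 3.0f);

    // Synchronize before touching the data on the host
    cudaDeviceSynchronize();
    std::cout << unifiedData[0] << std::endl;

    cudaFree(unifiedData);
    return 0;
}
```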
Performance Optimization and Profiling
Occupancy and Resource Considerations
- Occupancy is the ratio of warps resident on a streaming multiprocessor (SM) to the maximum number it can hold.
- You can tune thread block sizes, shared memory usage, and register usage to improve occupancy; the occupancy API sketched below can suggest a starting block size.
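As one aid, the runtime’s occupancy API can recommend a block size for a given kernel. The sketch below reuses the `myKernel`, `n`, and `deviceData` placeholders from earlier examples:

```cpp
// Ask the runtime for a block size that maximizes theoretical occupancy.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

// Round the grid up so it still covers all n elements.
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(deviceData);
```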
Coalesced Global Memory Access
Optimize global memory accesses such that consecutive threads access consecutive memory addresses. This is called coalescing and drastically improves bandwidth utilization.
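As a hedged illustration, the two kernels below move the same amount of data, but the first lets consecutive threads read consecutive addresses (coalesced), while the second strides through memory and wastes bandwidth:

```cpp
// Coalesced: thread i reads element i, so a warp touches one contiguous segment.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx];
    }
}

// Strided: consecutive threads read addresses `stride` elements apart,
// so each warp touches many memory segments for the same useful data.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int src = (idx * stride) % n;
        out[idx] = in[src];
    }
}
```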
Profiling Tools
NVIDIA provides several profiling and analysis tools:
- NVIDIA Nsight Compute: A low-level kernel profiler.
- NVIDIA Nsight Systems: A system-wide profiler to see how CPU and GPU tasks are scheduled over time.
Use these tools to find bottlenecks in memory bandwidth, compute, or other areas.
Conclusion and Further Reading
By now, you should have a solid grasp of CUDA’s core building blocks:
- Threads: The fundamental execution unit.
- Blocks: Collections of threads, which share memory and can synchronize.
- Grids: Organizations of blocks for large-scale parallel workloads.
- Memory Spaces: Global, shared, local, constant, texture—each designed for different purposes.
- Advanced Features: Streams, dynamic parallelism, and unified memory offer more control and flexibility.
The wonderful thing about CUDA is that it scales to many application domains—machine learning, computational physics, chemistry simulations, video processing, and more. Mastering threads and blocks is your first step; from there, you can delve into specialized topics like warp-level primitives, advanced memory optimizations, and multi-GPU setups for HPC clusters.
Further Reading and References
- NVIDIA’s official CUDA Toolkit Documentation
- Mark Harris’s blog posts on Parallel Forall
- “Programming Massively Parallel Processors: A Hands-on Approach” by David B. Kirk and Wen-mei W. Hwu
Best of luck, and happy coding in CUDA!