From Threads to Blocks: Fundamental CUDA Concepts Explained
Welcome to this comprehensive guide on understanding CUDA’s core concepts—from the smallest unit of computation (the thread) all the way to large-scale GPU grid structures. This blog post aims to walk you step by step through the basics of GPU programming, demystify essential terminologies, and provide practical code snippets. Whether you’re new to CUDA or looking to refine your understanding, this is the place to start.
Table of Contents
- Why GPU Computing?
- A First Look at CUDA
- Threads and Warps
- Blocks: Grouping Threads for Parallel Execution
- Grids: Organizing the Execution Space
- Memory Hierarchy
- Launching a Kernel
- A Practical Example: Vector Addition
- Shared Memory for Fast Communication
- Synchronization and Atomic Operations
- Streams and Concurrency
- Texture and Constant Memory
- Advanced Concepts: Dynamic Parallelism and Unified Memory
- Performance Optimization and Profiling
- Conclusion and Further Reading
Why GPU Computing?
Traditionally, computations have run on CPUs. Modern CPUs outperform their predecessors by increasing clock speed and adding multiple cores. However, GPUs (Graphics Processing Units) have taken a different route: massive parallelism. Instead of a few powerful cores, GPUs include hundreds or even thousands of simpler cores capable of handling a large number of concurrent threads.
This design is especially well-suited for tasks that can be broken down into parallel workloads—such as graphics rendering, matrix multiplication, and many scientific simulations. By offloading compute-intensive tasks to a GPU, developers often achieve speed-ups measured in multiples (or even orders of magnitude) compared to running on a CPU alone.
A First Look at CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that exposes GPU functionality for general-purpose computing. It extends C/C++ (and other languages) with keywords and constructs dedicated to parallel execution.
Here are a few fundamental ideas in CUDA:
- Host vs. Device: The CPU is commonly referred to as the “host,” while the GPU is called the “device.”
- Kernels: Special functions, qualified with the `__global__` keyword, that run on the GPU. When you launch a kernel, you spawn many parallel threads on the device.
- Thread Hierarchies: You define how many threads to create, how they are grouped into blocks, and how those blocks form a grid.
By understanding threads, blocks, and grids, you can effectively harness the computational power of modern GPUs.
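For a first taste, here is a minimal sketch of a kernel definition and launch. The kernel name and launch configuration are purely illustrative, and device-side `printf` assumes a reasonably modern GPU:

```cpp
#include <cstdio>

// A trivial kernel: every thread reports its own coordinates.
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each (8 threads in total)
    helloKernel<<<2, 4>>>();
    // Kernel launches are asynchronous; wait for the GPU to finish
    cudaDeviceSynchronize();
    return 0;
}
```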
Threads and Warps
Threads: The Smallest Unit
A thread is the smallest unit of execution on a GPU. Each thread runs a particular instance of a kernel. Compared to a CPU thread, GPU threads are far more lightweight, and a single kernel launch can create thousands or even millions of them.
Warps: A Hardware Concept
When you request a certain number of threads, the GPU hardware will schedule them in groups called warps (typically of size 32 threads on NVIDIA GPUs). All threads in a warp execute the same instruction simultaneously (SIMT, or Single Instruction, Multiple Threads). Divergence within a warp (e.g., divergent `if` statements) can reduce efficiency.
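As a hedged illustration (the kernel and branch condition are chosen purely for demonstration), the following kernel forces the two halves of each warp down different branches, so the hardware executes both paths one after the other with inactive threads masked off:

```cpp
__global__ void divergentKernel(int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32; // position within the warp

    // Threads of the same warp take different branches, so the warp
    // runs both paths serially instead of fully in parallel.
    if (lane < 16) {
        out[idx] = idx * 2;
    } else {
        out[idx] = idx * 3;
    }
}
```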
Blocks: Grouping Threads for Parallel Execution
Why Blocks Matter
Threads are grouped into blocks. A block is an array (1D, 2D, or 3D) of threads, and it provides:
- A rich set of thread indexing capabilities.
- Shared memory for better data sharing among threads in the same block.
- Synchronization mechanisms such as `__syncthreads()`.
Thread Indexing Within a Block
Each thread within a block has an ID accessible via CUDA built-in variables like:
- `threadIdx.x`, `threadIdx.y`, `threadIdx.z` (the thread’s coordinate within the block).
- `blockDim.x`, `blockDim.y`, `blockDim.z` (the block’s size along each dimension).
Typically, you compute a global index when accessing data in memory:
```cpp
__global__ void myKernel(float *data) {
    // Compute the global thread index for a 1D grid
    int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
    float value = data[globalThreadId];
    // ... do something with value
}
```
In this example, `globalThreadId` uniquely identifies each thread across the entire grid. For multi-dimensional scenarios, you extend this logic with `blockIdx.y`, `threadIdx.y`, and so on.
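For instance, a 2D variant might look like the sketch below (the row-major layout, the kernel name, and the bounds check are assumptions added for illustration):

```cpp
// Sketch of 2D indexing over a row-major width x height array.
__global__ void myKernel2D(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int globalThreadId = row * width + col; // row-major linear index
        float value = data[globalThreadId];
        // ... do something with value
    }
}
```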
Block Size Considerations
Selecting the right block size is crucial for performance. Key tips:
- Typically, you want each block to have a number of threads that is a multiple of the warp size (32).
- Common block sizes include 128, 256, or 512 threads per block.
- The maximum number of threads per block is GPU-dependent (up to 1024 on many modern GPUs).
Grids: Organizing the Execution Space
While a block represents a cluster of threads, a grid consists of one or more blocks. Like blocks, a grid can be 1D, 2D, or 3D.
Grid Dimensions
- `gridDim.x`, `gridDim.y`, and `gridDim.z` store the number of blocks along each dimension.
- `blockIdx.x`, `blockIdx.y`, and `blockIdx.z` identify the block’s position within the grid.
Conceptually:
- You define how many blocks you want in your grid.
- You define how many threads go in each block.
For instance, suppose you have 1024 elements to process and decide to run 256 threads per block. That means you need 4 blocks in total for a 1D arrangement:
```cpp
dim3 blocks(4);
dim3 threads(256);
myKernel<<<blocks, threads>>>(deviceData);
```
Here, `gridDim.x = 4`, `blockDim.x = 256`, and the total number of threads is 4 * 256 = 1024.
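The same rounding idea extends to 2D. For an image-like domain (the dimensions below are hypothetical, and `myKernel2D` is the indexing sketch from earlier), you round each grid dimension up so the blocks cover the whole domain:

```cpp
// Cover a 1920 x 1080 domain with 16 x 16 thread blocks.
int width = 1920, height = 1080;
dim3 threads2D(16, 16);
dim3 blocks2D((width + threads2D.x - 1) / threads2D.x,
              (height + threads2D.y - 1) / threads2D.y);
myKernel2D<<<blocks2D, threads2D>>>(deviceData, width, height);
```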
Memory Hierarchy
CUDA exposes several memory spaces with different performance characteristics. Understanding these is vital to writing efficient code.
| Memory Space | Scope | Access Time | Typical Usage |
|---|---|---|---|
| Global Memory | All threads in the grid | High latency | Largest space; data typically resides here |
| Shared Memory | Threads within a block | Low latency | Shared data reuse within the same block |
| Local Memory | Individual threads | High latency | Private storage for register spills |
| Registers | Individual threads | Very low latency | Very fast but limited capacity |
| Constant Memory | Read-only for GPU threads | Faster than global (cached) | Small read-only data |
| Texture Memory | Read-only, specialized | Cached | Often used for 2D/3D data with interpolation |
Global Memory
The largest and slowest memory space. Kernel arguments and large arrays often reside here.
Shared Memory
A fast, on-chip memory shared by threads within the same block. Proper usage can significantly improve performance, but it’s limited in size (commonly tens of kilobytes per block).
Registers
Each thread has access to a set of registers. They are extremely fast but limited. Overusing registers might spill data into local memory, which is stored in global memory.
Launching a Kernel
A kernel launch in CUDA uses a special syntax:
```cpp
myKernel<<<gridDim, blockDim>>>(args...);
```
- `gridDim` specifies how many blocks to launch.
- `blockDim` specifies how many threads per block.
After the triple angle brackets, you pass the actual arguments for the kernel function. Let’s break down an example:
```cpp
#include <iostream>

__global__ void exampleKernel(int *array, int value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    array[idx] = value + idx;
}

int main() {
    int n = 1024;
    size_t size = n * sizeof(int);

    // Allocate host memory
    int *h_array = (int*)malloc(size);

    // Allocate device memory
    int *d_array;
    cudaMalloc((void**)&d_array, size);

    // Define grid and block dimensions
    dim3 blocks(4);
    dim3 threads(256);

    // Launch the kernel
    exampleKernel<<<blocks, threads>>>(d_array, 10);

    // Copy data back to host
    cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);

    // Check results
    for (int i = 0; i < 10; ++i) {
        std::cout << "h_array[" << i << "] = " << h_array[i] << std::endl;
    }

    // Cleanup
    free(h_array);
    cudaFree(d_array);
    return 0;
}
```
This code:
- Allocates memory on both the host and the device.
- Launches a kernel with 4 blocks of 256 threads to fill an integer array of length 1024 with the pattern `value + idx`.
- Copies the results back and verifies them.
Remember to handle any CUDA errors (e.g., using `cudaGetLastError()` or custom error-checking macros).
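One common pattern (the `CUDA_CHECK` name is our own convention, not an official CUDA API) wraps each runtime call and explicitly checks kernel launches:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking macro; prints the error string, file, and line, then exits.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&d_array, size));
//   exampleKernel<<<blocks, threads>>>(d_array, 10);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors from the kernel itself
```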
A Practical Example: Vector Addition
Vector addition is the “Hello World” of parallel programming. Let’s illustrate it in CUDA to show how threads, blocks, and grids come together in a real application.
The Kernel
```cpp
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
- We calculate `idx` based on the block and thread indices.
- We check `if (idx < n)` to avoid out-of-bounds memory access.
The Host Code
```cpp
#include <iostream>
#include <cuda.h>

__global__ void addVectors(const float *a, const float *b, float *c, int n);

int main() {
    int n = 1 << 20; // roughly 1 million elements
    size_t size = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);

    // Copy data to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Define block and grid dimensions
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;

    // Launch kernel
    addVectors<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Verify
    for (int i = 0; i < 10; i++) {
        std::cout << h_c[i] << " ";
    }
    std::cout << std::endl;

    // Cleanup
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
Explanation:
- We allocate and initialize two large vectors, `h_a` and `h_b`.
- We move them to GPU memory (`d_a`, `d_b`).
- We launch `addVectors` with enough blocks to cover the entire vector (`gridSize = (n + blockSize - 1) / blockSize`).
- We retrieve the output vector (`h_c`) from the device and validate a few elements.
Shared Memory for Fast Communication
One of CUDA’s powerful features is shared memory: a low-latency memory space accessible by all threads in a block. This can drastically reduce global memory accesses, improving performance.
Declaring Shared Memory
Within a `__global__` or `__device__` function, you can declare a shared memory array:
```cpp
__global__ void kernelWithShared(float *data) {
    __shared__ float tile[256]; // This is allocated per block
    int idx = threadIdx.x;
    tile[idx] = data[idx];
    __syncthreads();

    // Now all threads in this block can read tile[]
    float val = tile[(idx + 1) % 256];
    // ...
}
```
`__syncthreads()` is crucial to ensure all writes to shared memory are visible to all threads in the block.
When to Use Shared Memory
- The data must be reused multiple times within a block (see the reduction sketch after this list).
- The size of data is within the hardware’s shared memory limit (often 48KB or 96KB per multiprocessor, depending on configuration and GPU generation).
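A classic example is a block-level reduction, where each loaded value is read several times as partial sums are combined. The sketch below assumes a launch with 256 threads per block (a power of two); the kernel and names are illustrative:

```cpp
__global__ void blockSum(const float *in, float *blockResults, int n) {
    __shared__ float partial[256]; // must match the block size used at launch

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load one element per thread (0.0f for out-of-range threads)
    partial[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: each step halves the active threads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's sum
    if (tid == 0) {
        blockResults[blockIdx.x] = partial[0];
    }
}
```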
Synchronization and Atomic Operations
Thread Synchronization
CUDA provides multiple mechanisms for synchronization:
- `__syncthreads()` ensures all threads in a block reach this point before continuing.
- `__syncwarp()` synchronizes the threads in a warp (on GPUs with compute capability >= 7.0, you can specify a mask).
Going beyond block-level synchronization typically requires splitting the operations into multiple kernels or using more advanced concurrency features.
Atomic Operations
When multiple threads need to update shared data concurrently, you can use atomic operations:
```cpp
__global__ void atomicAddKernel(int *array) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&array[0], idx);
}
```
CUDA’s `atomicAdd`, `atomicSub`, `atomicMax`, etc., ensure data integrity, but may reduce performance if contention is high.
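One common way to reduce that contention (a sketch under the assumption of a small, fixed bin count; the kernel and names are illustrative) is to accumulate into shared memory first and merge into global memory only once per block:

```cpp
#define NUM_BINS 64 // illustrative bin count

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int localBins[NUM_BINS];

    // Zero the per-block histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        localBins[i] = 0;
    }
    __syncthreads();

    // Each thread updates the cheaper shared-memory bins
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(&localBins[data[idx] % NUM_BINS], 1u);
    }
    __syncthreads();

    // Merge this block's bins into the global histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        atomicAdd(&bins[i], localBins[i]);
    }
}
```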
Streams and Concurrency
What is a Stream?
A stream is a sequence of operations (kernels, memory copies, etc.) that execute in order on the GPU. By default, work is submitted to the default stream (stream 0); kernel launches are asynchronous with respect to the host, but under the legacy default-stream behavior, work in stream 0 does not overlap with work in other streams.
Overlapping Operations
Using multiple streams allows operations to overlap, for example:
- Kernel execution in one stream,
- A memory copy in another stream,
- Or simply different kernels executing in parallel if resources permit.
```cpp
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Launch kernels in different streams
kernelA<<<grid, block, 0, s1>>>(...);
kernelB<<<grid, block, 0, s2>>>(...);

// Non-blocking if events or synchronization are not used
cudaMemcpyAsync(..., s1);
cudaMemcpyAsync(..., s2);

// Cleanup
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```
Effective use of streams can drive better GPU occupancy and overall throughput.
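As a more concrete sketch, overlapping transfers with compute requires pinned host memory and `cudaMemcpyAsync`; the chunk sizes, array names, and the `processChunk` kernel below are all hypothetical:

```cpp
// Process data in chunks, overlapping copies and compute across two streams.
const int numChunks = 2;
const int chunkElems = 1 << 20;
size_t chunkBytes = chunkElems * sizeof(float);

float *h_data, *d_data;
cudaMallocHost((void**)&h_data, numChunks * chunkBytes); // pinned memory enables async copies
cudaMalloc((void**)&d_data, numChunks * chunkBytes);

cudaStream_t streams[numChunks];
for (int i = 0; i < numChunks; ++i) {
    cudaStreamCreate(&streams[i]);
}

int blockSize = 256;
int gridSize = (chunkElems + blockSize - 1) / blockSize;

for (int i = 0; i < numChunks; ++i) {
    float *hChunk = h_data + i * chunkElems;
    float *dChunk = d_data + i * chunkElems;

    // Copy chunk i in, process it, and copy it back, all in stream i
    cudaMemcpyAsync(dChunk, hChunk, chunkBytes, cudaMemcpyHostToDevice, streams[i]);
    processChunk<<<gridSize, blockSize, 0, streams[i]>>>(dChunk, chunkElems); // hypothetical kernel
    cudaMemcpyAsync(hChunk, dChunk, chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}

cudaDeviceSynchronize(); // wait for all streams to finish

for (int i = 0; i < numChunks; ++i) {
    cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);
cudaFree(d_data);
```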
Texture and Constant Memory
Constant Memory
Constant memory is cached read-only memory. If many threads read the same value from constant memory, the caching mechanism can reduce global memory bandwidth usage. You declare it like:
```cpp
__constant__ float constData[256];
```
And copy from host to device with:
```cpp
cudaMemcpyToSymbol(constData, hostData, size);
```
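Kernels then read `constData` like any other array; the kernel below is only a sketch:

```cpp
__global__ void scaleByConst(float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // All threads in the block read the same element, which is the
        // broadcast-friendly access pattern the constant cache is built for.
        out[idx] *= constData[blockIdx.x % 256];
    }
}
```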
Texture Memory
Texture memory is specialized and also cached, often used for 2D and 3D data. It provides built-in filtering and addressing modes. While it’s historically associated with graphics, it can also boost performance for certain data access patterns in GPGPU workloads.
Advanced Concepts: Dynamic Parallelism and Unified Memory
Dynamic Parallelism
With dynamic parallelism, kernels can launch other kernels directly from the GPU. For example:
```cpp
__global__ void childKernel() {
    // ...
}

__global__ void parentKernel() {
    // Launch child kernel from within the GPU
    childKernel<<<1, 32>>>();
}

int main() {
    // Launch the parent kernel
    parentKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
This feature can simplify complex workflows where parallel work spawns more parallel work. However, it can also introduce overhead and complicate resource management.
Unified Memory
Unified memory, available since CUDA 6, automatically manages data migration between the CPU and GPU. It simplifies memory handling:
```cpp
float *unifiedData;
cudaMallocManaged(&unifiedData, n * sizeof(float));
// Access from both host and device without explicit cudaMemcpy
```
But for performance-critical applications, manual memory management may yield better results.
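For completeness, a minimal self-contained sketch of the managed-memory workflow might look like this (the `fill` kernel is illustrative, and the prefetch hint assumes device 0 on a GPU that supports it):

```cpp
#include <iostream>
#include <cuda_runtime.h>

__global__ void fill(float *data, int n, float value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = value;
    }
}

int main() {
    int n = 1 << 20;
    float *unifiedData;
    cudaMallocManaged(&unifiedData, n * sizeof(float));

    // Optional hint: migrate the data to device 0 before the kernel runs
    cudaMemPrefetchAsync(unifiedData, n * sizeof(float), 0);

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    fill<<<gridSize, blockSize>>>(unifiedData, n, 3.0f);

    // Synchronize before touching the data on the host
    cudaDeviceSynchronize();
    std::cout << unifiedData[0] << std::endl;

    cudaFree(unifiedData);
    return 0;
}
```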
Performance Optimization and Profiling
Occupancy and Resource Considerations
- Occupancy is the ratio of warps resident on a streaming multiprocessor (SM) to the maximum number it can hold.
- You can tune thread block sizes, shared memory usage, and register usage to improve occupancy; the occupancy API sketched below can suggest a starting block size.
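As one aid, the runtime’s occupancy API can recommend a block size for a given kernel. The sketch below reuses the `myKernel`, `n`, and `deviceData` placeholders from earlier examples:

```cpp
// Ask the runtime for a block size that maximizes theoretical occupancy.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

// Round the grid up so it still covers all n elements.
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(deviceData);
```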
Coalesced Global Memory Access
Optimize global memory accesses such that consecutive threads access consecutive memory addresses. This is called coalescing and drastically improves bandwidth utilization.
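As a hedged illustration, the two kernels below move the same amount of data, but the first lets consecutive threads read consecutive addresses (coalesced), while the second strides through memory and wastes bandwidth:

```cpp
// Coalesced: thread i reads element i, so a warp touches one contiguous segment.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx];
    }
}

// Strided: consecutive threads read addresses `stride` elements apart,
// so each warp touches many memory segments for the same useful data.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int src = (idx * stride) % n;
        out[idx] = in[src];
    }
}
```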
Profiling Tools
NVIDIA provides several profiling and analysis tools:
- NVIDIA Nsight Compute: A low-level kernel profiler.
- NVIDIA Nsight Systems: A system-wide profiler to see how CPU and GPU tasks are scheduled over time.
Use these tools to find bottlenecks in memory bandwidth, compute, or other areas.
Conclusion and Further Reading
By now, you should have a solid grasp of CUDA’s core building blocks:
- Threads: The fundamental execution unit.
- Blocks: Collections of threads, which share memory and can synchronize.
- Grids: Organizations of blocks for large-scale parallel workloads.
- Memory Spaces: Global, shared, local, constant, texture—each designed for different purposes.
- Advanced Features: Streams, dynamic parallelism, and unified memory offer more control and flexibility.
The wonderful thing about CUDA is that it scales to many application domains—machine learning, computational physics, chemistry simulations, video processing, and more. Mastering threads and blocks is your first step; from there, you can delve into specialized topics like warp-level primitives, advanced memory optimizations, and multi-GPU setups for HPC clusters.
Further Reading and References
- NVIDIA’s official CUDA Toolkit Documentation
- Mark Harris’s blog posts on Parallel Forall
- “Programming Massively Parallel Processors: A Hands-on Approach” by David B. Kirk and Wen-mei W. Hwu
Best of luck, and happy coding in CUDA!