
Launch into GPUs: A Newcomer’s Guide to CUDA#

Welcome to your one-stop guide to learning GPU programming with NVIDIA’s CUDA framework. In this blog post, we’ll start from the basics of GPU architecture, then shift into the fundamentals of CUDA, and finally venture into more advanced, professional-level topics. By the end of this article, you should have a strong foundation and a set of tools and insights for tackling parallel programming on NVIDIA GPUs.


Table of Contents#

  1. Introduction to GPUs and Parallel Computing
  2. CUDA Basics
  3. The CUDA Architecture
  4. Hello World in CUDA
  5. Thread Hierarchy and Memory Models
  6. Memory Spaces in CUDA
  7. Synchronization and Concurrency
  8. Advanced Concepts and Best Practices
  9. Using Libraries and Frameworks
  10. Debugging and Profiling
  11. Professional-Level Expansions
  12. Conclusion

Introduction to GPUs and Parallel Computing#

What is GPU Computing?#

A Graphics Processing Unit (GPU) is a specialized processor designed for large-scale parallel computation. Originally created to handle graphics and rendering for games and multimedia, GPUs have become powerful engines for general-purpose computing as well. With thousands of relatively simple cores that can perform millions of operations in parallel, GPUs shine in tasks where data can be processed concurrently (e.g., matrix multiplication, machine learning, image processing).

Why Parallelism?#

Multi-core CPUs also leverage parallelism, but typically on a smaller scale (e.g., 4, 8, 16 cores). GPUs can have hundreds or thousands of cores. One way to think about this difference is to compare a GPU and CPU in terms of design:

  • CPU focuses on latency optimization (a few general-purpose cores, huge cache, branch prediction).
  • GPU focuses on throughput (many specialized cores, smaller caches, massive parallel scheduling).

Because thread creation and scheduling are handled in hardware, a GPU can keep thousands of threads in flight cheaply, allowing large data sets to be processed simultaneously. The challenge, however, is to adapt algorithms to exploit this parallelism effectively.

Getting Started with CUDA#

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and application programming interface (API). With CUDA, you can write programs that execute on NVIDIA GPUs using an extension of C/C++ (and other supported languages).

To start CUDA development, you generally need:

  1. A system with an NVIDIA GPU.
  2. CUDA Toolkit installed (includes the compiler nvcc, libraries, and tools).

You can verify CUDA installation with:

nvcc --version

CUDA Basics#

The Programming Model#

The CUDA programming model extends standard C/C++ with a small set of keywords and runtime APIs. You write “kernels,” which are special functions executed on the GPU in parallel by many threads simultaneously.

The most basic workflow for a CUDA program is:

  1. Allocate memory on the GPU (device memory).
  2. Transfer data from the host (CPU) to the device.
  3. Launch the kernel.
  4. Transfer results back from the device to the host.
  5. Free the device memory.

Code Structure#

A simple CUDA program has two main parts:

  • Host code: Runs on the CPU; handles data transfer, kernel launches, memory management, etc.
  • Device code: The kernel functions running on the GPU in parallel.

The CUDA compiler (nvcc) takes care of both host and device code, generating an executable that can launch GPU kernels while still being called from the CPU.


The CUDA Architecture#

Hardware Anatomy#

NVIDIA GPUs are structured as an array of Streaming Multiprocessors (SMs). Each SM has multiple CUDA cores and other hardware resources. When you launch a kernel, you specify a grid of thread blocks, and each block is assigned to an SM. Within each block, threads execute in groups of 32 called “warps.”

  • SM (Streaming Multiprocessor): The fundamental hardware block that runs the threads.
  • Warp: A scheduling unit of 32 threads.

Threads in a warp share an instruction stream but may branch (leading to performance considerations known as “divergence”). Understanding warps and SMs is crucial, because you’ll want to optimize memory access and thread organization to align with this hardware reality.
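To make divergence concrete, here is a small sketch (the kernel name and operations are illustrative): when threads in the same warp take different branches, the warp executes both paths one after the other, masking off the inactive threads.

// Illustrative kernel showing divergent vs. non-divergent branching.
__global__ void divergenceExample(float* data, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= N) return;

    // Divergent: even and odd threads of the same warp take different paths,
    // so the warp serializes both branches.
    if (idx % 2 == 0) {
        data[idx] *= 2.0f;
    } else {
        data[idx] += 1.0f;
    }

    // Non-divergent alternative (assuming the block size is a multiple of 32):
    // branching at warp granularity keeps all 32 threads on the same path.
    if ((idx / 32) % 2 == 0) {
        data[idx] -= 0.5f;
    }
}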

Thread Hierarchy and Execution#

In CUDA, you organize threads in a three-level hierarchy:

  1. Grid: A collection of blocks.
  2. Block: A group of threads that can cooperate via shared memory and synchronization.
  3. Thread: A single execution instance of the kernel.

The shape of your blocks (1D, 2D, or 3D) and the number of blocks form the overall parallel execution configuration. Tuning these configurations for performance is part of the art of CUDA programming.


Hello World in CUDA#

Let’s jump into a simple “Hello World”–style kernel, though it won’t literally print “Hello World.” Instead, it showcases how to set up a kernel and run it.

Step-by-Step Explanation#

  1. Define a kernel as a __global__ function.
  2. Specify block and grid dimensions when launching the kernel.
  3. Use the built-in thread indexing variables (threadIdx, blockIdx, blockDim) to identify which thread is doing the work.

Example Code#

#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

// Kernel definition
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    // Global thread ID
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) {
        // Perform the addition
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    int N = 1024;
    size_t size = N * sizeof(float);

    // Allocate host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = static_cast<float>(2 * i);
    }

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Define kernel launch parameters
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel
    vectorAdd<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy results back to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Print a few sample results
    std::cout << "C[0] = " << h_C[0] << ", C[1] = " << h_C[1]
              << ", C[1023] = " << h_C[1023] << std::endl;

    // Clean up
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

A few key points:

  • __global__ indicates a kernel function launched from host code.
  • You calculate your thread’s index with:
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
  • You need to check bounds (i.e., idx < N) because some threads in the last block may fall outside the array.
  • <<<blocks, threadsPerBlock>>> is the kernel launch syntax.

Thread Hierarchy and Memory Models#

Grids and Blocks#

When you launch a kernel, you specify the grid dimensions (dim3 gridDim) and block dimensions (dim3 blockDim). Each block will contain a certain number of threads, specified by blockDim. Then you can use:

  • blockIdx.x, blockIdx.y, blockIdx.z: The block index within the grid.
  • threadIdx.x, threadIdx.y, threadIdx.z: The thread index within the block.
  • blockDim.x, blockDim.y, blockDim.z: The dimension (size) of each block.

Indexing#

You can generalize the indexing technique for 2D or 3D data. For example, in 2D:

int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

This way, you can account for two-dimensional arrays and assign one thread per element.
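Putting this together, a sketch of a 2D kernel and its launch configuration might look like the following (matrixAdd, width, and height are illustrative names; one thread handles one element of a row-major matrix):

// Illustrative 2D kernel: one thread per matrix element.
__global__ void matrixAdd(const float* A, const float* B, float* C,
                          int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int idx = row * width + col;   // row-major layout
        C[idx] = A[idx] + B[idx];
    }
}

// Host-side launch configuration:
dim3 block(16, 16);                    // 256 threads per block
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
matrixAdd<<<grid, block>>>(d_A, d_B, d_C, width, height);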


Memory Spaces in CUDA#

Understanding memory spaces is critical to achieving good performance.

Global Memory#

  • Large memory accessible by all threads (both for reading and writing).
  • High latency compared to quicker caches.
  • For large data sets (e.g. arrays, matrices).

Shared Memory#

  • Shared among threads within the same block.
  • Very fast on-chip memory (like a user-managed cache).
  • Helps reduce global memory accesses.

Registers#

  • Per-thread, low latency, used for local variables.
  • Extremely fast, but limited in size.

Constant Memory#

  • Read-only memory space optimized for broadcast accesses.
  • Good for constants that multiple threads need to read repeatedly.
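As a minimal sketch, a constant-memory buffer is declared at file scope with __constant__ and filled from the host with cudaMemcpyToSymbol (the names below are illustrative):

// Illustrative constant-memory buffer: readable by all threads, and reads of
// the same address across a warp are broadcast efficiently.
__constant__ float filterCoeffs[16];

// Host side: copy coefficients into constant memory before launching kernels.
float h_coeffs[16] = { /* ... filter weights ... */ };
cudaMemcpyToSymbol(filterCoeffs, h_coeffs, sizeof(h_coeffs));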

Texture/Surface Memory#

  • Specialized read-only memory with caching benefits for certain access patterns (e.g., 2D/3D sampling).

Example Table of Memory Spaces#

| Memory    | Scope          | Cache              | Typical Usage              |
| --------- | -------------- | ------------------ | -------------------------- |
| Global    | All threads    | L2                 | Main data arrays           |
| Shared    | Block-specific | N/A or L1          | Fast local data sharing    |
| Registers | Per-thread     | N/A                | Local thread variables     |
| Constant  | All threads    | Dedicated          | Broadcast read-only data   |
| Texture   | All threads    | Texturing hardware | Spatially correlated reads |

Your strategy should involve minimizing global memory accesses and making heavy use of shared memory for frequently reused data. Properly coalescing global memory accesses is also essential for performance.


Synchronization and Concurrency#

Thread Synchronization#

Within a block, threads can synchronize using __syncthreads(). This is a barrier that halts all threads in the block until everyone reaches that point. This is useful when:

  • Multiple threads need to collaborate on shared memory.
  • One thread needs to write data to shared memory before another thread can read it.

However, there’s no direct way to synchronize across blocks within a single kernel launch. For cross-block synchronization, you typically need to end the kernel launch or rely on advanced features like Cooperative Groups (on compatible hardware).
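As an illustration of the shared-memory pattern described above, here is a sketch of a block-level sum reduction (blockSum is an illustrative name; it assumes the block size is 256 and a power of two):

// Each block sums up to 256 input elements into one partial result.
__global__ void blockSum(const float* in, float* blockResults, int N) {
    __shared__ float cache[256];                 // one slot per thread

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();                             // make all writes visible to the block

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();                         // wait before the next round
    }

    if (threadIdx.x == 0) {
        blockResults[blockIdx.x] = cache[0];     // one partial sum per block
    }
}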

Atomic Operations#

When multiple threads attempt to update shared/global memory, you can use atomic operations to avoid race conditions. CUDA provides built-in atomic functions such as atomicAdd, atomicSub, atomicMin, atomicMax, etc., that ensure only one thread at a time updates the variable.
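For example, here is a sketch of a histogram kernel in which many threads may increment the same bin concurrently (histogram and numBins are illustrative names; input values are assumed non-negative):

// Each thread classifies one value and atomically bumps its bin,
// so concurrent updates to the same bin are never lost.
__global__ void histogram(const int* values, int* bins, int N, int numBins) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) {
        int bin = values[idx] % numBins;
        atomicAdd(&bins[bin], 1);
    }
}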

Streams and Concurrency#

CUDA streams allow concurrency between the CPU and GPU, or even among multiple GPU kernels. You can launch multiple kernels in different streams, and as long as resources are available, the GPU can run them concurrently.

Example of using streams:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Launch kernel on stream1
kernel1<<<grid, block, 0, stream1>>>(...);
// Launch another kernel on stream2
kernel2<<<grid, block, 0, stream2>>>(...);
// Optional: Sync
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

This kind of concurrency helps increase GPU utilization and overlap computations with data transfers.


Advanced Concepts and Best Practices#

Memory Coalescing#

Memory coalescing happens when the threads of a warp access contiguous memory addresses, allowing the hardware to combine their requests into as few transactions as possible. For best performance:

  • Align memory accesses with warp boundaries (regions of 32 consecutive threads).
  • Organize data in row-major order for row-based accesses in 2D arrays, for instance.
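The difference is easiest to see in code. In the sketch below (illustrative names), the first kernel lets consecutive threads read consecutive addresses, which coalesces into few transactions; the second makes neighboring threads read addresses that are far apart, which does not:

// Coalesced: thread 0 reads element 0, thread 1 reads element 1, ...
__global__ void coalescedCopy(const float* in, float* out, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) out[idx] = in[idx];
}

// Strided: neighboring threads touch addresses `stride` elements apart,
// so each warp request spans many memory segments.
__global__ void stridedCopy(const float* in, float* out, int N, int stride) {
    int idx = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    if (idx < N) out[idx] = in[idx];
}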

Occupancy and Resource Utilization#

Occupancy is the ratio of active warps per SM to the maximum number of warps supported. While high occupancy can help hide latencies, it doesn’t always guarantee the best performance. Optimizing resource usage (registers, shared memory) can increase occupancy, but the best setting depends on your specific kernel.
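The runtime can suggest a launch configuration for a given kernel via the occupancy API. A minimal sketch, reusing the vectorAdd kernel and buffers from the earlier example:

// Ask the runtime for a block size that maximizes occupancy for vectorAdd.
int minGridSize = 0;   // minimum grid size needed to fully occupy the device
int blockSize   = 0;   // suggested threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);

int gridSize = (N + blockSize - 1) / blockSize;
vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);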

CUDA C++ and Template Programming#

CUDA allows you to use advanced C++ (templates, classes, lambdas as of certain CUDA versions). For example, you can write templated kernels to handle different data types or dimensions:

template <typename T>
__global__ void myKernel(T* data, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) {
        data[idx] = data[idx] + static_cast<T>(1);
    }
}

This saves on code duplication, especially when working with different numeric types.

Asynchronous Memory Copy#

Using asynchronous memory copies (cudaMemcpyAsync) in combination with streams can further improve performance by overlapping data transfers with kernel execution.
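A sketch of the pattern, reusing the device buffers and launch parameters from the earlier vector-add example (h_in and h_out are new, illustrative pinned buffers; d_B is assumed to be already populated):

// Pinned (page-locked) host buffers are required for copies that truly
// overlap with kernel execution.
float *h_in, *h_out;
cudaMallocHost(&h_in,  size);
cudaMallocHost(&h_out, size);

cudaStream_t stream;
cudaStreamCreate(&stream);

// Upload, compute, and download are queued on one stream; they run in order
// within the stream but can overlap with work in other streams.
cudaMemcpyAsync(d_A, h_in, size, cudaMemcpyHostToDevice, stream);
vectorAdd<<<blocks, threadsPerBlock, 0, stream>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_out, d_C, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // wait for the whole pipeline to finish

cudaStreamDestroy(stream);
cudaFreeHost(h_in);
cudaFreeHost(h_out);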


Using Libraries and Frameworks#

CUDA provides a rich ecosystem of libraries that accelerate common operations:

  • cuBLAS: Optimized BLAS (Basic Linear Algebra Subprograms).
  • cuDNN: Deep Neural Network primitives.
  • cuFFT: Fast Fourier Transform library.
  • Thrust: A C++ template library resembling the C++ Standard Template Library (STL) for parallel algorithms (sort, reduce, transform, etc.).
  • cuSPARSE: Sparse matrix operations.

Example: Using cuBLAS

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);

// Multiply two matrices using cuBLAS: C = alpha * A * B + beta * C.
// cuBLAS assumes column-major storage; d_A is M x K, d_B is K x N, d_C is M x N.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            M, N, K,
            &alpha,
            d_A, M,
            d_B, K,
            &beta,
            d_C, M);

cublasDestroy(handle);

By leveraging these libraries, you can save time and ensure highly optimized performance.
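As a taste of how concise these libraries can be, here is a small Thrust sketch that sorts a device vector and sums it without writing any kernel by hand (it reuses h_A and N from the earlier example):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Copy host data into a device vector, sort it on the GPU, and reduce it.
thrust::device_vector<float> d_vec(h_A, h_A + N);
thrust::sort(d_vec.begin(), d_vec.end());
float sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f);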


Debugging and Profiling#

Debugging Tools#

  • cuda-gdb: A CUDA extension to GDB for debugging kernels.
  • Nsight Eclipse Edition (Linux) or NVIDIA Nsight Visual Studio Edition (Windows): Powerful IDE add-ons for debugging and profiling.

Profiling for Performance#

  • nvprof (deprecated in newer releases, replaced by Nsight Systems / Nsight Compute).
  • Nsight Systems and Nsight Compute: Extensive GPU performance analysis for memory usage, warp efficiency, instruction throughput, etc.

Checking the occupancy, memory transfer times, and kernel launch configurations helps you spot bottlenecks in your code.
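Typical command lines look roughly like the following (the executable name is a placeholder, and flags may vary between toolkit versions):

# Timeline of CPU/GPU activity, kernel launches, and memory transfers
nsys profile -o my_report ./my_app

# Detailed per-kernel metrics such as memory throughput and warp efficiency
ncu -o my_kernels ./my_app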


Professional-Level Expansions#

GPU Multitasking and MPS#

On multi-user GPU servers, you may encounter Multi-Process Service (MPS), allowing multiple CUDA applications to share the same GPU context more efficiently. Administrators can configure MPS to improve throughput in HPC or data center environments.
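Starting and stopping the MPS control daemon is typically an administrator task; on Linux it looks roughly like this (paths and permissions vary by installation):

# Start the MPS control daemon (as the user or service that owns the GPU)
nvidia-cuda-mps-control -d

# Shut it down again
echo quit | nvidia-cuda-mps-control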

Unified Memory#

Unified Memory automatically manages data migration between the CPU and GPU, freeing you from explicit memory transfers. While convenient, this may introduce performance overhead if memory accesses are unpredictable or if your code does frequent migrations.
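With Unified Memory, a single cudaMallocManaged allocation is visible to both host and device. A minimal sketch, reusing the vectorAdd kernel and launch parameters from the earlier example:

// One managed allocation, usable from both CPU and GPU code.
float *A, *B, *C;
cudaMallocManaged(&A, N * sizeof(float));
cudaMallocManaged(&B, N * sizeof(float));
cudaMallocManaged(&C, N * sizeof(float));

for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }   // host writes directly

vectorAdd<<<blocks, threadsPerBlock>>>(A, B, C, N);
cudaDeviceSynchronize();        // wait before the host reads C

std::cout << "C[0] = " << C[0] << std::endl;
cudaFree(A); cudaFree(B); cudaFree(C);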

Multi-GPU and Peer-to-Peer#

For very large problems, you might use multiple GPUs in one system. With Peer-to-Peer (P2P) communication, GPUs can exchange data directly without routing it through the CPU. CUDA-aware MPI implementations can further facilitate scaling across multi-node GPU clusters.

Example: Multi-GPU Memory Copy#

int device0 = 0;
int device1 = 1;

// Allocate a buffer on each GPU
cudaSetDevice(device0);
float* d_data0;
cudaMalloc(&d_data0, size);

cudaSetDevice(device1);
float* d_data1;
cudaMalloc(&d_data1, size);

// Enable peer access in both directions; each call applies to the
// currently selected device and names the peer it should reach.
cudaDeviceEnablePeerAccess(device0, 0);   // device1 -> device0
cudaSetDevice(device0);
cudaDeviceEnablePeerAccess(device1, 0);   // device0 -> device1

// Now data can move directly between the two GPUs
cudaMemcpyPeer(d_data1, device1, d_data0, device0, size);

Custom CUDA Kernels for Machine Learning#

While frameworks like TensorFlow and PyTorch abstract CUDA details, advanced ML developers may create custom CUDA kernels for specialized operations. This includes custom activation functions, advanced indexing, or new types of data augmentation on-the-fly.
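As a flavor of what such a kernel might look like, here is a sketch of a custom leaky-ReLU activation (the name and slope parameter are illustrative); wiring it into a framework additionally requires that framework's extension API:

// Element-wise leaky ReLU: x if x > 0, otherwise alpha * x.
__global__ void leakyRelu(const float* in, float* out, int N, float alpha) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) {
        float x = in[idx];
        out[idx] = (x > 0.0f) ? x : alpha * x;
    }
}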


Conclusion#

Congratulations on making it through this comprehensive introduction to GPU computing with CUDA! We covered:

  • Basic GPU concepts and how parallel computing differs from CPU computing.
  • The CUDA architecture and how threads are organized.
  • Memory spaces and how to use them efficiently.
  • Synchronization, concurrency, and best practices in CUDA kernels.
  • Debugging, profiling, and advanced topics like multi-GPU programming.

At this point, you should be equipped to:

  1. Write and launch basic CUDA kernels.
  2. Manage device memory and leverage shared memory for better performance.
  3. Apply synchronization techniques and concurrency to maximize GPU utilization.
  4. Use existing CUDA libraries or build custom kernels for specialized use cases.
  5. Move on to professional expansions like multi-GPU setups, unified memory, and peer-to-peer communication.

Remember: the key to mastering CUDA is continuous experimentation and performance tuning. Use the provided tools (profilers, debuggers) and libraries to refine your applications. With practice, you’ll become adept at unleashing the massively parallel power of GPUs. Good luck on your CUDA journey!

Author: AICore · Published: 2025-05-01 · License: CC BY-NC-SA 4.0