My notes while reading about GPUs
I had a bunch of Notion pages with notes I'd written while reading and watching videos about GPUs for CUDA, so I thought I'd try some vibe blogging: I gave Claude my notes and asked it to shape them into a blog post.
Hope this one helps!
Why GPUs Matter for Modern Engineering
In today's computational landscape, Graphics Processing Units (GPUs) have evolved far beyond their original purpose of rendering video game graphics. They've become powerful general-purpose computing workhorses, accelerating everything from machine learning to scientific simulations. As an engineer, understanding GPU architecture and programming models can unlock tremendous performance improvements for data-parallel workloads.
This guide will introduce you to GPU computing fundamentals, explaining how GPUs differ from CPUs, the basic programming model, and memory considerations that are essential for effective GPU development.
CPU vs GPU: Understanding the Architectural Differences
At their core, CPUs and GPUs represent fundamentally different design philosophies:
CPU Architecture: Optimized for Sequential Performance
CPUs are designed with:
- Fewer cores (typically 4-64) with complex control units
- Sophisticated branch prediction and out-of-order execution
- Deep cache hierarchies (L1, L2, L3)
- Optimized for low-latency, sequential processing
A modern CPU prioritizes minimizing the time to complete individual tasks through sophisticated control logic and cache hierarchies.
GPU Architecture: Designed for Parallel Throughput
GPUs take a radically different approach with:
- Many simple cores (often thousands)
- Minimal control logic per core
- Simpler cache hierarchy
- Optimized for high-throughput parallel computation
This design makes GPUs extraordinarily efficient at processing large datasets where the same operation needs to be performed across many data points simultaneously.
CUDA: The Language of GPU Computing
To harness GPU power, you'll need a framework that allows you to program these devices. NVIDIA's CUDA (Compute Unified Device Architecture) is one of the most popular:
Key CUDA Terminology
- Host: The CPU and its memory
- Device: The GPU and its memory
- Kernel: A function that runs on the GPU
- SIMT: Single Instruction, Multiple Threads - the GPU execution model, NVIDIA's take on the classic SIMD (Single Instruction, Multiple Data) style
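To make these terms concrete, here is a minimal sketch (the kernel name and launch configuration are just illustrative): a kernel marked `__global__` is compiled for the device and launched from host code.

```cuda
#include <cstdio>

// Kernel: a function that runs on the device (GPU).
// __global__ marks it as launchable from the host (CPU).
__global__ void hello_kernel() {
    // Every thread runs this same body; threadIdx distinguishes them.
    printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
    // Host code launches the kernel: 1 block of 8 threads.
    hello_kernel<<<1, 8>>>();
    cudaDeviceSynchronize();  // wait for the device to finish
    return 0;
}
```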
The Thread Hierarchy: How GPUs Organize Work
CUDA uses a hierarchical execution model:
- Threads: The basic execution unit - each runs the same code but on different data
- Blocks: Groups of threads that can communicate and synchronize
- Grid: A collection of blocks that form a complete kernel execution
This hierarchy maps elegantly to hardware:
- Blocks are assigned to Streaming Multiprocessors (SMs)
- Threads within blocks execute on cores within those SMs
Within an SM, threads actually execute in groups of 32 called warps, which is where the SIMT model comes from. This organization enables massive parallelism while still providing the synchronization mechanisms that cooperating threads need.
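In code, each thread typically combines its block and thread indices into a single global index so that every thread handles a different element. A minimal sketch (the kernel and names are illustrative):

```cuda
__global__ void scale(float *data, float factor, int n) {
    // blockIdx.x  = this block's position in the grid
    // blockDim.x  = number of threads per block
    // threadIdx.x = this thread's position within its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // guard: the grid may overshoot n
        data[i] *= factor;
    }
}

// Launch with enough 256-thread blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```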
Memory in the GPU World
Understanding GPU memory is crucial for writing efficient code. Let's explore the memory hierarchy:
Global Memory
- Largest memory pool on the GPU
- Accessible by all threads in all blocks, and by the host via explicit copies
- Highest latency among GPU memory types
- Used for transferring data between host and device
Shared Memory
- Much faster than global memory but smaller in size
- Visible only to threads within the same block
- Allows threads to share results and temporary calculations
- Often described as "programmer-controlled cache"
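To illustrate the "programmer-controlled cache" idea, here's a sketch of a common pattern: each thread stages one element into shared memory, the block synchronizes, and threads then read their neighbors' values from fast shared memory instead of global memory. The kernel is illustrative and skips block-boundary elements to stay short.

```cuda
// Assumes a launch with blockDim.x == 256 to match the tile size.
__global__ void shift_left(const float *in, float *out, int n) {
    __shared__ float tile[256];     // visible only to threads in this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];  // one slow global-memory read per thread
    }
    __syncthreads();                // wait until the whole tile is staged

    // Read a neighbor's element from fast shared memory.
    // (Elements on block boundaries are skipped in this simplified sketch.)
    if (i + 1 < n && threadIdx.x + 1 < blockDim.x) {
        out[i] = tile[threadIdx.x + 1];
    }
}
```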
Constant Memory
- Read-only during kernel execution
- Cached and optimized for broadcast access patterns
- Perfect for values that don't change and are accessed by many threads
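Here's a sketch of constant memory in use (the polynomial and its coefficients are just an example): every thread reads the same small table, which is exactly the broadcast pattern the constant cache is built for.

```cuda
// Constant memory: declared at file scope, read-only inside kernels.
__constant__ float coeffs[4];

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All threads read the same coeffs: an ideal broadcast access.
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Host side, before launching:
// float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```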
Registers
- Fastest memory on the GPU
- Thread-local (each thread has its own)
- Limited in number
- Automatically managed by the CUDA compiler
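There's no explicit syntax for registers; ordinary local variables inside a kernel usually end up in them, one private copy per thread. A tiny sketch:

```cuda
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register
    if (i < n) {
        float xi = x[i];  // xi too: register-resident, private to this thread
        y[i] = a * xi + y[i];
    }
}
// Compiling with `nvcc -Xptxas=-v` reports per-thread register usage,
// which matters because exhausting registers spills data to slower memory.
```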
The GPU Computing Pipeline
A typical GPU-accelerated workflow follows these steps:
- Allocate and initialize resources on the host (CPU)
- Allocate memory on the device (GPU)
- Transfer data from host to device
- Execute GPU kernels to process the data
- Transfer results from device back to host
This pattern highlights a key consideration in GPU programming: data movement between host and device can be expensive, so minimizing transfers is essential for performance.
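Putting those five steps together, here's a minimal end-to-end vector-add sketch (error checking omitted to keep it short):

```cuda
#include <vector>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1. Allocate and initialize resources on the host.
    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n);

    // 2. Allocate memory on the device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 3. Transfer data from host to device.
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    // 4. Execute the kernel.
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // 5. Transfer results back to the host (this copy also synchronizes).
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```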
Memory Types and Management
When working with GPU memory, you'll encounter several allocation strategies:
- Pageable memory: Standard CPU memory allocation
- Pinned memory: Non-pageable memory that enables faster transfers
- Mapped memory: Memory accessible by both CPU and GPU
- Unified memory: Automatically managed memory visible to both CPU and GPU
Choosing the right memory type for your application can significantly impact performance.
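As a sketch, here's what the four allocation strategies look like with the CUDA runtime API (error checking omitted; note that mapped memory may need extra setup on older GPUs):

```cuda
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;

    // Pageable memory: a standard CPU allocation.
    float *pageable = (float *)malloc(bytes);

    // Pinned (page-locked) memory: enables faster host<->device transfers.
    float *pinned;
    cudaMallocHost(&pinned, bytes);

    // Mapped memory: pinned host memory that kernels can access directly.
    float *mapped;
    cudaHostAlloc(&mapped, bytes, cudaHostAllocMapped);

    // Unified memory: one pointer, migrated between CPU and GPU on demand.
    float *unified;
    cudaMallocManaged(&unified, bytes);

    free(pageable);
    cudaFreeHost(pinned);
    cudaFreeHost(mapped);
    cudaFree(unified);
    return 0;
}
```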
Getting Started with GPU Programming
If you're new to GPU computing, here are some practical first steps:
- Start small: Begin with simple examples that perform basic operations
- Think parallel: Redesign your algorithms to expose parallelism
- Focus on memory: Pay attention to memory access patterns and transfers
- Profile early: Use tools like NVIDIA Nsight to identify bottlenecks
Conclusion
GPUs represent a powerful tool in the modern engineer's arsenal. Their massive parallel processing capabilities can accelerate computationally intensive tasks by orders of magnitude when properly utilized. While there's certainly a learning curve to effective GPU programming, the performance benefits make it well worth the investment for many applications.