Cpp Notes

ch1

Ch 1: Introduction to GPU Kernels and Hardware

CPU Architecture

Overview of CPU Components

  • Master Clock

    • Controls execution timing with pulses at a fixed frequency.
    • Early IBM PCs (1981) ran at 2.2 MHz, increasing over time to 4 GHz (2002).
    • Modern Intel CPUs operate at ~3.5 GHz, with short turbo boosts to 4 GHz.
    • Power consumption and heat generation increase with frequency.
  • Memory

    • Stores program code and data.
    • Read operations are handled by the load/save unit or program fetch unit.
    • Write operations are usually handled only by the load/save unit.
  • Load/Save Unit

    • Reads/writes data between memory and CPU registers.
    • Controlled by the execute unit, which determines read/write operations.
  • Register File

    • High-speed storage within the CPU for temporary data.
    • Data must be in registers before being processed by the ALU.
  • Arithmetic Logic Unit (ALU)

    • Performs arithmetic and logical operations on register data.
  • Execute Unit

    • Decodes instructions from the fetch unit.
    • Manages data transfer to registers and ALU operations.
    • Transfers results back to memory.
  • Fetch Unit

    • Retrieves instructions from main memory.
    • Uses the Program Counter (PC) to track execution.
    • Handles branch instructions by updating the PC accordingly.

CPU Memory Management & Latency Hiding

  • Memory Access Latency

    • CPU accesses memory step by step, causing delays.
    • Typical CPU memory latency: tens of clock cycles.
    • GPU memory latency: hundreds of clock cycles.
  • Caching & Pipelining

    • Used to hide latency and improve performance.
    • Memory access patterns in loops allow hardware to prefetch adjacent data efficiently.
    • Conceptually similar to water flowing through pipes—initial delay, then continuous flow.
  • CPU Cache Hierarchy

    • L1 Cache (Fastest, smallest, per-core)
      • Separate caches for data and instructions.
    • L2 Cache (Larger, slower, per-core)
    • L3 Cache (Largest, slowest, shared across cores)
    • Cache lines (typically 64 bytes) optimize data transfer efficiency.
    • Intel CPUs transfer two adjacent cache lines at once (effective size: 128 bytes).

  • Speculative Execution & Branch Prediction
    • Instruction pipelines speed up execution.
    • Branch instructions break pipelines, requiring advanced speculative execution.
    • The CPU pre-executes multiple potential paths and discards incorrect ones.

3. CPU Parallelism: Vector Instructions (SIMD)

  • Single Instruction, Multiple Data (SIMD)

    • Enables parallel processing within a single CPU core.
  • Evolution of Intel SIMD Instructions

    • SSE (Pentium III, 1999)
      • 128-bit registers, each holding 4× 4-byte floats.
      • Aligned data loads/stores allow single-cycle operations.
    • AVX2 (Modern Intel CPUs)
      • 256-bit registers, supporting more data types.
    • AVX-512 (Latest Intel CPUs)
      • 512-bit registers, holding 16 floats or 8 doubles.
      • Significant speed-up for vectorized computations.
  • Practical Use of AVX

    • Optimizing floating-point computations by processing multiple elements in one instruction.

GPU Origins & Parallel Processing in Graphics

  • Designed for high-performance graphics

    • A 1920×1080 screen at 60 Hz requires computing ~125 million pixels per second.
    • Each pixel's color is computed independently, making this a massively parallel problem.
    • Early CPUs were not powerful enough, leading to the rise of dedicated gaming GPUs.
  • Frame Buffer & Video RAM

    • Stores image data in a 2D array, with 3 bytes per pixel (RGB format).
    • Specialized video RAM allows simultaneous CPU writes and independent GPU reads for rendering.

General-Purpose Computing on GPUs (GPGPU)

  • GPUs evolved from gaming hardware to general-purpose computing tools.
  • 2001: GPGPU concept introduced, enabling GPUs for scientific and computational tasks.
  • 2007: NVIDIA launched CUDA, making GPU programming mainstream.

NVIDIA GPU Product Lines

  1. GeForce (GTX, RTX, Titan) – Gaming GPUs

    • Optimized for FP32 (single-precision) calculations.
    • Limited FP64 (double-precision) support.
    • No ECC (Error-Correcting Code) memory.
  2. Tesla – Scientific & Server GPUs

    • Designed for high-end scientific computing.
    • Strong FP64 support and ECC memory for accuracy.
    • No video output ports, unsuitable for gaming.
    • Used in server farms and data centers.
  3. Quadro – Workstation GPUs

    • Tesla-class hardware with added graphics capabilities.
    • Targeted at high-end workstations for design, visualization, and scientific applications.
  • NVIDIA GPU Generations (2007–2020)
    • Each new generation adds more hardware features.
    • Generations are named after famous scientists.

GPU Hardware Architecture (Pascal GTX 1080 Example)

  • Compute Core (Basic Unit)

    • Performs 32-bit floating-point & integer operations.
    • No individual program counters—executes instructions in groups.
  • Warp Engine (WE) – 32-Core Execution Unit

    • Processes 32 threads (warp) in lockstep using one program counter.
    • Includes:
      • 8 Special Function Units (SFUs) (for fast sin, exp, etc.).
      • 1 or 16 FP64 units (for double-precision operations).
  • Streaming Multiprocessors (SMs) – Higher-Level Processing Units

    • Each SM has multiple Warp Engines.
    • Pascal GPUs:
      • 2 (Tesla GP100) or 4 Warp Engines per SM.
      • 128 compute cores per SM.
    • CUDA thread blocks execute within a single SM.
    • Thread blocks within an SM can communicate, but blocks in different SMs cannot.
    • Shared memory (96 KB) & L1 cache (24 KB or 48 KB) per SM.
  • Full GPU Structure

    • Multiple SMs form the final GPU.
    • Example: GTX 1080
      • 20 SMs × 128 cores = 2560 compute cores.
      • On-chip L2 cache (2–4 GB) shared by all SMs.
  • Different GPU Models for Different Needs

    • Gaming GPUs (e.g., GTX 1030) have fewer SMs.
    • Differences in clock speed, memory size, and performance.

GPU Processing Model (Warp-Based Execution)**

  • Each compute core processes 32-bit data in a stream-like fashion.
  • Warp Engine groups 32 cores into a synchronized execution unit.
  • Streaming Multiprocessors (SMs) contain multiple Warp Engines.
  • Multiple SMs make up the full GPU.

GPU Memory Types & Hierarchical Structure

GPU memory is hierarchically organized, similar to CPU caches, with various levels optimized for different access patterns.

1. Main Memory (Global Memory)

  • Equivalent to CPU RAM—stores program code and data.
  • Accessible by both CPU and GPU via the PCIe bus (slow, should be minimized).
  • Persistent between kernel calls, allowing reuse without reloading.
  • Asynchronous memory transfers allow overlapping computation and data movement (useful for tasks like video frame processing).
  • Texture & constant memory reside in global memory but have dedicated caches.

2. Constant Memory

  • 64 KB reserved in global memory.
  • Optimized for read access:
    • Dedicated cache bypasses L2, allowing fast access if all threads in a warp read the same value.
  • Compiler optimizations:
    • const and restrict hints help the NVCC compiler automatically use constant memory.
    • Explicit use is usually unnecessary due to modern compiler optimizations.
  • Limited size—not suitable for large data tables.

3. Texture Memory

  • Originally for graphics processing, now useful for general-purpose computing.
  • Stores 1D, 2D, or 3D arrays optimized for spatial locality.
  • Read-only with dedicated texture caches.
  • Accessed using special lookup functions:
    • tex1D, tex2D, tex3D – perform fast interpolation (1D linear, 2D bilinear, 3D trilinear).
  • Recent CUDA updates:
    • Layered textures (stacks of indexed 1D/2D textures).
    • Surfaces (can be written to by the GPU).
  • Recommended for image processing and spatial data access.

4. Local Memory

  • Private to each thread, used when registers are insufficient.
  • Not physically separate—stored in global memory, cached via L1 and L2.
  • Automatically managed by the CUDA compiler.

5. Register File

  • Each SM has 64K 32-bit registers.
  • Registers are shared among thread blocks executing on an SM.
  • Performance Consideration:
    • An SM supports up to 64 warps (2048 threads).
    • If a thread uses more than 32 registers, the number of active threads (occupancy) decreases.
    • NVCC option: --maxrregcount <number> allows manual tuning of register usage vs. occupancy.

6. Shared Memory

  • 32 KB – 64 KB per SM (shared among thread blocks on the SM).
  • Fast and useful for intra-thread block communication.
  • Memory allocation:
    • Can be defined at compile time or dynamically at kernel launch.
    • If a kernel requests more than half of shared memory, only one block can run per SM, reducing occupancy.
  • Performance Trade-Off:
    • Early GPUs had poor caching, so shared memory was heavily used to avoid slow global memory accesses.
    • Modern GPUs have better L1/L2 caching, reducing the need for shared memory optimization.
    • Use shared memory judiciously—balance faster memory access against occupancy reduction.

7. Memory Coalescing & Access Optimization

  • Caches (L1 & L2) + high occupancy help hide global memory latency.
  • Memory coalescing:
    • Optimal pattern: 32 threads in a warp should access 32-bit variables in adjacent memory locations.
    • Starting address should be aligned on a 32-word boundary.
  • Early CUDA versions emphasized coalescing due to poor caching.
  • Modern GPUs are more forgiving, but aligned and sequential memory access remains best practice.

Key Takeaways

  • Minimize global memory access—use registers and shared memory where possible.
  • Use constant memory for small read-only datasets accessed by all threads.
  • Texture memory is beneficial for interpolations and spatial access patterns.
  • Too many registers per thread reduces occupancy—use NVCC flags to tune it.
  • Shared memory speeds up intra-block communication, but excessive use can hurt occupancy.
  • Memory coalescing improves caching efficiency, reducing global memory latency.

GPU Memory Types & Hierarchical Structure

GPU memory is hierarchically organized, similar to CPU caches, with various levels optimized for different access patterns.


1. Main Memory (Global Memory)

  • Equivalent to CPU RAM—stores program code and data.
  • Accessible by both CPU and GPU via the PCIe bus (slow, should be minimized).
  • Persistent between kernel calls, allowing reuse without reloading.
  • Asynchronous memory transfers allow overlapping computation and data movement (useful for tasks like video frame processing).
  • Texture & constant memory reside in global memory but have dedicated caches.

2. Constant Memory

  • 64 KB reserved in global memory.
  • Optimized for read access:
    • Dedicated cache bypasses L2, allowing fast access if all threads in a warp read the same value.
  • Compiler optimizations:
    • const and restrict hints help the NVCC compiler automatically use constant memory.
    • Explicit use is usually unnecessary due to modern compiler optimizations.
  • Limited size—not suitable for large data tables.

3. Texture Memory

  • Originally for graphics processing, now useful for general-purpose computing.
  • Stores 1D, 2D, or 3D arrays optimized for spatial locality.
  • Read-only with dedicated texture caches.
  • Accessed using special lookup functions:
    • tex1D, tex2D, tex3D – perform fast interpolation (1D linear, 2D bilinear, 3D trilinear).
  • Recent CUDA updates:
    • Layered textures (stacks of indexed 1D/2D textures).
    • Surfaces (can be written to by the GPU).
  • Recommended for image processing and spatial data access.

4. Local Memory

  • Private to each thread, used when registers are insufficient.
  • Not physically separate—stored in global memory, cached via L1 and L2.
  • Automatically managed by the CUDA compiler.

5. Register File

  • Each SM has 64K 32-bit registers.
  • Registers are shared among thread blocks executing on an SM.
  • Performance Consideration:
    • An SM supports up to 64 warps (2048 threads).
    • If a thread uses more than 32 registers, the number of active threads (occupancy) decreases.
    • NVCC option: --maxrregcount <number> allows manual tuning of register usage vs. occupancy.

6. Shared Memory

  • 32 KB – 64 KB per SM (shared among thread blocks on the SM).
  • Fast and useful for intra-thread block communication.
  • Memory allocation:
    • Can be defined at compile time or dynamically at kernel launch.
    • If a kernel requests more than half of shared memory, only one block can run per SM, reducing occupancy.
  • Performance Trade-Off:
    • Early GPUs had poor caching, so shared memory was heavily used to avoid slow global memory accesses.
    • Modern GPUs have better L1/L2 caching, reducing the need for shared memory optimization.
    • Use shared memory judiciously—balance faster memory access against occupancy reduction.

7. Memory Coalescing & Access Optimization

  • Caches (L1 & L2) + high occupancy help hide global memory latency.
  • Memory coalescing:
    • Optimal pattern: 32 threads in a warp should access 32-bit variables in adjacent memory locations.
    • Starting address should be aligned on a 32-word boundary.
  • Early CUDA versions emphasized coalescing due to poor caching.
  • Modern GPUs are more forgiving, but aligned and sequential memory access remains best practice.

GPU Memory Key Takeaways

  • Minimize global memory access—use registers and shared memory where possible.
  • Use constant memory for small read-only datasets accessed by all threads.
  • Texture memory is beneficial for interpolations and spatial access patterns.
  • Too many registers per thread reduces occupancy—use NVCC flags to tune it.
  • Shared memory speeds up intra-block communication, but excessive use can hurt occupancy.
  • Memory coalescing improves caching efficiency, reducing global memory latency.

Warps and Waves in CUDA Programming

1. Importance of Choosing the Right Number of Threads (Nthreads)

  • The number of threads (Nthreads) is problem-specific and crucial for performance.
  • A good rule of thumb is to maximize Nthreads for full GPU utilization.
  • Examples:
    • For a 1D sum computation with 10⁹ steps → use 10⁹ threads.
    • For image processing with nx × ny pixels → use Nthreads = nx × ny.

2. Misconception: Nthreads = Ncores is Sufficient

  • Reality: A GPU hides memory and execution latencies by rapidly switching between threads.
  • More threads than cores are needed to keep the GPU fully occupied at all times.

3. Example: CUDA Execution Model on an RTX 2070 GPU

  • Hardware Breakdown:
    • Streaming Multiprocessors (SMs): Nsm = 36
    • Warps per SM: Nwarp = 2
    • Threads per warp: 32
    • Total CUDA cores: [ Ncores = Nsm × Nwarp × 32 = 36 × 2 × 32 = 2304 ]
    • Resident Threads per SM (Nres): 1024 (i.e., 32 warps per SM)
    • Active Warps per SM: 2 (remaining 30 warps are suspended, waiting for memory).

4. Latency Hiding & Thread Execution in Waves

  • CUDA processes threads in waves to maximize GPU utilization.
  • Wave Size (Nwave): $$ Nwave = Nres × Nsm = 1024 × 36 = 36864 \text{ threads per wave} $$
  • If launching 10⁹ threads:
    • Total waves required: $$ 10⁹ \div Nwave = 10⁹ \div 36864 ≈ 27127 \text{ waves} $$
    • Last wave may be incomplete if Nthreads is not a multiple of Nwave.
  • Best practice: Set Nthreads to at least Nwave and preferably a multiple of Nwave.

5. GPU Model Variations & Impact on Nwave

  • Turing GPUs (e.g., RTX 2070, RTX 2080):
    • Resident threads per SM (Nres): 1024
    • Example (RTX 2080):
      • Nsm = 46Nwave = 1024 × 46 = 47104
  • Other recent NVIDIA GPUs:
    • Nres = 2048 (double Turing’s value).
    • Same Nwarp = 2, but Nwave doubles due to Nres increase.
    • Varies by model—higher Nsmlarger Nwave.

Key Takeaways

  • Choose Nthreads wisely—large enough to utilize full GPU potential.
  • GPU latency hiding is achieved by keeping many resident threads ready to run.
  • Threads execute in "waves" (Nwave), and optimal Nthreads is a multiple of Nwave.
  • Different GPU models have different Nsm and Nres, affecting Nwave size.

Blocks and Grids in CUDA

1. Thread Blocks: Basic Concept

  • Thread blocks are groups of threads that execute together on the same Streaming Multiprocessor (SM).
  • Key properties:
    • Threads within a block can communicate via shared or global memory.
    • Threads in different blocks cannot communicate during kernel execution.
    • Synchronization is only possible within the same block, not across blocks.
  • Thread block size recommendations:
    • Must be a multiple of warp size (32).
    • Maximum size per block: 1024 threads.
    • Common choice: 256 threads per block (submultiple of 1024).

2. Thread Execution on SMs

  • An SM can hold multiple thread blocks at the same time.
  • Example:
    • If block size is 256 threads, up to 4 thread blocks can be assigned per SM (since 1024 / 256 = 4).
    • On non-Turing GPUs, up to 8 thread blocks may coexist on an SM.
  • Even if multiple thread blocks coexist on the same SM, they cannot communicate with each other.

3. Kernel Launch Configuration: Grid & Blocks

  • CUDA kernels are launched with two key parameters:
    1. Thread block size (threads)
    2. Number of thread blocks (blocks)
  • Total threads (Nthreads): $$ Nthreads = threads × blocks $$
  • Choosing blocks to match problem size (N):
    • Ensure that Nthreads ≥ N (total threads must be at least the required number).

    • If N does not evenly divide by threads, blocks must be rounded up.

    • Example:

      blocks = (N + threads - 1) / threads;
    • Some extra threads may be created, so out-of-range checks are needed in kernel code (e.g., if (threadIdx.x >= N) return;).


4. Waves & SM Assignment

  • NVIDIA documentation does not focus much on waves, but a 2014 blog post by Julien Demouth (link) mentions that threads are dispatched in complete waves when possible.
  • Optimizing Grid Configuration:
    • Ensure blocks is a multiple of the number of SMs (Nsm) for best performance.
    • This helps avoid uneven thread distribution across SMs, reducing the "tail effect" (when some SMs remain underutilized at the end of execution).

Key Takeaways

  • Thread blocks execute on one SM and can communicate via shared memory.
  • Optimal thread block size: Multiple of 32, typically 256 or 512.
  • SMs can process multiple thread blocks at the same time, but blocks do not communicate.
  • Kernel launch requires specifying threads and blocks:
    • Nthreads = threads × blocks.
    • blocks should be rounded up to cover the required N.
  • "Waves" optimize executionblocks should be a multiple of SM count (Nsm) for better load balancing.

CUDA Thread Indexing Variables

Variable Description
threadIdx.x Thread rank within a block (0 to blockDim.x - 1).
blockIdx.x Block rank within the grid (0 to gridDim.x - 1).
blockDim.x Number of threads in one block.
gridDim.x Number of blocks in the grid.
warpSize Number of threads in a warp (always 32 on current NVIDIA GPUs).
Thread rank in block id = threadIdx.x
Thread rank in grid id = blockDim.x * blockIdx.x + threadIdx.x
Total number of threads threads = gridDim.x * blockDim.x

CUDA Occupancy & Resource Constraints

1. Definition of Occupancy

  • Occupancy = (Number of resident threads in SM) / (Maximum resident threads Nres).
  • Expressed as a percentage.
  • 100% occupancy means that complete waves are running on all SMs.

2. Factors Affecting Occupancy

Even if enough threads are launched for full occupancy, other hardware constraints may reduce actual occupancy:

  • Register Limitations
    • NVIDIA GPUs allow up to 32 registers per thread while maintaining full occupancy.
    • Excess register usage reduces active threads per SM, lowering occupancy.
  • Shared Memory Limitations
    • Non-Turing GPUs: 64 KB or 96 KB per SM32 or 48 bytes per thread at full occupancy.
    • Ampere GPUs: Increased to 80 bytes per thread at full occupancy.
    • High shared memory usage reduces the number of resident thread blocks, lowering occupancy.

3. Performance Considerations

  • Lower occupancy is not always bad—acceptable in compute-bound kernels.
  • Memory-bound kernels benefit from higher occupancy to hide memory latency.
  • Optimization strategies:
    • If shared memory usage is too high, consider using global memory and relying on L1 caching instead.
    • Experimentation is required to find the best balance between occupancy and resource usage.