Parallel Programming Notes (OpenMP, MPI, CUDA)
1. Big Picture
Parallel performance is constrained by three coupled factors:
- Algorithmic parallelism (how much work can be done concurrently),
- Data movement (memory hierarchy and communication cost),
- Synchronization overhead (dependencies and coordination).
A practical workflow is:
- optimize serial locality first,
- scale on shared memory (OpenMP),
- scale across nodes (MPI),
- offload data-parallel kernels (CUDA/GPU).
2. Performance Metrics and Limits
Let $T_s$ be best serial time and $T_p$ parallel time on $p$ processors:
\[S_p = \frac{T_s}{T_p}, \qquad E_p = \frac{S_p}{p}.\]
Amdahl-style bound with serial fraction $f$:
\[S_p \le \frac{1}{f + \frac{1-f}{p}}.\]
Implications:
- even small serial sections cap scalability,
- communication and synchronization often increase effective serial fraction,
- weak scaling and strong scaling should both be measured.
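As a quick illustration (numbers chosen here, not from the slides), even a 5% serial fraction caps speedup on 16 processors well below 16:
\[
f = 0.05,\; p = 16: \quad S_{16} \le \frac{1}{0.05 + \frac{0.95}{16}} \approx 9.1, \qquad E_{16} \le \frac{9.1}{16} \approx 0.57.
\]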
3. Memory Hierarchy and Locality
Slides emphasize that many programs run far below peak because of memory bottlenecks.
Hierarchy: registers -> L1/L2/L3 cache -> DRAM -> secondary storage.
Optimization principles:
- exploit temporal locality (reuse loaded data),
- exploit spatial locality (contiguous access),
- increase computational intensity (FLOPs per byte moved),
- use blocking/tiling and loop unrolling where appropriate.
For matrix multiplication, naive loops are often bandwidth-limited. Blocking raises cache reuse and moves performance toward compute-bound behavior.
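A minimal blocked matrix-multiply sketch in C, assuming row-major square matrices, C pre-zeroed, and n a multiple of the tile size BS (tile size and names are illustrative, not prescribed by the slides):

```c
/* Blocked matrix multiply: C += A * B for n x n row-major matrices.
   BS is an assumed tile size chosen so three BS x BS tiles fit in cache;
   n is assumed to be a multiple of BS to keep the sketch short. */
#define BS 64

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Update the C(ii,jj) tile using the A(ii,kk) and B(kk,jj) tiles. */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];   /* reused across the whole j loop */
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Keeping the three working tiles cache-resident is what raises reuse; BS is typically tuned per machine.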
4. Shared-Memory Parallelism (OpenMP)
4.1 Execution model
OpenMP provides directive-based multithreading with shared variables and private thread-local variables.
Core constructs:
- parallel, for, single, sections, task,
- synchronization: critical, atomic, barrier,
- reductions via the reduction(...) clause.
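A minimal sketch combining a parallel loop with a reduction clause, assuming a simple dot product (function and variable names are illustrative):

```c
#include <omp.h>

/* Dot product: the loop is shared across threads and each thread keeps a
   private partial sum that OpenMP combines at the end of the region. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```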
4.2 Scheduling
for schedules control iteration distribution:
- static[,chunk]: predictable, low runtime overhead,
- dynamic[,chunk]: better for load imbalance, higher scheduling overhead.
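A short sketch of the two schedules, with expensive_step() standing in for an arbitrary per-iteration cost (a hypothetical helper, not from the slides):

```c
#include <omp.h>

double expensive_step(int i) { return (double)i * i; }

void fill(int n, double *a, double *b)
{
    /* Uniform cost per iteration: static chunks, minimal scheduling overhead. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = expensive_step(i);

    /* Irregular cost per iteration: dynamic chunks of 16 trade overhead for balance. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        b[i] = expensive_step(i);
}
```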
4.3 Dependency considerations
- instruction-level dependencies,
- loop-carried dependencies,
- reduction patterns (sum/min/max),
- scan/prefix operations (parallelizable, but they require staged synchronization).
5. Distributed-Memory Parallelism (MPI)
5.1 Model
Each process has private memory; explicit communication is required.
Design steps:
- partition data,
- assign work,
- design communication pattern,
- minimize communication volume + synchronization.
5.2 Point-to-point and nonblocking
- blocking: MPI_Send, MPI_Recv,
- combined exchange: MPI_Sendrecv,
- nonblocking: MPI_Isend, MPI_Irecv plus completion calls.
Nonblocking communication allows overlap of communication and computation when dependencies permit.
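A hedged sketch of that overlap pattern, assuming one direction of a 1D halo exchange with neighbor ranks left/right; buffer names and the commented-out update routines are illustrative placeholders:

```c
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the receive and send, then keep computing. */
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* Work that does not depend on the incoming halo overlaps here. */
    /* interior_update(); */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Work that needs the received halo runs only after completion. */
    /* boundary_update(recvbuf); */
}
```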
5.3 Collectives and data movement
Collectives structure frequent communication motifs; derived datatypes reduce manual packing/unpacking for non-contiguous data.
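For example, a global sum collapses to a single collective call (a minimal sketch; names are illustrative):

```c
#include <mpi.h>

/* Every rank contributes one value; every rank receives the global sum. */
double global_sum(double local, MPI_Comm comm)
{
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;
}
```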
6. Parallel Dense Linear Algebra Case Study: GEPP
Gaussian Elimination with Partial Pivoting (GEPP):
\[PA = LU.\]
Partial pivoting improves practical numerical stability by selecting the largest available pivot in each column.
Performance issue: naive elimination relies heavily on BLAS-1/2 operations; modern hardware favors BLAS-3.
Key optimization from slides:
- blocked GEPP with delayed updates,
- accumulate rank-1 updates and apply as matrix-matrix operations,
- choose block size to balance cache fit and BLAS-3 efficiency.
This is the recurring HPC theme: reformulate for higher arithmetic intensity.
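In block notation (the 2x2 partitioning here is chosen for illustration), the delayed update becomes BLAS-3 work:
\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\qquad
U_{12} = L_{11}^{-1} A_{12},
\qquad
A_{22} \leftarrow A_{22} - L_{21} U_{12},
\]
where factoring the narrow panel $\binom{A_{11}}{A_{21}}$ (with its row interchanges applied across the remaining columns) stays BLAS-1/2 but touches little data, while the trailing update $A_{22} - L_{21}U_{12}$ is a matrix-matrix multiply.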
7. CUDA Programming Fundamentals
7.1 Host-device model
CUDA applications split into:
- host code (CPU, orchestration + memory management),
- device kernels (GPU, massive data-parallel execution).
Kernel launch:
Kernel<<<numBlocks, blockSize>>>(...);
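A minimal sketch of the host-device split, assuming device buffers are already allocated and filled (kernel and variable names are illustrative):

```cuda
/* Device kernel: one thread per output element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)                                      /* guard the partial last block */
        c[i] = a[i] + b[i];
}

/* Host side: choose a block size and round the grid size up. */
void launch_vecAdd(const float *d_a, const float *d_b, float *d_c, int n)
{
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
}
```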
7.2 Thread organization and SIMT
Threads are grouped into blocks; blocks form a grid. Mapping 1D/2D problems to grid/block geometry is central for performance and correctness.
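A hedged 2D example, assuming a width x height image stored row-major (names and the 16x16 block shape are illustrative); the bounds check handles partial blocks at the edges:

```cuda
__global__ void scale2D(float *img, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  /* column index */
    int y = blockIdx.y * blockDim.y + threadIdx.y;  /* row index */
    if (x < width && y < height)
        img[y * width + x] *= s;
}

/* Host launch: 16x16 thread blocks covering the whole image. */
void launch_scale2D(float *d_img, int width, int height, float s)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(d_img, width, height, s);
}
```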
7.3 Memory behavior
Global DRAM bandwidth is often limiting. Performance requires:
- coalesced global memory access,
- shared-memory tiling,
- minimizing redundant loads,
- avoiding unnecessary synchronization/divergence.
8. CUDA Patterns: Reduction, Histogram, Matrix Multiplication
8.1 Reduction
Reduction (sum/max/min) is an associative combine pattern:
- partition input,
- local partial reductions,
- tree-style combine.
Parallel reduction gives $O(\log n)$ depth with enough threads.
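A minimal sketch of a block-level tree reduction in shared memory, assuming blockDim.x is a power of two (kernel and buffer names are illustrative); each block writes one partial sum, and a second pass or atomicAdd combines the per-block results:

```cuda
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;   /* one element per thread, zero-padded */
    __syncthreads();

    /* Tree combine: halve the number of active threads each step. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```

The launch must pass the dynamic shared-memory size, e.g. blockSum<<<grid, block, block * sizeof(float)>>>(...).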
8.2 Histogram
Histogram updates can create high contention.
Techniques in slides:
- atomic updates (global/shared memory),
- privatization (per-block local bins + merge),
- partitioning strategy tradeoffs.
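A hedged sketch of privatization, assuming 256 byte-valued bins (bin count and names are illustrative): contention moves into fast shared memory, and only one global atomic per bin per block remains.

```cuda
#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];

    /* Zero the per-block private bins cooperatively. */
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    /* Grid-stride loop: atomics hit shared memory, not DRAM. */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    /* Merge the private bins into the global histogram. */
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}
```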
8.3 Matrix multiplication on GPU
Naive kernel performs many global reads and is memory-bound. Tiled shared-memory kernel reuses subblocks and greatly reduces DRAM traffic.
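A minimal tiled kernel sketch for C = A * B, assuming square row-major matrices with n a multiple of the tile width (TILE and names are illustrative):

```cuda
#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        /* Each thread stages one element of the A and B tiles. */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

With tiling, each element of A and B is fetched from global memory roughly n/TILE times instead of n times, which is where the DRAM traffic reduction comes from.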
9. Parallelism for ML Workloads
The final lecture connects these techniques to feedforward neural networks and logistic regression training.
Core kernels:
- matrix multiplication,
- elementwise activation,
- reduction operations in loss/gradient calculations.
Training loop relies on forward + backward propagation; GPU/TPU acceleration is effective because these operations are high-throughput linear algebra primitives.
10. Design Checklist (Practical)
- Start from correctness and dependency graph.
- Quantify the baseline ($T_s$, bandwidth, cache misses).
- Increase locality before increasing thread/process count.
- For OpenMP: choose schedule based on load variance.
- For MPI: reduce communication frequency/volume and overlap where possible.
- For CUDA: coalesce access, use shared memory tiling, minimize atomics contention.
- Validate speedup with both runtime and efficiency, not runtime alone.
11. Compact Formula Sheet
- Speedup: $S_p=T_s/T_p$.
- Efficiency: $E_p=S_p/p$.
- Amdahl bound: $S_p\le1/(f+(1-f)/p)$.
- GEPP factorization: $PA=LU$.
- Typical matrix multiply work: $O(n^3)$ compute, but performance depends on data movement.
These notes are meant as a concise technical map from multicore CPU programming to distributed computing and GPU acceleration.