Parallel Programming Notes (OpenMP, MPI, CUDA)

1. Big Picture

Parallel performance is constrained by three coupled factors:

  1. Algorithmic parallelism (how much work can be done concurrently),
  2. Data movement (memory hierarchy and communication cost),
  3. Synchronization overhead (dependencies and coordination).

A practical workflow is:

  1. measure a serial baseline and locate hotspots,
  2. expose parallelism along the dependency structure,
  3. optimize locality and data movement,
  4. tune scheduling and synchronization, then re-measure.

2. Performance Metrics and Limits

Let $T_s$ be the best serial time and $T_p$ the parallel time on $p$ processors:

\[S_p = \frac{T_s}{T_p}, \qquad E_p = \frac{S_p}{p}.\]

Amdahl-style bound with serial fraction $f$:

\[S_p \le \frac{1}{f + \frac{1-f}{p}}.\]

Implications:

  - speedup is capped at $1/f$ no matter how large $p$ grows (e.g. $f = 0.05$ caps $S_p$ at 20),
  - efficiency $E_p$ decays with $p$ for a fixed problem size,
  - shrinking the serial fraction often pays more than adding processors.

3. Memory Hierarchy and Locality

Slides emphasize that many programs run far below peak because of memory bottlenecks.

Hierarchy: registers -> L1/L2/L3 cache -> DRAM -> secondary storage.

Optimization principles:

  - temporal locality: reuse data while it is still in cache,
  - spatial locality: prefer unit-stride, contiguous access,
  - blocking/tiling: restructure loops so working sets fit in cache.

For matrix multiplication, naive loops are often bandwidth-limited. Blocking raises cache reuse and moves performance toward compute-bound behavior.


4. Shared-Memory Parallelism (OpenMP)

4.1 Execution model

OpenMP provides directive-based multithreading with shared variables and private thread-local variables.

Core constructs:

  - #pragma omp parallel: spawn a team of threads,
  - #pragma omp for: distribute loop iterations across the team,
  - reduction(op:var): combine per-thread partial results at the join,
  - private/shared clauses: control variable scoping,
  - critical, atomic, barrier: synchronization.

4.2 Scheduling

The schedule clause on a work-sharing for controls iteration distribution:

  - static: contiguous chunks fixed in advance; lowest overhead, best for uniform work,
  - dynamic: chunks handed out on demand; tolerates load imbalance at some overhead,
  - guided: dynamic with geometrically shrinking chunk sizes,
  - runtime: schedule taken from the OMP_SCHEDULE environment variable.

4.3 Dependency considerations

A loop parallelizes safely only when its iterations are independent. Common hazards:

  - loop-carried dependencies (iteration $i$ reads what iteration $i-1$ wrote),
  - races on shared accumulators (fix with reduction or atomic),
  - false sharing: threads writing distinct variables that share a cache line.

5. Distributed-Memory Parallelism (MPI)

5.1 Model

Each process has private memory; explicit communication is required.

Design steps:

  1. partition data,
  2. assign work,
  3. design communication pattern,
  4. minimize communication volume + synchronization.

5.2 Point-to-point and nonblocking

Nonblocking communication allows overlap of communication and computation when dependencies permit.
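
A hedged sketch of the nonblocking pattern in C with MPI (requires an MPI environment and at least two ranks; the rank ^ 1 pairing is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Exchange one boundary value with a partner rank while independent
   interior computation proceeds between the posts and the wait. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send = (double)rank, recv = -1.0;
    int partner = rank ^ 1;            /* pair ranks 0-1, 2-3, ... */
    MPI_Request reqs[2];

    if (partner < size) {
        MPI_Isend(&send, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
        /* ... computation that does not depend on recv goes here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    printf("rank %d received %g\n", rank, recv);
    MPI_Finalize();
    return 0;
}
```

The overlap only helps if there is genuinely independent work to place between the posts and the wait.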

5.3 Collectives and data movement

Collectives structure frequent communication motifs; derived datatypes reduce manual packing/unpacking for non-contiguous data.


6. Parallel Dense Linear Algebra Case Study: GEPP

Gaussian Elimination with Partial Pivoting (GEPP):

\[PA = LU.\]

Partial pivoting improves practical numerical stability by selecting large column pivots.

Performance issue: naive elimination relies heavily on BLAS-1/2 operations; modern hardware favors BLAS-3.

Key optimization from slides:

  - factor a narrow panel of columns (BLAS-2 work confined to the panel),
  - apply the accumulated transformation to the trailing submatrix as one matrix-matrix multiply (BLAS-3), so most flops run at GEMM speed.

This is the recurring HPC theme: reformulate for higher arithmetic intensity.
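
The blocked step can be written out explicitly (standard blocked-LU notation, not taken verbatim from the slides):

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}
\]

\[
\begin{aligned}
A_{11} &= L_{11} U_{11} && \text{(panel factorization, BLAS-2)} \\
U_{12} &= L_{11}^{-1} A_{12} && \text{(triangular solve)} \\
L_{21} &= A_{21} U_{11}^{-1} && \text{(triangular solve)} \\
A_{22} - L_{21} U_{12} &= L_{22} U_{22} && \text{(trailing update: one GEMM, BLAS-3)}
\end{aligned}
\]

The trailing update dominates the flop count, which is why reformulating it as GEMM raises arithmetic intensity.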


7. CUDA Programming Fundamentals

7.1 Host-device model

CUDA applications split into:

  - host code on the CPU: allocation, host-device transfers, kernel launches,
  - device code on the GPU: kernels executed by many lightweight threads.

Kernel launch:

Kernel<<<numBlocks, blockSize>>>(...);

7.2 Thread organization and SIMT

Threads are grouped into blocks; blocks form a grid. Mapping 1D/2D problems to grid/block geometry is central for performance and correctness.
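
A minimal kernel sketch showing the standard 1D global-index computation (the names vecAdd and n are illustrative):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Each thread handles one element; the guard covers n not being
    // a multiple of the block size.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch with enough blocks to cover all n elements:
// int blockSize = 256;
// int numBlocks = (n + blockSize - 1) / blockSize;
// vecAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
```

For 2D problems the same idea applies per dimension, using blockIdx.y, blockDim.y, and threadIdx.y.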

7.3 Memory behavior

Global DRAM bandwidth is often limiting. Performance requires:

  - coalesced access: consecutive threads in a warp touch consecutive addresses,
  - shared-memory staging to reuse data within a block,
  - enough resident threads (occupancy) to hide memory latency,
  - avoiding divergent branches within a warp.

8. CUDA Patterns: Reduction, Histogram, Matrix Multiplication

8.1 Reduction

Reduction (sum/max/min) combines elements with an associative operator, which permits tree-structured evaluation:

Parallel reduction gives $O(\log n)$ depth with enough threads.

8.2 Histogram

Histogram updates can create high contention.

Techniques in slides:

  - privatization: per-block (or per-thread) sub-histograms in fast memory,
  - atomic additions where concurrent updates to a bin are unavoidable,
  - a final merge of the private copies into the global histogram.

8.3 Matrix multiplication on GPU

Naive kernel performs many global reads and is memory-bound. Tiled shared-memory kernel reuses subblocks and greatly reduces DRAM traffic.
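
A hedged sketch of the tiled kernel (TILE is a tunable assumption; matrices are assumed square with n a multiple of TILE):

```cuda
#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Each block marches across one tile-row of A and tile-column of B,
    // staging TILE x TILE subblocks in shared memory so each global
    // element is read once per block instead of once per thread.
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                  // wait until the tile is loaded
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // done reading this tile
    }
    C[row * n + col] = sum;
}
```

Each global element is now read n/TILE times instead of n times, cutting DRAM traffic by a factor of TILE.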


9. Parallelism for ML Workloads

The final lecture connects these techniques to feedforward neural networks and logistic regression training.

Core kernels:

  - dense matrix-vector and matrix-matrix products (forward and backward passes),
  - elementwise activations and their derivatives,
  - reductions for losses, normalization, and gradient accumulation.
The training loop alternates forward and backward propagation; GPU/TPU acceleration is effective because both phases reduce to high-throughput linear algebra primitives.


10. Design Checklist (Practical)

  1. Start from correctness and dependency graph.
  2. Quantify a baseline ($T_s$, bandwidth, cache-miss rates).
  3. Increase locality before increasing thread/process count.
  4. For OpenMP: choose schedule based on load variance.
  5. For MPI: reduce communication frequency/volume and overlap where possible.
  6. For CUDA: coalesce access, use shared memory tiling, minimize atomics contention.
  7. Validate speedup with both runtime and efficiency, not runtime alone.

11. Compact Formula Sheet

\[S_p = \frac{T_s}{T_p}, \qquad E_p = \frac{S_p}{p}, \qquad S_p \le \frac{1}{f + \frac{1-f}{p}}.\]

Arithmetic intensity (flops per byte of memory traffic) indicates whether a kernel is compute-bound or bandwidth-bound:

\[\text{AI} = \frac{\text{flops}}{\text{bytes moved}}.\]

These notes are meant as a concise technical map from multicore CPU programming to distributed computing and GPU acceleration.