Rendering note: The worked examples (in repo EXAMPLES.md) use Mermaid diagrams and LaTeX math blocks. They render natively on GitHub and in VS Code with the appropriate extensions. If diagrams or equations are not rendering correctly, view the files on GitHub.

You are an ML engineer at a water utility company. Your team has deployed an anomaly-detection pipeline (a PyTorch autoencoder) to NVIDIA Jetson edge devices at remote pump stations. The Python code works, but it runs on CPU by default.

Management asks: “These Jetsons have GPUs. Why aren’t we using them? GPU should be faster, right?”

Your job: add GPU support, benchmark CPU vs GPU, and discover the answer.

Spoiler: for this tiny model, the answer is surprising.

You will:

  1. Install GPU-enabled PyTorch on a Jetson (follow the setup guide for your JetPack version)
  2. Add a --device argument to the detection pipeline
  3. Write a benchmarking script that compares CPU vs GPU with proper timing
  4. Run a batch size sweep to understand GPU utilization
  5. Reflect on the results and explain why GPU is slower for tiny models

GPU & CUDA Background

What Is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform. It lets you run code on the GPU’s thousands of cores instead of the CPU’s handful of cores.

PyTorch uses CUDA transparently: you move tensors to the GPU with .to("cuda"), and all operations on those tensors run on the GPU. The model, the data, and the loss computation all need to be on the same device.

device = torch.device("cuda")          # select GPU
model = MyModel().to(device)           # model parameters → GPU memory
x = torch.randn(32, 6).to(device)     # input tensor → GPU memory
output = model(x)                      # computation happens on GPU
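
As a concrete illustration of the same-device rule, here is a minimal sketch (reusing the hypothetical MyModel from the snippet above) of the most common mistake: the model is on the GPU but the input was never moved.

model = MyModel().to("cuda")   # parameters live in GPU memory
x = torch.randn(32, 6)         # input tensor still on the CPU
output = model(x)              # raises a RuntimeError about tensors on different devices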

CPU vs GPU: When Does GPU Win?

GPUs are not always faster. Which side wins depends on the workload:

Factor | CPU wins | GPU wins
Model size | Tiny (< 10K params) | Large (> 100K params)
Batch size | Small (< 32) | Large (> 256)
Data transfer | Dominates compute time | Negligible vs compute
Operation type | Sequential, branchy | Parallel, regular (matrix multiply)

The key insight: GPU operations have overhead — kernel launches, memory transfers, synchronization. If the actual computation is tiny (like our ~1,400 parameter autoencoder), this overhead dominates and the GPU is slower than CPU.

Think of it like shipping a single letter by cargo ship vs walking it next door. The cargo ship has massive capacity, but the loading/unloading overhead makes it slower for small payloads.
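
You can see this overhead directly with a rough sketch like the following (assumes CUDA is available; the numbers will vary by device): timing many tiny GPU operations shows a per-call cost dominated by launch overhead, not arithmetic.

import time
import torch

x = torch.randn(8, 8, device="cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(1000):
    y = x @ x                                # each 8x8 matmul is trivially small
torch.cuda.synchronize()
print((time.perf_counter() - t0) / 1000)     # per-call time is mostly kernel-launch overhead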

Jetson Unified Memory

Desktop GPUs (RTX 4090, A100) have discrete memory — separate RAM chips on the GPU card. Moving data between CPU RAM and GPU RAM requires a PCIe transfer.

Jetson devices use unified memory — the CPU and GPU share the same physical RAM. This means:

  • No PCIe transfer bottleneck (data is already “there”)
  • But the GPU still has kernel launch overhead
  • And the GPU must still synchronize before you can read results on CPU
  • Memory bandwidth is shared between CPU and GPU (contention)

This makes Jetson an interesting benchmarking platform: the GPU overhead you measure is almost entirely kernel launch + synchronization, not data transfer.
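
One way to observe the shared pool (a small sketch, not required for the TODOs) is to query the device properties; on Jetson the reported total memory reflects the RAM shared with the CPU rather than a separate GPU memory pool.

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                        # e.g. "Orin" or "NVIDIA Thor"
print(props.total_memory / 1e9, "GB")    # shared CPU/GPU RAM, not discrete GPU memory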

Asynchronous GPU Execution and Timing

GPU operations in PyTorch are asynchronous. When you call model(x) on CUDA, Python returns immediately — the GPU work is merely queued, not finished.

# WRONG — measures time to LAUNCH, not time to FINISH
t0 = time.perf_counter()
output = model(x)                    # returns instantly! GPU still working!
elapsed = time.perf_counter() - t0   # measures ~0.001 sec (launch time only)

# RIGHT — measures actual GPU computation time
torch.cuda.synchronize()             # wait for any prior GPU work
t0 = time.perf_counter()
output = model(x)
torch.cuda.synchronize()             # wait for THIS work to finish
elapsed = time.perf_counter() - t0   # measures real GPU time

Without torch.cuda.synchronize(), your GPU benchmarks will be wildly inaccurate (usually showing GPU as impossibly fast).

Batch Size and GPU Utilization

A GPU has thousands of cores. If your batch has 32 samples and each sample only needs a few multiply-adds, most cores sit idle. Larger batches give the GPU more work to do in parallel:

Batch size 32:   [████░░░░░░░░░░░░]  ~10% GPU utilization
Batch size 256:  [████████████░░░░]  ~70% GPU utilization
Batch size 1024: [████████████████]  ~95% GPU utilization

But even at 100% utilization, a tiny model may still be slower on GPU than CPU because the per-kernel overhead is a larger fraction of total time.
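
A minimal sketch of a per-batch-size timing loop (illustrative only, assuming a model and device set up as in the earlier snippets; the real sweep lives in benchmarkGPU.py):

import time
import torch

for batchSize in (32, 256, 1024):
    x = torch.randn(batchSize, 6, device=device)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        _ = model(x)                                    # one forward pass per batch size
    torch.cuda.synchronize()
    print(f"batch {batchSize}: {(time.perf_counter() - t0) * 1e3:.3f} ms")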

Warmup Runs

The first GPU operation in a PyTorch session is slow because CUDA must:

  1. Initialize the GPU context (~0.5–2 sec on Jetson)
  2. JIT-compile CUDA kernels for your specific operations
  3. Allocate GPU memory pools

Always do a warmup run before timing. Discard the first run’s timing. Real-world deployment doesn’t pay this cost repeatedly — only at startup.
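
In code, the warmup is just one extra untimed pass (a sketch, assuming model and x are already on the GPU):

with torch.no_grad():
    _ = model(x)                # warmup: pays context init, kernel compilation, allocation
torch.cuda.synchronize()        # make sure the warmup has fully finished
# ... start timed runs here ...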


Repository Structure

pumpwatchGPU/
├── data/
│   ├── pumpWatchTelemetry_30min_50Hz.csv
│   ├── pumpWatchTelemetry_manifest.json
│   ├── pumpWatchTelemetry2_30min_50Hz.csv
│   └── pumpWatchTelemetry2_manifest.json
├── scripts/
│   ├── detectAnomalies.py          ← working pipeline (TODO 2: add --device)
│   └── benchmarkGPU.py            ← stub (TODO 3a–3e: write benchmark)
├── logs/                           ← output directory
├── requirements.txt                ← Python dependencies
├── reflection.txt                  ← TODO 5: fill with real numbers
├── EXAMPLES.md                     ← diagrams and visual explanations
├── JETSON_SETUP_JP7.md            ← setup guide for JetPack 7.x (e.g. athor01)
├── JETSON_SETUP_JP5.md            ← setup guide for JetPack 5.x (e.g. aorin01)
└── README.md                       ← this file

TODOs

TODO 1: Install GPU-Enabled PyTorch on Jetson

Follow the setup guide for your machine’s JetPack version. Check your hostname first:

hostname

Then follow the matching guide:

Hostname | JetPack | Setup Guide
athor01 | 7.1 (L4T R38, CUDA 13.0, Python 3.12) | in repo: JETSON_SETUP_JP7.md
aorin01 | 5.1.2 (L4T R35, CUDA 11.4, Python 3.8) | in repo: JETSON_SETUP_JP5.md

Not on one of these machines? Run head -1 /etc/nv_tegra_release to identify your L4T version, then pick the matching guide:

L4T Version | JetPack Version | Where to find setup info
R38.x | JetPack 7.x | In repo: JETSON_SETUP_JP7.md
R36.x | JetPack 6.x | See the Jetson AI Lab JP6 index
R35.x | JetPack 5.x | In repo: JETSON_SETUP_JP5.md

When done, verify:

python3 -c "import torch; print(torch.cuda.is_available())"
# Must print: True

If this prints False, stop: the remaining TODOs require CUDA.

TODO 2: Add --device Argument to detectAnomalies.py

The pipeline currently auto-detects the device inside trainModel() and reconstructionMse(). Your task is to give the user explicit control.

What to add:

  1. Argument: Add --device to the argparse block with choices=["cpu", "cuda", "auto"] and default="auto"

  2. resolveDevice() function: Takes the user’s choice string and returns a torch.device (see the sketch after this list):

    • "auto" → CUDA if available, else CPU
    • "cpu" → always CPU
    • "cuda" → CUDA, but raise an error if torch.cuda.is_available() is False
  3. Thread the device into trainModel(): Remove the inline device = torch.device("cuda" if ... else "cpu") line from trainModel() and add a device parameter to its signature instead. Pass the resolved device from main().

    Note: reconstructionMse() already gets the device from the model via next(model.parameters()).device, so it does not need a signature change — once the model is on the right device, inference follows automatically.

  4. Print at startup: After resolving, print which device is being used:

    Using device: cuda (<GPU name from torch.cuda.get_device_name()>)
    

    Examples: Using device: cuda (Orin) on aorin01, Using device: cuda (NVIDIA Thor) on athor01.
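
One possible shape for the new pieces (a sketch under assumed names, not the required implementation; parser and args refer to whatever the script’s existing argparse block uses):

parser.add_argument("--device", choices=["cpu", "cuda", "auto"], default="auto")

def resolveDevice(choice):
    # "auto": prefer CUDA when present; "cuda": fail loudly if it is not available
    if choice == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if choice == "cuda" and not torch.cuda.is_available():
        raise RuntimeError("--device cuda requested but CUDA is not available")
    return torch.device(choice)

device = resolveDevice(args.device)
if device.type == "cuda":
    print(f"Using device: cuda ({torch.cuda.get_device_name(0)})")
else:
    print("Using device: cpu")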

Verify:

python3 scripts/detectAnomalies.py --device cpu \
  --csv data/pumpWatchTelemetry_30min_50Hz.csv \
  --manifest data/pumpWatchTelemetry_manifest.json \
  --outputCSV logs/anomalyScores.csv \
  --outputPlot logs/anomalyDetection.png

python3 scripts/detectAnomalies.py --device cuda \
  --csv data/pumpWatchTelemetry_30min_50Hz.csv \
  --manifest data/pumpWatchTelemetry_manifest.json \
  --outputCSV logs/anomalyScores.csv \
  --outputPlot logs/anomalyDetection.png

Both should produce valid output. Results may differ very slightly between CPU and GPU due to floating-point non-determinism.

TODO 3: Write the Benchmarking Script (benchmarkGPU.py)

Open scripts/benchmarkGPU.py. The imports, argument parser framework, and helper functions (prepareData, trainOnDevice, inferOnDevice) are provided. You need to implement 5 functions:

TODO 3a: timePhase(func, device, warmup=True)

Accurately time a function with proper GPU synchronization. See the detailed comment block in the code.
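
The core pattern is the synchronize-time-synchronize sandwich shown earlier; one plausible shape (assuming func is a zero-argument callable) is:

def timePhase(func, device, warmup=True):
    if warmup:
        func()                              # throwaway run: CUDA init, kernel compilation
    if device.type == "cuda":
        torch.cuda.synchronize()            # drain any previously queued GPU work
    t0 = time.perf_counter()
    result = func()
    if device.type == "cuda":
        torch.cuda.synchronize()            # wait for this phase to actually finish
    return result, time.perf_counter() - t0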

TODO 3b: benchmarkOneConfig(trainX, allX, device, ...)

Run the full pipeline (train + inference) on one device and return a timing dict. Reset seeds before each run for reproducibility.
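
The seed reset can be as simple as the following sketch (seed is a hypothetical variable; assumes numpy is imported as np in the stub):

torch.manual_seed(seed)
np.random.seed(seed)
if device.type == "cuda":
    torch.cuda.manual_seed_all(seed)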

TODO 3c: computeSpeedups(cpuResults, gpuResults)

Compute per-phase speedups as CPU mean / GPU mean.
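
If each results dict maps a phase name to summary stats (a hypothetical structure, adapt to whatever benchmarkOneConfig returns), the computation is a one-liner:

speedups = {phase: cpuResults[phase]["mean"] / gpuResults[phase]["mean"]
            for phase in cpuResults}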

TODO 3d: generateReport(cpuResults, gpuResults, speedups, batchSweep)

Format a human-readable text report with timing tables and speedup summary.

TODO 3e: main()

Orchestrate everything: parse args, prepare data, run CPU benchmarks, run GPU benchmarks, compute speedups, optional batch sweep, save JSON + text output.
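
The output step at the end of main() can be as simple as this sketch (assumes args.outputDir from the parser plus results and report objects built earlier; file names match the expected outputs below):

import json
import os

os.makedirs(args.outputDir, exist_ok=True)
with open(os.path.join(args.outputDir, "benchmarkResults.json"), "w") as f:
    json.dump(results, f, indent=2)
with open(os.path.join(args.outputDir, "benchmarkReport.txt"), "w") as f:
    f.write(report)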

Verify:

python3 scripts/benchmarkGPU.py \
  --csv data/pumpWatchTelemetry_30min_50Hz.csv \
  --outputDir logs \
  --runs 3

Should produce logs/benchmarkResults.json and logs/benchmarkReport.txt.

TODO 4: Batch Size Sweep

Run the benchmark with multiple batch sizes:

python3 scripts/benchmarkGPU.py \
  --csv data/pumpWatchTelemetry_30min_50Hz.csv \
  --outputDir logs \
  --runs 3 \
  --batchSizes 32,64,128,256,512,1024

Examine how batch size affects GPU vs CPU timing. Use the results to fill in Section 3 of reflection.txt.

Note: The benchmark includes warmup runs, so each configuration trains the model twice (once discarded). With the full batch sweep this takes approximately 5–10 minutes. This is expected.

Variability: These are shared machines. Other users’ processes, thermal state, and CPU frequency scaling can cause 10–15% run-to-run variability. This is normal — report what you measure and note the standard deviation. If your numbers look very different between runs, try running at an off-peak time or increasing --runs to 5.

TODO 5: Reflection

Fill out every section of reflection.txt with real numbers from your Jetson runs. Replace every [___] placeholder with actual data and analysis.


Setting Up Your Environment

First, identify your machine’s JetPack version by running hostname, then follow the matching setup guide:

Hostname | JetPack | Guide
athor01 | 7.x | in repo: JETSON_SETUP_JP7.md
aorin01 | 5.x | in repo: JETSON_SETUP_JP5.md

Each guide covers the complete setup: creating the venv, installing the correct PyTorch wheel for that JetPack version, verifying GPU access, and troubleshooting.

After completing the setup guide, verify the pipeline runs:

source .venv/bin/activate
python3 scripts/detectAnomalies.py \
  --csv data/pumpWatchTelemetry_30min_50Hz.csv \
  --manifest data/pumpWatchTelemetry_manifest.json \
  --outputCSV logs/anomalyScores.csv \
  --outputPlot logs/anomalyDetection.png

Evaluation Criteria

Criterion | Weight
PyTorch GPU installed and torch.cuda.is_available() == True | 10%
--device argument works correctly (cpu, cuda, auto) | 20%
Benchmark script runs and produces correct output | 30%
Batch size sweep with analysis | 15%
Reflection with real numbers and thoughtful analysis | 25%

Additional Resources

  • PyTorch CUDA Semantics: Official PyTorch documentation explaining how CUDA execution works within PyTorch, including asynchronous behavior, streams, device placement, and memory management. This is the authoritative reference for understanding GPU execution timing, synchronization requirements, and common performance pitfalls when benchmarking.
  • NVIDIA Jetson Developer Guide: Official NVIDIA documentation covering Jetson hardware platforms, software stack components, JetPack SDK details, and system configuration guidance. This resource is essential for understanding architectural constraints, available accelerators, and platform-specific optimization considerations.
  • PyTorch Benchmarking Best Practices: Official PyTorch tutorial describing correct benchmarking methodology, including warmup iterations, CUDA synchronization, timer selection, and reproducibility practices. This is the primary reference for building reliable and scientifically valid performance measurements.
  • Understanding GPU Utilization: NVIDIA technical blog post explaining CUDA occupancy, kernel launch configuration, and how hardware resources influence GPU utilization. This resource provides architectural insight into why measured GPU utilization may not directly correlate with perceived workload intensity.