Memory is All You Need

This week I worked on:

1] Reading the FSDP docs and trying model parallelism on a simple script
2] Integrating FSDP into my current RL training script
3] Updating my resume and applying to a few jobs
4] Reading chapter 1 of DDIA (second edition) + reading group
5] Had a great time pairing with @Robbie Gordan (he) (SP2'26) and @Jacky Lu (he) (SP2'26) on the OpenAI Parameter Golf challenge. Thank you, Robbie, for sharing all your insights from working on this challenge. Also had a nice coffee chat with @Mark Mayes (he) (SP2'26), where I learned about his project using FFTs and Flutter to build an app that maps audio energy to visual movement in the style of Electric Sheep.

GPU-maxxxxing on 2 GPUs at 95% util each.

I have an MFU of 45% and an HFU of 60%, i.e. good model and hardware efficiency.

MFU (Model FLOPs Utilization) is the ratio of the FLOPs per second the model architecture theoretically requires (the minimum for just the forward + backward pass, at the observed training throughput) to the peak FLOPs per second of the GPU.

HFU (Hardware FLOPs Utilization) is the same ratio, but it counts the FLOPs actually executed. That includes the extra work added by memory optimizations such as activation checkpointing, which trades FLOPs for bytes by recomputing activations during the backward pass instead of storing them.
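To make the two definitions concrete, here is a rough sketch of how MFU and HFU could be estimated. It assumes the common approximation of about 6 FLOPs per parameter per token for dense training (2 forward + 4 backward, ignoring attention-score FLOPs) and full activation checkpointing adding one extra forward pass (about 2 more FLOPs per parameter per token). The throughput and peak FLOP/s values are hypothetical placeholders, not measurements from my run, and the 6N rule may not hold exactly for a quantized or partially frozen model.

```python
def model_flops_per_token(n_params: float) -> float:
    """Approximate minimum training FLOPs per token: ~2N forward + ~4N backward."""
    return 6.0 * n_params


def hardware_flops_per_token(n_params: float, recompute_forward: bool = True) -> float:
    """FLOPs actually executed per token.

    Assumption: full activation checkpointing re-runs the forward pass once,
    adding ~2N FLOPs per token on top of the model FLOPs.
    """
    extra = 2.0 * n_params if recompute_forward else 0.0
    return model_flops_per_token(n_params) + extra


def utilization(flops_per_token: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    """Achieved FLOP/s divided by peak FLOP/s."""
    return flops_per_token * tokens_per_sec / peak_flops_per_sec


# Hypothetical numbers, for illustration only.
n_params = 14e9                # 14B parameters
tokens_per_sec = 1000.0        # made-up training throughput
peak_flops_per_sec = 2 * 312e12  # made-up: two GPUs at 312 TFLOP/s each

mfu = utilization(model_flops_per_token(n_params), tokens_per_sec, peak_flops_per_sec)
hfu = utilization(hardware_flops_per_token(n_params), tokens_per_sec, peak_flops_per_sec)

print(f"MFU: {mfu:.1%}  HFU: {hfu:.1%}  MFU/HFU: {mfu / hfu:.2f}")
```

Under these assumptions MFU/HFU is always 6/8 = 0.75, which happens to match 45% / 60%, though the real setup may differ in the details.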

These differ from the 95% GPU utilization: GPU utilization measures the percentage of time at least one GPU kernel was active, which says nothing about efficiency. A tiny kernel could use 1% of the GPU's compute units but take 10ms to run, and GPU utilization would read 100% for those 10ms. So the GPU can look busy while it is mostly idle, waiting for data to arrive from VRAM/HBM. One way to check for this is to look at the percentage of time the GPU spends on memory accesses.
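The tiny-kernel example can be written out as a toy calculation. This is only a sketch: it models each kernel as a (duration, fraction of compute units used) pair, assumes kernels do not overlap, and uses made-up helper names. It is not how any real profiler computes its counters.

```python
def gpu_utilization(kernels, window_ms):
    """Fraction of the window during which at least one kernel was active.

    Assumption: kernels run back to back and never overlap.
    """
    busy_ms = sum(duration for duration, _ in kernels)
    return busy_ms / window_ms


def compute_occupied(kernels, window_ms):
    """Time-weighted fraction of the GPU's compute units that were in use."""
    return sum(duration * fraction for duration, fraction in kernels) / window_ms


# One kernel that runs for 10 ms but only uses 1% of the compute units.
kernels = [(10.0, 0.01)]
window_ms = 10.0

util = gpu_utilization(kernels, window_ms)
occupied = compute_occupied(kernels, window_ms)

print(f"GPU utilization: {util:.0%}, compute units in use: {occupied:.0%}")
```

The first number says the GPU was busy the whole time; the second says almost all of it was sitting idle.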

I could increase efficiency by increasing the batch size (and thus throughput), but I quickly run into GPU out-of-memory errors. I am starting to see some of the memory vs throughput tradeoffs in these systems. Even with a large quantized 14B-parameter model across 2 GPUs, I am still memory bound while RL fine-tuning my computer agent to play Chinese chess.
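For a sense of scale, here is a back-of-the-envelope sketch of how much memory the weights alone take per GPU at different precisions. It assumes the weights are split evenly across the GPUs and ignores quantization metadata, activations, gradients, optimizer state, and any KV cache, which is where the batch size pressure actually comes from. The bit widths are illustrative, not the exact quantization I am using.

```python
def weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Bytes needed to store the weights alone."""
    return n_params * bits_per_param / 8


def per_gpu_weight_gb(n_params: float, bits_per_param: float, n_gpus: int) -> float:
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), assuming an even split."""
    return weight_bytes(n_params, bits_per_param) / n_gpus / 1e9


n_params = 14e9  # 14B parameters
n_gpus = 2

for bits in (16, 8, 4):  # illustrative precisions
    gb = per_gpu_weight_gb(n_params, bits, n_gpus)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB per GPU")
```

Whatever is left after the weights has to hold everything that grows with batch size and sequence length, which is why bigger batches hit out-of-memory errors so quickly.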

It turns out "Memory is All You Need".

Next

GPU programming and more