W3D2 - Impossible Stuff Day

I got my previously trained world model working in a distributed training setup (2 GPUs, 1 node), and it runs end to end locally!! This is data parallelism, which means each GPU sees a distinct slice of the data while the model is replicated on both GPUs.
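
The "distinct slice per GPU" idea can be sketched in plain Python. This mimics the strided sharding that PyTorch's `DistributedSampler` does (without shuffling); the `shard` helper is illustrative, not code from the actual project:

```python
def shard(indices, rank, world_size):
    """Strided split: rank r takes indices r, r + world_size, r + 2*world_size, ...
    Every rank gets a disjoint slice, and together they cover the dataset."""
    return indices[rank::world_size]

dataset = list(range(10))
print(shard(dataset, 0, 2))  # → [0, 2, 4, 6, 8]
print(shard(dataset, 1, 2))  # → [1, 3, 5, 7, 9]
```

Each rank then runs the same model on its own slice, and gradients are averaged across ranks so the replicas stay identical.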

  • I started by checking that I still had the 100GB of training data, then copied it to my server to benchmark single-GPU training time and confirm that the model trained without error. I used Tailscale to connect my laptop, where the data was sitting, to my home server while on an external network.

  • Looked through some guides on DDP implementation and chatted with Gemini to build a mental model of what needs to change. Then I went into my current training loop and made the changes needed to turn it into a distributed training loop.

  • Next, I created a dummy dataset with the same shape as my actual dataset but much smaller, so I could test the distributed training quickly. To test the orchestration locally, I ran with the "Gloo" communication backend instead of "NCCL", since Gloo is hardware-agnostic and does not require GPUs.

It's good to build in public. I was worried that I would have nothing much to present by the end of the day, but I guess that's not the point; the point is to challenge myself and see what happens. And in this case, my task was not that hard after all... just unknown at the time.
