Distributed training of a GPT model (part 2): pipeline parallelism, Megatron-LM tensor parallelism and communication quantization

Bruno Magalhaes · Aug. 30, 2023
This post follows from the previous post Distributed training of a GPT model using DeepSpeed. There we discussed that an ML model allows for three dimensions of parallelism: data, pipeline and tensor (model). The previous post covered distributed data parallelism and sharded data parallelism; here we will discuss pipeline and tensor (model) parallelism. 3D parallelism aims at partitioning (color-coded) compute resources across the 3D space of data, pipeline and tensor (model) dimensions.
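
As a rough mental model, every GPU in the cluster receives one coordinate in this 3D grid. The sketch below is a minimal, illustrative enumeration of such a mapping (the parallelism degrees are assumptions chosen for a hypothetical 16-GPU setup, not values from the post):

```python
# Minimal sketch: partitioning a GPU pool across the three parallelism
# dimensions. The degrees are illustrative; the only constraint is
# world_size == data_degree * pipeline_degree * tensor_degree.
import itertools

data_degree, pipeline_degree, tensor_degree = 2, 4, 2
world_size = data_degree * pipeline_degree * tensor_degree  # 16 GPUs

# Assign each global rank a (data, pipeline, tensor) coordinate in the grid.
coords = itertools.product(range(data_degree),
                           range(pipeline_degree),
                           range(tensor_degree))
for rank, (d, p, t) in enumerate(coords):
    print(f"rank {rank:2d} -> data replica {d}, pipeline stage {p}, tensor slice {t}")
```

Ranks that share a pipeline-stage and tensor-slice coordinate form a data-parallel group; ranks that share a data and tensor coordinate form one pipeline; ranks that share a data and pipeline coordinate split a layer's tensors among themselves.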