Pipeline parallelism splits a model "vertically" by layer. It is also possible to split certain operations "horizontally" within a layer, which is usually called tensor parallel training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix by a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns: it is possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum the results. With either strategy, the weight matrix is sliced into evenly sized "shards", each shard is hosted on a different GPU, and that shard is used to compute the relevant part of the overall matrix product before communicating to combine the results.
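A minimal single-process sketch of the two sharding strategies, using NumPy arrays to stand in for per-GPU shards (the shapes and shard count are illustrative assumptions, and the concatenate/sum steps stand in for the all-gather/all-reduce communication a real setup would perform):

```python
import numpy as np

# Toy illustration (single process): two ways to shard Y = X @ W across "GPUs".
batch, d_in, d_out, n_shards = 4, 8, 8, 2
X = np.random.randn(batch, d_in)
W = np.random.randn(d_in, d_out)

# Strategy 1: split W by columns -> each shard computes independent dot products;
# the per-shard outputs are concatenated (an all-gather in a real multi-GPU setup).
col_shards = np.split(W, n_shards, axis=1)
Y_col = np.concatenate([X @ shard for shard in col_shards], axis=1)

# Strategy 2: split W by rows (and X by columns) -> each shard computes a partial
# sum of every dot product; the partial results are summed (an all-reduce).
row_shards = np.split(W, n_shards, axis=0)
X_shards = np.split(X, n_shards, axis=1)
Y_row = sum(x @ w for x, w in zip(X_shards, row_shards))

# Both strategies reproduce the full matrix product.
assert np.allclose(Y_col, X @ W) and np.allclose(Y_row, X @ W)
```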
One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer's self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-contiguous layers to each device, reducing bubble overhead at the cost of more network communication.
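As a rough sketch of the pattern Megatron-LM popularized for the MLP block, the first projection can be sharded by columns and the second by rows, so each shard runs its slice of the MLP locally and only one combine step is needed at the end. The shapes, shard count, and GELU approximation below are illustrative assumptions, not Megatron-LM's actual code:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, commonly used in Transformer MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff, n_shards = 8, 32, 2
X = np.random.randn(4, d_model)
W1 = np.random.randn(d_model, d_ff)   # first MLP projection
W2 = np.random.randn(d_ff, d_model)   # second MLP projection

# Shard W1 by columns and W2 by rows: each "GPU" holds one column shard of W1
# and the matching row shard of W2, so the intermediate activation never needs
# to be gathered (GELU is elementwise, so the column split passes through it).
W1_shards = np.split(W1, n_shards, axis=1)
W2_shards = np.split(W2, n_shards, axis=0)

# Each shard computes its partial MLP output; a single sum (an all-reduce in a
# real multi-GPU setup) combines the results.
partials = [gelu(X @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
Y = sum(partials)

assert np.allclose(Y, gelu(X @ W1) @ W2)
```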
Sometimes the input to the network can be parallelized across a dimension that has a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing computation to proceed with more granularly sized examples.
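As a rough illustration, the sketch below splits a sequence over time for a position-wise operation, where the chunks are fully independent and only a fraction of the intermediate activation is ever materialized at once; the function names and shapes are assumptions for this example only, and operations that mix positions (such as self-attention) would need extra communication between chunks:

```python
import numpy as np

def position_wise_mlp(x, W1, W2):
    # A per-token operation: each time step is processed independently.
    return np.maximum(x @ W1, 0) @ W2

seq_len, d_model, d_ff, n_chunks = 16, 8, 32, 4
x = np.random.randn(seq_len, d_model)
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)

# Sequence-parallel sketch: split the input along the time axis and process
# each chunk separately, so the (seq_len, d_ff) intermediate activation only
# exists for seq_len / n_chunks positions at a time.
chunks = np.split(x, n_chunks, axis=0)
y = np.concatenate([position_wise_mlp(c, W1, W2) for c in chunks], axis=0)

assert np.allclose(y, position_wise_mlp(x, W1, W2))
```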