
Compute Requirements

\[C \approx \tau T \approx 6PD\]

where \(C\) is the total training compute in FLOPs, \(\tau\) is the aggregate hardware throughput, \(T\) is the training time, \(P\) is the number of model parameters, and \(D\) is the dataset size in tokens. The forward pass contributes \(C_{\text{forward}} \approx 2PD\) and the backward pass \(C_{\text{backward}} \approx 4PD\).

  • Actual FLOPs fall short of the theoretical peak FLOPs advertised for GPU accelerators. Theoretical FLOPs are never achieved in practice, especially in distributed settings.

  • “Chinchilla Scaling Laws” propose that \(D = 20P\) is the compute-optimal ratio of dataset size to parameter count, but this only holds when using 1,000 GPUs for 1 hour costs the same as using 1 GPU for 1,000 hours, i.e., when your goal is to maximize performance while minimizing the GPU-hour cost of training.

  • However, some practitioners recommend instead determining what inference cost is acceptable for your use case and then training the largest model that fits within that inference budget.

  • For any throughput under 115 TFLOP/s/A100, it is safe to assume that something is wrong with your model or hardware configuration (a back-of-the-envelope sketch follows this list).
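
As a back-of-the-envelope illustration of \(C \approx 6PD\) and the throughput heuristic above, here is a minimal Python sketch. The 7B parameter count, the Chinchilla-style token count, the 256-GPU cluster, and the 115 TFLOP/s/A100 throughput are all assumed values for illustration, not recommendations.

```python
# Back-of-the-envelope estimate of training compute and GPU-hours via C ≈ 6PD.
# All concrete numbers here are assumptions chosen only for illustration.

params = 7e9                       # P: parameter count (assumed 7B model)
tokens = 20 * params               # D: Chinchilla-style heuristic, D ≈ 20P
total_flops = 6 * params * tokens  # C ≈ 6PD total training FLOPs

per_gpu_flops = 115e12             # assumed sustained throughput per A100
n_gpus = 256                       # assumed cluster size

wall_clock_s = total_flops / (per_gpu_flops * n_gpus)
gpu_hours = wall_clock_s * n_gpus / 3600

print(f"total compute : {total_flops:.2e} FLOPs")
print(f"wall-clock    : {wall_clock_s / 3600:.1f} hours on {n_gpus} GPUs")
print(f"GPU-hours     : {gpu_hours:,.0f}")
```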

Memory Requirements

  • You need to know how much space the model takes up in bytes. This tells you how large a model will fit on your local GPU for inference, or how large a model you can train across your cluster given a certain amount of total accelerator memory.

  • Most transformers are trained in mixed precision, either fp16 + fp32 or bf16 + fp32, which cuts down the amount of memory required to train the model.

  • int8 memory = \((1 \text{ byte/param}) \times \text{No. params}\)
  • fp16, bf16 memory = \((2 \text{ bytes/param}) \times \text{No. params}\)
  • fp32 memory = \((4 \text{ bytes/param}) \times \text{No. params}\)

  • Automatic Mixed Precision (AMP) for deep learning: training in half precision while maintaining the network accuracy achieved with single precision. This speeds up math-intensive operations, such as linear and convolution layers, as well as memory-limited operations, by accessing half the bytes compared to single precision. It also reduces memory requirements for training, enabling larger models or larger minibatches.

  • Optimizers: Adam is magic, but it is highly memory inefficient, as it has to keep additional state for every parameter: an fp32 copy of the parameters plus momentum and variance estimates.

  • Vanilla AdamW memory = 4 bytes/param (fp32 copy of params) + 4 bytes/param (momentum) + 4 bytes/param (variance) = 12 bytes/param
  • bitsandbytes 8-bit optimizer memory = 4 bytes/param (fp32 copy of params) + 1 byte/param (momentum) + 1 byte/param (variance) = 6 bytes/param
  • SGD-like optimizer (with momentum) memory = 4 bytes/param (fp32 copy of params) + 4 bytes/param (momentum) = 8 bytes/param

  • If gradients are stored in fp16, they take 2 bytes/param (4 bytes/param if stored in fp32).

  • Modern GPUs are typically bottlenecked by memory, not FLOPs, for LLM training. Therefore, activation recomputation/checkpointing is an extremely popular method of trading memory costs for extra compute costs.

  • Activation recomputation/checkpointing works by recomputing activations in certain layers instead of storing them in GPU memory. The reduction in memory depends on how selective we are in choosing which layers' activations to recompute rather than store.
  • Combining all of this, a good heuristic for the memory needed for training is: model memory + optimizer memory + activation memory + gradient memory (see the sketch below).
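
Putting the rules of thumb above together, here is a minimal sketch of that heuristic. The 7B parameter count is assumed, and the activation term is a placeholder: real activation memory depends on architecture, sequence length, batch size, and how aggressively checkpointing is applied.

```python
# Rough training-memory estimate: model + gradients + optimizer + activations.
# The 7B parameter count and the activation figure are assumptions for illustration.

params = 7e9
GIB = 1024**3

model_mem     = 2 * params            # bf16/fp16 weights: 2 bytes/param
gradient_mem  = 2 * params            # fp16/bf16 gradients: 2 bytes/param
optimizer_mem = (4 + 4 + 4) * params  # vanilla AdamW: fp32 copy + momentum + variance

# Activation memory varies widely with sequence length, batch size, and
# checkpointing strategy; treat this number as a placeholder, not a formula.
activation_mem = 20 * GIB

total = model_mem + gradient_mem + optimizer_mem + activation_mem
print(f"model       : {model_mem / GIB:6.1f} GiB")
print(f"gradients   : {gradient_mem / GIB:6.1f} GiB")
print(f"optimizer   : {optimizer_mem / GIB:6.1f} GiB")
print(f"activations : {activation_mem / GIB:6.1f} GiB (assumed)")
print(f"total       : {total / GIB:6.1f} GiB")
```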

Distributed Training

  • Data parallelism: split the data among (possibly model-parallel) replicas of the model
  • Pipeline or Tensor/Model parallelism: these parallelism schemes split the parameters of the model across GPUs
  • ZeRO: shards optimizer states (and optionally gradients and parameters) across data-parallel workers
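
To make the data-parallel scheme above concrete, here is a toy single-process sketch that mimics a gradient all-reduce across two replicas in plain NumPy. The tiny linear model and synthetic data are made up for illustration, and no real distributed backend is involved; ZeRO goes further by also sharding the optimizer state (and optionally gradients and parameters) across these replicas.

```python
import numpy as np

# Toy data-parallel step: each "replica" holds the same weights, computes
# gradients on its own shard of the batch, and the gradients are averaged
# (the all-reduce step) before a single synchronized update.

rng = np.random.default_rng(0)
w = rng.normal(size=4)                    # shared model: y_hat = x @ w

x = rng.normal(size=(8, 4))               # global batch of 8 examples
y = x @ np.array([1.0, -2.0, 0.5, 3.0])   # synthetic targets

n_replicas = 2
x_shards = np.split(x, n_replicas)        # each replica sees half the batch
y_shards = np.split(y, n_replicas)

# Local gradient of the mean-squared error on each replica's shard.
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    err = xs @ w - ys
    local_grads.append(2 * xs.T @ err / len(xs))

avg_grad = np.mean(local_grads, axis=0)   # "all-reduce" (average) the gradients
w -= 0.1 * avg_grad                       # identical update on every replica

print("averaged gradient:", avg_grad)
print("updated weights:  ", w)
```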

4D parallelism in Llama 3.1 405B Model

  • Tensor parallelism: splits individual weight tensors across multiple devices, allowing the model to leverage the combined memory of multiple GPUs
  • Pipeline parallelism: divides the model vertically into stages by layers, enabling different devices to process these stages in parallel and creating an efficient pipeline for data flow through the model
  • Context parallelism: divides the input context into segments, which is crucial for enabling long-context understanding (it reduces the memory bottleneck for very long sequence lengths)
  • Data parallelism: also known as Fully Sharded Data Parallelism (FSDP), it shards the model's parameters across data-parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, the computation for each micro-batch of data is still local to each GPU worker.
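
As a toy illustration of the tensor-parallel idea above (not Llama 3.1's actual implementation), the sketch below splits a weight matrix column-wise across two simulated devices in NumPy and checks that concatenating the partial outputs reproduces the unsharded matmul; the shapes are arbitrary assumptions.

```python
import numpy as np

# Toy column-wise tensor parallelism: the weight matrix W is split across two
# simulated "devices"; each computes a slice of the output, and concatenating
# the slices matches the unsharded result. (Illustration only, no real devices.)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))        # activations: (batch, hidden)
W = rng.normal(size=(8, 16))       # full weight matrix: (hidden, ffn)

W0, W1 = np.split(W, 2, axis=1)    # shard columns across "device 0" and "device 1"

y0 = x @ W0                        # partial output on device 0
y1 = x @ W1                        # partial output on device 1

y_parallel = np.concatenate([y0, y1], axis=1)  # all-gather along the sharded dim
y_full = x @ W                                 # reference: unsharded computation

assert np.allclose(y_parallel, y_full)
print("sharded and unsharded outputs match, shape:", y_parallel.shape)
```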
