A Reading List for MLSys
An Overview of Distributed Methods (Papers With Code)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
https://arxiv.org/abs/1910.02054
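ZeRO removes the memory redundancy of plain data parallelism by partitioning optimizer state (and, in later stages, gradients and parameters) across the data-parallel ranks. Below is a toy, single-process sketch of the stage-1 idea, with made-up sizes and simulated ranks instead of real torch.distributed collectives:

```python
# Toy sketch of the ZeRO stage-1 idea: each simulated "rank" stores only its
# 1/N slice of the parameters and Adam moments, updates that slice, and the
# full parameter vector is re-assembled by concatenation (standing in for an
# all-gather). Sizes and hyperparameters are illustrative only.
import numpy as np

world_size = 4
dim = 16                                  # total parameter count (toy)
shard = dim // world_size

full_params = np.ones(dim, dtype=np.float32)
param_shards = [full_params[r * shard:(r + 1) * shard].copy() for r in range(world_size)]
m_shards = [np.zeros(shard, np.float32) for _ in range(world_size)]
v_shards = [np.zeros(shard, np.float32) for _ in range(world_size)]

grads = np.random.randn(dim).astype(np.float32)   # pretend these were all-reduced
lr, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8

for r in range(world_size):               # each rank updates only its own slice
    g = grads[r * shard:(r + 1) * shard]
    m_shards[r] = b1 * m_shards[r] + (1 - b1) * g
    v_shards[r] = b2 * v_shards[r] + (1 - b2) * g * g
    param_shards[r] -= lr * m_shards[r] / (np.sqrt(v_shards[r]) + eps)

full_params = np.concatenate(param_shards)        # stands in for an all-gather
print(full_params.shape)                          # (16,)
```

In a real run the concatenation is an all-gather of the updated parameter shards, so no rank ever materializes more than its own slice of the optimizer state.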
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
https://arxiv.org/abs/2104.04473
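This paper is about composing tensor, pipeline, and data parallelism ("3D parallelism"). The bookkeeping behind any such configuration is just a factorization of the GPU count; the numbers below are illustrative, not prescriptive:

```python
# How the three parallelism degrees multiply up to the total GPU count.
world_size = 3072                 # e.g. 384 nodes x 8 GPUs (illustrative)
tensor_parallel = 8               # within a node
pipeline_parallel = 12            # across nodes
assert world_size % (tensor_parallel * pipeline_parallel) == 0
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
print(data_parallel)              # 32 data-parallel replicas of the model
```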
Reducing Activation Recomputation in Large Transformer Models
https://arxiv.org/abs/2205.05198
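The paper studies selective recomputation and sequence parallelism inside Megatron-LM; the underlying trade of compute for activation memory can be tried with PyTorch's generic checkpointing API. A minimal sketch (not the paper's selective scheme):

```python
# Activation checkpointing: the block's activations are not kept during the
# forward pass and are recomputed during backward, saving memory at the cost
# of an extra forward through the block.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)   # recomputed on backward
y.sum().backward()
print(x.grad.shape)                             # torch.Size([8, 1024])
```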
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
https://arxiv.org/abs/1909.08053
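The core trick is splitting each transformer matmul across GPUs, e.g. a column-parallel linear layer followed by a row-parallel one. A toy single-process sketch of the column split, with the concatenation standing in for the cross-rank communication:

```python
# Megatron-style column-parallel linear layer, simulated on one process:
# the weight is split column-wise across two "ranks", each rank computes its
# slice of the output, and the slices are concatenated.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)             # (batch, in_features)
w = torch.randn(8, 6)             # full weight (in_features, out_features)

w0, w1 = w.chunk(2, dim=1)        # column split across 2 tensor-parallel ranks
y0 = x @ w0                       # computed on rank 0
y1 = x @ w1                       # computed on rank 1
y_parallel = torch.cat([y0, y1], dim=1)

assert torch.allclose(y_parallel, x @ w)   # matches the unsharded layer
```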
Fully Sharded Data Parallel: faster AI training with fewer GPUs
Engineering at Meta (blog post)
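FSDP is PyTorch's built-in ZeRO-3-style wrapper: parameters, gradients, and optimizer state are sharded across ranks and gathered on the fly. A minimal sketch assuming a multi-GPU host launched with torchrun; the model and sizes are placeholders:

```python
# Wrap a model in FullyShardedDataParallel; run with
#   torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)               # params, grads, optimizer state are sharded

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()
loss.backward()
optim.step()
```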
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
https://arxiv.org/abs/2006.16668
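GShard scales Mixture-of-Experts transformers by sharding the experts across devices and routing tokens to them. A toy top-2 router on a single device, without the capacity limits, load-balancing loss, or expert sharding of the real system:

```python
# Toy top-2 Mixture-of-Experts routing: a router scores experts per token and
# each token's output is a weighted sum of its two highest-scoring experts.
import torch

torch.manual_seed(0)
num_experts, d_model, tokens = 4, 16, 8
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)

x = torch.randn(tokens, d_model)
gate_logits = router(x)                                   # (tokens, experts)
weights, idx = gate_logits.softmax(-1).topk(2, dim=-1)    # top-2 per token
weights = weights / weights.sum(-1, keepdim=True)         # renormalise the pair

out = []
for t in range(tokens):
    out.append(sum(weights[t, k] * experts[int(idx[t, k])](x[t]) for k in range(2)))
out = torch.stack(out)
print(out.shape)                                          # torch.Size([8, 16])
```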
GSPMD: General and Scalable Parallelization for ML Computation Graphs
https://arxiv.org/abs/2105.04663
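GSPMD is the XLA SPMD partitioner: you annotate a few tensors with shardings and the compiler propagates them through the rest of the graph. A minimal JAX sketch (JAX's jit lowers to GSPMD); on a single CPU or GPU it simply builds a one-device mesh:

```python
# Annotate the input's sharding and let the compiler propagate it to outputs.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(-1)
mesh = Mesh(devices, axis_names=("data",))

x = jnp.ones((8, 1024))
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))   # shard batch dim

@jax.jit
def f(x):
    return x @ jnp.ones((1024, 256))

print(f(x).sharding)   # sharding propagated by the SPMD partitioner
```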
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
https://arxiv.org/abs/2004.13336v1