What is the Zero Redundancy Optimizer (ZeRO)?

The Zero Redundancy Optimizer (ZeRO) is integrated into the DeepSpeed library and reduces the memory required to train large deep-learning models with many billions of parameters. ZeRO was developed by Microsoft and is freely available under an open-source license. It has been used, among other things, to train GPT-2 models and Turing-NLG 17B.

Zero Redundancy Optimizer is the name of a set of optimization techniques developed by Microsoft for training very large deep-learning models. The optimizer is part of the Python library DeepSpeed and is available free of charge under an open-source license. ZeRO optimizes memory consumption in data-parallel and model-parallel training, which makes it possible to train very large models on a realistic number of compute nodes and their GPUs. Model sizes can range from billions up to trillions of parameters.

The Zero Redundancy Optimizer has been used, for example, to train GPT-2 models and Turing-NLG 17B. Using the optimizer requires little or no change to the model code. In theory, the Zero Redundancy Optimizer is designed for deep-learning models with more than a trillion parameters.

Motivation for the Development of DeepSpeed and ZeRO:

Training large deep-learning models places high demands on the memory and compute capacity of machines and GPUs. Current models such as language models have an enormous number of parameters, in some cases many billions. Models of this size are far too large to fit on a single GPU. Training them requires more complex architectures with many GPUs and techniques such as data and model parallelism. Yet even classic data and model parallelism reaches its limits. Efficient training calls for solutions that reduce the redundancies in data and model parallelism and scale across many compute resources. DeepSpeed and the Zero Redundancy Optimizer provide solutions to these requirements. ZeRO is designed to scale to large models with up to a trillion parameters.


The Zero Redundancy Optimizer is used together with DeepSpeed. DeepSpeed is an open-source deep-learning library built on the PyTorch framework. It was developed by Microsoft and released in 2020. The library is published under the MIT open-source license and is available on GitHub. With DeepSpeed, large deep-learning models can be trained effectively and efficiently on the available compute capacity.

One of the main advantages of DeepSpeed is that it makes it possible to train large models with many parameters. DeepSpeed works with parallelization techniques and uses the Zero Redundancy Optimizer to distribute large models with billions of parameters across the GPUs of the compute nodes for training.
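As a rough illustration of how ZeRO is typically switched on in DeepSpeed, the following sketch builds a DeepSpeed-style configuration as a plain Python dictionary. The key names (`zero_optimization`, `stage`, `fp16`, `optimizer`, `train_micro_batch_size_per_gpu`) follow DeepSpeed's JSON configuration format. The helper `build_zero_config` and all concrete values are assumptions chosen for illustration and do not come from this article.

```python
import json


def build_zero_config(stage=2, micro_batch_size=4, learning_rate=1e-4):
    """Return a DeepSpeed-style config dict with ZeRO enabled.

    Hypothetical helper: every value here is an illustrative placeholder.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": learning_rate}},
        "fp16": {"enabled": True},
        # stage 1: partition the optimizer states across GPUs
        # stage 2: additionally partition the gradients
        # stage 3: additionally partition the model parameters
        "zero_optimization": {"stage": stage},
    }


config = build_zero_config(stage=2)
# The dict serializes to the JSON form that DeepSpeed config files use.
print(json.dumps(config, indent=2))
```

In a training script, a dictionary like this would usually be handed to `deepspeed.initialize` through its `config` argument, or saved as a JSON file. The model definition itself stays as it is.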

Basic Functionality of the Zero Redundancy Optimizer:

The goal of the Zero Redundancy Optimizer is to optimize the memory consumption of large models so that efficient, fast training becomes possible. The optimizer's techniques build on existing approaches such as data parallelism (DP) and model parallelism (MP) and optimize them. Accordingly, the Zero Redundancy Optimizer distinguishes between a ZeRO-DP and a ZeRO-MP optimizer.

Classic data parallelism is used when the model fits on a single GPU. The model is replicated on every GPU, and each replica is trained on a different portion of the data. Model parallelism is used when the model does not fit on a single GPU. Classic model parallelism has the disadvantage that efficiency drops sharply as soon as inter-GPU communication has to go beyond the limited number of GPUs in a single compute node. The limiting factor is the comparatively low network bandwidth between nodes.

The Zero Redundancy Optimizer overcomes the limits of data and model parallelism by reducing redundancies and lowering memory requirements. Put simply, it partitions the model and the model states across the GPUs of the available compute nodes instead of merely replicating them, as the classic DP and MP mechanisms do. Each GPU holds one partition of the model states and is trained with its own part of the data. Redundancy is avoided, and the memory and performance requirements placed on the GPUs for training are reduced.
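The effect of partitioning instead of replicating can be made concrete with a small back-of-the-envelope calculation. The sketch below is not taken from this article. It assumes mixed-precision training with the Adam optimizer, where each parameter costs 2 bytes for fp16 weights, 2 bytes for fp16 gradients and 12 bytes for fp32 optimizer states, as in the memory analysis published with ZeRO. Activations, temporary buffers and fragmentation are ignored, and the function name is hypothetical.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU in bytes.

    Assumption: mixed-precision Adam (2 + 2 + 12 bytes per parameter).
    stage 0 models classic data parallelism with full replication;
    stages 1 to 3 partition progressively more state across the GPUs.
    """
    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 weights, momentum, variance
    if stage == 0:
        return weights + grads + optim
    if stage == 1:             # optimizer states partitioned
        return weights + grads + optim / num_gpus
    if stage == 2:             # gradients partitioned as well
        return weights + (grads + optim) / num_gpus
    if stage == 3:             # parameters partitioned as well
        return (weights + grads + optim) / num_gpus
    raise ValueError("stage must be 0, 1, 2 or 3")


# Illustrative sizes only: a 7.5-billion-parameter model on 64 GPUs.
for stage in range(4):
    gigabytes = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Under these assumptions, full replication needs about 120 GB of model states per GPU, while full partitioning across 64 GPUs needs about 1.9 GB. This is the redundancy the optimizer is designed to remove.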

With the Zero Redundancy Optimizer, the size of the deep-learning model that can be trained scales in proportion to the number of compute devices. On current hardware, models with a trillion parameters become trainable. Existing implementations of the Zero Redundancy Optimizer have already demonstrated their performance: large models with over 100 billion parameters were trained on up to 400 GPUs with a throughput of 15 petaflops.
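The proportional relationship between model size and device count can be sketched in a few lines. This is a simplified estimate under stated assumptions, not a sizing rule from the article. It assumes that all model states are partitioned evenly across the GPUs and that each parameter costs 16 bytes (mixed-precision Adam). Activation memory and communication buffers are ignored, and both function names are hypothetical.

```python
BYTES_PER_PARAM = 16  # assumption: fp16 weights and gradients plus fp32 Adam states


def max_params_for_cluster(num_gpus, usable_bytes_per_gpu):
    """Largest parameter count whose model states fit when fully partitioned."""
    return (num_gpus * usable_bytes_per_gpu) // BYTES_PER_PARAM


def min_gpus_for_model(num_params, usable_bytes_per_gpu):
    """Smallest GPU count whose combined memory holds all model states."""
    total_bytes = num_params * BYTES_PER_PARAM
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // usable_bytes_per_gpu)


budget = 16 * 10**9  # illustrative: 16 GB per GPU reserved for model states
print(min_gpus_for_model(10**12, budget))     # GPUs for a trillion parameters
print(max_params_for_cluster(1000, budget))   # parameters that fit on 1000 GPUs
```

Doubling the number of GPUs doubles the result of `max_params_for_cluster`, which is the proportional scaling described above. Real deployments need additional headroom for activations and buffers.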
