Microsoft open-sources DeepSpeed Chat on GitHub, simplifying the training of ChatGPT-style models with hundreds of billions of parameters

The time and capital cost of training a large language model like GPT is extremely high, beyond the reach of ordinary people and most enterprises. Microsoft has open-sourced DeepSpeed Chat on GitHub, cutting the cost and time of training models with hundreds of billions of parameters by a factor of 15. Taking a 175-billion-parameter model as an example, 64 NVIDIA A100 GPUs can complete the training on Azure in about 20 hours, at a cost of roughly $5,120.

Large language models, while powerful, cannot generate content tailored to an organization's unique needs. They also typically rely on third-party cloud services, so companies can only use non-sensitive data with them. Enterprises therefore want to train their own models and apply them across a wider range of business scenarios.

Microsoft's open-source DeepSpeed Chat helps users train large language models themselves. Built on DeepSpeed's deep learning optimizations, it provides both training and inference capabilities. A single script executes the full Reinforcement Learning from Human Feedback (RLHF) pipeline, the core technique used to train the ChatGPT model.
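At the time of the DeepSpeed Chat release, that single script lived in the DeepSpeedExamples repository; the sketch below follows the launch commands from that release (the model names, flags, and repository layout may have changed since):

```shell
# Fetch the DeepSpeed Chat examples and install dependencies
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat
pip install deepspeed
pip install -r requirements.txt

# One script drives all three RLHF steps (SFT, reward model
# fine-tuning, RLHF) for an OPT-13B actor and an OPT-350M reward
# model on a single node.
python train.py \
  --actor-model facebook/opt-13b \
  --reward-model facebook/opt-350m \
  --deployment-type single_node
```

The `--deployment-type` flag selects the hardware profile (single GPU, single node, or multi-node), which is how the same script scales from small experiments to the large runs discussed below.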

DeepSpeed Chat provides three capabilities:

Simplified training and enhanced inference for ChatGPT-style models: a single script covers multiple training steps, from loading a Hugging Face pre-trained model to running InstructGPT-style training with the DeepSpeed-RLHF system.

DeepSpeed-RLHF pipeline: reproduces the training regime of the InstructGPT paper, including supervised fine-tuning (SFT), reward model fine-tuning, and RLHF.

DeepSpeed-RLHF system: combines DeepSpeed's training engine and inference engine into a hybrid engine (DeepSpeed-HE) for RLHF training. DeepSpeed-HE can switch between inference and training modes during RLHF, drawing on optimizations from DeepSpeed-Inference such as tensor parallelism and high-performance CUDA kernels, while applying ZeRO- and LoRA-based memory optimizations during training.
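To make the reward model fine-tuning step concrete, here is a minimal, framework-free sketch of the pairwise preference loss used in InstructGPT-style reward training: the reward model should score the human-preferred response above the rejected one, and the loss is `-log(sigmoid(r_chosen - r_rejected))`. The scalar scores here stand in for real model outputs; actual implementations compute them with the reward model over token sequences.

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """InstructGPT-style reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the reward model scores the human-preferred
    response increasingly far above the rejected one.
    """
    margin = r_chosen - r_rejected
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A larger margin in favour of the preferred answer yields a smaller loss.
loose = pairwise_preference_loss(1.0, 0.5)   # small margin, higher loss
tight = pairwise_preference_loss(3.0, -1.0)  # large margin, lower loss
```

Minimizing this loss over a dataset of human preference pairs is what turns a pre-trained model into the reward model that later guides the RLHF stage.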

DeepSpeed-HE makes training fast and inexpensive. For example, training a 13-billion-parameter model on Azure with 8 NVIDIA A100 GPUs takes about 9 hours and costs only about $290; scaling the model to 30 billion parameters on the same 8 GPUs takes about 18 hours and $580. Training a larger model with 175 billion parameters on 64 NVIDIA A100 GPUs takes about 20 hours and $5,120.
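The quoted figures are consistent with a flat rate of roughly $4 per A100 GPU-hour, an illustrative price assumed here rather than an official Azure quote; the cost is simply GPUs × wall-clock hours × hourly rate:

```python
# Rough sanity check of the quoted Azure training costs, assuming an
# illustrative flat rate of ~$4 per A100 GPU-hour (not an official price).
RATE_PER_GPU_HOUR = 4.0

def training_cost(gpus: int, hours: float, rate: float = RATE_PER_GPU_HOUR) -> float:
    """Estimated cost in USD: number of GPUs x wall-clock hours x hourly rate."""
    return gpus * hours * rate

estimates = {
    "13B":  training_cost(8, 9),    # ~$288, quoted as about $290
    "30B":  training_cost(8, 18),   # ~$576, quoted as about $580
    "175B": training_cost(64, 20),  # $5,120, matching the quoted figure
}
```

The close match across all three model sizes suggests the article's cost numbers were derived from wall-clock time at a single per-GPU rate.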

DeepSpeed Chat significantly lowers the barriers to entry for companies to train large models on their own. Enterprises can integrate industry knowledge and unique enterprise needs into the model to support more innovative AI applications.