Acceleration methods for large models

accelerate

Accelerate supports training with DeepSpeed on a single GPU or on multiple GPUs.

Installation

pip install accelerate

Code changes

Initialize the accelerator object

from accelerate import Accelerator
accelerator = Accelerator()

Use the device provided by the accelerator object in place of the device specified manually in PyTorch
```
device = accelerator.device
```

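Replace the print function. Without this change every process prints its own copy of each message, so hand printing over to the accelerator, which prints only once.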
```
accelerator.print("Creating model")
```
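In some special cases, for example when the output function used inside metric_logger has to change, pass the accelerator's print function into it and replace every internal print call with that function (named custom_print here).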
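As a sketch of the custom_print idea, a logger can take its output function as a constructor argument instead of calling the built-in print directly. The MetricLogger below is a simplified, hypothetical stand-in for the project's own metric_logger, whose real interface is not shown on this page. In the training script you would pass custom_print=accelerator.print so that only the main process prints.

```python
class MetricLogger:
    """Hypothetical, minimal logger whose output function is injectable."""

    def __init__(self, delimiter="  ", custom_print=print):
        self.delimiter = delimiter
        self.custom_print = custom_print  # e.g. accelerator.print in a real script
        self.meters = {}

    def update(self, **kwargs):
        # Store the latest value of each named metric.
        for name, value in kwargs.items():
            self.meters[name] = float(value)

    def log(self, step):
        # Format all metrics on one line and emit it through custom_print.
        parts = [f"step {step}"]
        parts += [f"{name}: {value:.4f}" for name, value in self.meters.items()]
        self.custom_print(self.delimiter.join(parts))


# Stand-in for accelerator.print so the sketch runs without a distributed setup.
captured = []
logger = MetricLogger(custom_print=captured.append)
logger.update(loss=0.5, lr=0.001)
logger.log(step=10)
```

Keeping the built-in print as the default means the logger still works in scripts that do not use Accelerate.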

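Replace the distributed-environment initialization code: drop the previous distributed initialization module and set a random seed for each process.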

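# utils.init_distributed_mode(args)  # no longer needed: Accelerate sets up the distributed environment
# Fix the seed for reproducibility
seed = args.seed
if accelerator.num_processes > 1:
    # In a distributed run, offset the seed by the global process index so that every process gets a unique seed
    seed += accelerator.process_index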
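The seeding logic can be sketched as a small helper. This is a minimal illustration that uses only the standard library: per_process_seed is a hypothetical name, and process_index and num_processes stand in for accelerator.process_index and accelerator.num_processes. In a real script the resulting seed would also have to be applied to torch and numpy, for example through accelerate.utils.set_seed.

```python
import random


def per_process_seed(base_seed, process_index, num_processes):
    """Return a seed that differs between processes in a distributed run."""
    if num_processes > 1:
        return base_seed + process_index
    return base_seed


# Simulate 4 processes that share the same base seed (args.seed in the script).
seeds = [per_process_seed(42, rank, 4) for rank in range(4)]

# Each process seeds its own RNG; here one Random instance plays each process.
first_draws = [random.Random(s).random() for s in seeds]
print(seeds)
```

Distinct seeds matter mostly for data augmentation and dropout, where identical random streams on every GPU would reduce the variety seen per step.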

Replace the save/load state functions

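import os
import torch
from accelerate import load_checkpoint_and_dispatch

# Code for loading the model. If load_checkpoint_and_dispatch has been modified, this code must be adapted to match it.
if args.checkpoint:
    # Restore the model, optimizer, and RNG states saved by accelerator.save_state
    accelerator.load_state(os.path.join(args.checkpoint, 'states'))
    # Restore the extra bookkeeping saved next to the states
    others_checkpoint = torch.load(os.path.join(args.checkpoint, 'others.pth'))
    start_epoch = others_checkpoint['epoch']
    config = others_checkpoint['config']
    model = load_checkpoint_and_dispatch(model, os.path.join(os.path.dirname(args.checkpoint), 'model.safetensors'))
    accelerator.print('resume checkpoint from %s' % args.checkpoint)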

# Code for saving the model and the training state
ckpt_dir = os.path.join(args.output_dir, f"ckpt_{epoch:02d}")
accelerator.save_state(os.path.join(ckpt_dir, 'states'), safe_serialization=True)
# model.save_checkpoint(os.path.join(ckpt_dir, 'states'))
# Extra information that save_state does not cover
save_obj = {
    'config': config,
    'epoch': epoch,
}
accelerator.save(save_obj, os.path.join(ckpt_dir, 'others.pth'))
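Because the save code and the load code must agree on the directory layout, it can help to build the paths in one place. The helper below is a hypothetical sketch (checkpoint_paths is not part of Accelerate) that reproduces the ckpt_XX/states and ckpt_XX/others.pth layout used by the save code. It only builds paths and performs no I/O.

```python
import os


def checkpoint_paths(output_dir, epoch):
    """Build the paths used for one epoch's checkpoint (hypothetical helper)."""
    ckpt_dir = os.path.join(output_dir, f"ckpt_{epoch:02d}")
    return {
        "root": ckpt_dir,
        # Directory passed to accelerator.save_state / accelerator.load_state
        "states": os.path.join(ckpt_dir, "states"),
        # File passed to accelerator.save / torch.load
        "others": os.path.join(ckpt_dir, "others.pth"),
    }


paths = checkpoint_paths("output", epoch=3)
print(paths["states"])
```

When resuming, passing paths["root"] as args.checkpoint matches the load code, which reads the states directory and others.pth from under args.checkpoint.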

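Re-wrap the model, the optimizer, and the dataloaders (and the learning-rate scheduler, if one is used) with accelerator.prepare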

model, optimizer, train_loader, test_loader = accelerator.prepare(
        model, optimizer, train_loader, test_loader)

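Backward pass

# Run the forward pass under the mixed-precision context managed by Accelerate
with accelerator.autocast():
    loss_mlm, loss_ita, loss_itm = model(image, text_input, alpha=alpha)
    loss = loss_mlm + loss_ita + loss_itm

# accelerator.backward replaces loss.backward()
accelerator.backward(loss)
optimizer.step()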

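Remove PyTorch's built-in DDP code, along with any other distributed-training code in the original script.
Replace the multi-process synchronization functions: both the check for the main process and the wait-for-synchronization call need to be replaced.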
```
accelerator.wait_for_everyone()
accelerator.is_main_process
```

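Generate the configuration files

Before running accelerate, make sure that accelerate, deepspeed, and CUDA are all installed in your environment with compatible versions.

Before starting training, accelerate itself has to be configured (you may need to install deepspeed first).

First run this command in the terminal: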

accelerate config

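It then asks a series of questions and generates a YAML file from your answers.

About the questions:

The configuration used here is for a single machine with multiple GPUs. The gradient accumulation steps are usually set between 1 and 8. bf16 is recommended on A100; use fp16 on other machines.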
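When choosing the gradient accumulation steps it helps to look at the resulting effective batch size. DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. The numbers below (8 GPUs, a micro batch of 16 per GPU) are illustrative assumptions, not values taken from this setup.

```python
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Number of samples that contribute to one optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Assumed hardware: 8 GPUs, 16 samples per GPU per forward pass.
batch_sizes = {steps: effective_batch_size(16, steps, 8) for steps in (1, 2, 4, 8)}
for steps, size in batch_sizes.items():
    print(f"accumulation={steps}: effective batch size {size}")
```

Raising the accumulation steps increases the effective batch size without increasing memory use per GPU, at the cost of fewer optimizer steps per epoch.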

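Finally, move the generated YAML file into the project.

A JSON configuration file (zero_stage2.json) also has to be added. Some of its parameters (for example gradient_accumulation_steps and fp16) can be adjusted to suit your own experiments.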

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
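Since the precision and accumulation settings change between experiments, one option is to generate this JSON from a small script instead of editing it by hand. The sketch below is an assumption about workflow, not part of Accelerate or DeepSpeed: make_zero2_config is a hypothetical helper that mirrors the JSON above and swaps the fp16 block for DeepSpeed's bf16 block when bf16 is requested.

```python
import json


def make_zero2_config(precision="fp16", grad_accum_steps=8):
    """Return a ZeRO stage 2 config dict (hypothetical helper)."""
    if precision not in ("fp16", "bf16"):
        raise ValueError("precision must be 'fp16' or 'bf16'")
    config = {
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": True,
            "allgather_bucket_size": 2e8,
            "overlap_comm": True,
            "reduce_scatter": True,
            "contiguous_gradients": True,
        },
        "gradient_accumulation_steps": grad_accum_steps,
        "gradient_clipping": 1.0,
        "steps_per_print": 100,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }
    if precision == "fp16":
        config["fp16"] = {
            "enabled": True,
            "loss_scale": 0,  # 0 selects dynamic loss scaling
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    else:
        config["bf16"] = {"enabled": True}
    return config


# Example: a bf16 variant with 4 accumulation steps, serialized as JSON text.
config_text = json.dumps(make_zero2_config("bf16", grad_accum_steps=4), indent=4)
print(config_text)
```

The resulting text can be written to zero_stage2.json in the project's configuration folder.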

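Both configuration files should be placed in a dedicated folder inside the project.

Once fp16 or bf16 is enabled, the image must be cast to the matching dtype before it is fed into the model

image = image.half()  # fp16; with bf16 use image.bfloat16() instead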

Launch

accelerate launch --config_file ./dsconfigs/default.yaml Pretrain.py
