
python - How to solve dist.init_process_group hanging (or deadlocking)?


I was trying to set up DDP (distributed data parallel) on a DGX A100, but it doesn't work. Whenever I try to run it, it hangs. The code is super simple: it just spawns 4 processes for 4 gpus (for the sake of debugging I simply destroy the group immediately, but it doesn't even get that far):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for process to communicate/coordinate with each other/master. It's a backend library.

    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_DISABLE=1

    https://stackoverflow.com/questions/61075390/about-pytorch-nccl-error-unhandled-system-error-nccl-version-2-4-8

    https://pytorch.org/docs/stable/distributed.html#common-environment-variables
    """
    if rank != -1:  # -1 rank indicates serial code
        print(f'setting up rank={rank} (with world_size={world_size})')
        # MASTER_ADDR = 'localhost'
        MASTER_ADDR = '127.0.0.1'
        MASTER_PORT = find_free_port()
        # set up the master's ip address so this child process can coordinate
        os.environ['MASTER_ADDR'] = MASTER_ADDR
        print(f"{MASTER_ADDR=}")
        os.environ['MASTER_PORT'] = MASTER_PORT
        print(f"{MASTER_PORT}")

        # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
        if torch.cuda.is_available():
            # unsure if this is really needed
            # os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
            # os.environ['NCCL_IB_DISABLE'] = '1'
            backend = 'nccl'
        print(f'{backend=}')
        # Initializes the default distributed process group, and this will also initialize the distributed package.
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend, rank=rank, world_size=world_size)
        # dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
        print(f'--> done setting up rank={rank}')
        dist.destroy_process_group()

mp.spawn(setup_process, args=(4,), world_size=4)
Why does this hang?
nvidia-smi output:
$ nvidia-smi
Fri Mar 5 12:47:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 26C P0 51W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 25C P0 52W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:47:00.0 Off | 0 |
| N/A 25C P0 51W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:4E:00.0 Off | 0 |
| N/A 25C P0 51W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB On | 00000000:87:00.0 Off | 0 |
| N/A 30C P0 52W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB On | 00000000:90:00.0 Off | 0 |
| N/A 29C P0 53W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | 0 |
| N/A 29C P0 52W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | 0 |
| N/A 48C P0 231W / 400W | 7500MiB / 40537MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 7 N/A N/A 147243 C python 7497MiB |
+-----------------------------------------------------------------------------+
How do I set up DDP on this new machine?
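One generic way to see where a hang like this is stuck is to turn on NCCL's own logging before the group is initialized. The sketch below is only a debugging aid, not part of the run above; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and setting them in the parent is an assumption about the simplest place to do it:

import os

# Debugging sketch: make NCCL print its bootstrap/connection steps so a hanging
# dist.init_process_group at least reports what it is waiting for. Spawned child
# processes inherit the parent's environment, so setting these before mp.spawn
# is enough for single-node runs.
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'INIT'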

Update
Btw, I've successfully installed APEX, because some other links say to do that, but it still fails. For completeness, this is what I did:
went to https://github.com/NVIDIA/apex and followed their instructions
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
but before the above I had to update gcc:
conda install -c psi4 gcc-5
It did install (I can import apex successfully), but it didn't help.

Now it actually prints an error message:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 252, in train
setup_process(rank, world_size=opts.world_size)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/distributed.py", line 85, in setup_process
dist.init_process_group(backend, rank=rank, world_size=world_size)
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/miranda9/miniconda3/envs/metalearning/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

During handling of the above exception, another exception occurred:

Related:
  • https://github.com/pytorch/pytorch/issues/9696
  • https://discuss.pytorch.org/t/dist-init-process-group-hangs-silently/55347/2
  • https://forums.developer.nvidia.com/t/imagenet-hang-on-dgx-1-when-using-multiple-gpus/61919
  • apex suggestion: https://discourse.mozilla.org/t/hangs-on-dist-init-process-group-in-distribute-py/44686
  • https://github.com/pytorch/pytorch/issues/15638
  • https://github.com/pytorch/pytorch/issues/53395

Best Answer

The following fixes are based on Writing Distributed Applications with PyTorch, Initialization Methods.

Issue 1:
It hangs unless you pass nprocs=world_size to mp.spawn(). In other words, it is waiting for the "whole world" of processes to show up before it proceeds.
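To make Issue 1 concrete, here is a minimal sketch of just the corrected call, assuming the setup_process(rank, world_size) from the question is defined in the same module (the complete, working version follows further down):

import torch.multiprocessing as mp

# mp.spawn passes the process index (the rank) as the first argument on its own,
# so args only lists the remaining parameters. nprocs must equal world_size,
# otherwise init_process_group waits forever for ranks that were never started.
# Note: this fixes Issue 1 only; Issue 2 (a single shared MASTER_PORT) still applies.
world_size = 4
mp.spawn(setup_process, args=(world_size,), nprocs=world_size)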

Issue 2:
MASTER_ADDR and MASTER_PORT need to be the same in each process's environment, and they need to be a free address:port combination on the machine that runs the rank 0 process.
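In other words, the port has to be chosen exactly once (in the parent) and the same value handed to every rank; the question's code instead called find_free_port() inside each spawned process, so every rank exported a different MASTER_PORT and the rendezvous could never complete. A minimal sketch of the intended pattern, with a hypothetical export_rendezvous helper:

import os

def export_rendezvous(master_addr: str, master_port: str) -> None:
    # Hypothetical helper: every rank runs this with the SAME values handed down
    # from the parent, so all processes agree on where the rank-0 store lives.
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port

# In the parent, before spawning (find_free_port as defined in the question):
#   master_port = find_free_port()   # picked exactly once
#   mp.spawn(worker, args=('127.0.0.1', master_port, world_size), nprocs=world_size)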

Both of these are implied by, or read directly from, the following quote from the link above (emphasis added):

Environment Variable

We have been using the environment variable initialization method throughout this tutorial. By setting the following four environment variables on all machines, all processes will be able to properly connect to the master, obtain information about the other processes, and finally handshake with them.

MASTER_PORT: A free port on the machine that will host the process with rank 0.

MASTER_ADDR: IP address of the machine that will host the process with rank 0.

WORLD_SIZE: The total number of processes, so that the master knows how many workers to wait for.

RANK: Rank of each process, so they will know whether it is the master or a worker.



Here is some code demonstrating both of these:
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
import os

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    import socket
    from contextlib import closing

    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])


def setup_process(rank, master_addr, master_port, world_size, backend='gloo'):
    print(f'setting up {rank=} {world_size=} {backend=}')

    # set up the master's ip address so this child process can coordinate
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    print(f"{master_addr=} {master_port=}")

    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    print(f"{rank=} init complete")
    dist.destroy_process_group()
    print(f"{rank=} destroy complete")


if __name__ == '__main__':
    world_size = 4
    master_addr = '127.0.0.1'
    master_port = find_free_port()
    mp.spawn(setup_process, args=(master_addr, master_port, world_size,), nprocs=world_size)
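The quote above also lists WORLD_SIZE and RANK; if all four variables are exported, init_process_group can be called with init_method='env://' and no explicit rank/world_size arguments, since it reads them from the environment. A minimal single-node sketch under that assumption (the hard-coded port is a placeholder; a busy port would have to be replaced, e.g. with find_free_port() as above):

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def setup_via_env(rank, master_addr, master_port, world_size, backend='gloo'):
    # Export all four variables from the tutorial quote; with init_method='env://'
    # the rank and world size are read from the environment rather than passed in.
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    os.environ['WORLD_SIZE'] = str(world_size)
    os.environ['RANK'] = str(rank)
    dist.init_process_group(backend, init_method='env://')
    print(f'{dist.get_rank()=} {dist.get_world_size()=}')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4
    # 29500 is the conventional PyTorch default master port; this value is a placeholder.
    mp.spawn(setup_via_env, args=('127.0.0.1', '29500', world_size), nprocs=world_size)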

Regarding python - how to solve dist.init_process_group hanging (or deadlocking)?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66498045/
