
python - Ray cluster configuration file_mounts section prevents worker nodes from starting

Reposted. Author: 行者123. Updated: 2023-12-02 03:15:03

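I am trying to use the file_mounts block in the configuration file to distribute a small number of files to every node of a Ray cluster on AWS EC2:

file_mounts: {"./": "./run_files"}

The cluster comes up with only the head node, and the contents of the run_files directory are copied to that head node correctly. However, the two requested worker nodes never start. If I omit the file_mounts section, the workers start. The Ray monitor reports a problem locating the file libtcl.so in the matplotlib subdirectory of the Anaconda3 installation. That file is at the correct path on the head node, so it looks as if the setup on the worker nodes is not working properly: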

$ ray exec ray_conf.yaml  'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
2019-05-29 19:36:14,019 INFO updater.py:95 -- NodeUpdater: Waiting for IP of i-073950262949fe9a8...
2019-05-29 19:36:14,019 INFO log_timer.py:21 -- NodeUpdater: i-073950262949fe9a8: Got IP [LogTimer=362ms]
2019-05-29 19:36:14,025 INFO updater.py:272 -- NodeUpdater: Running tail -n 100 -f /tmp/ray/session_*/logs/monitor* on 54.175.173.233...
==> /tmp/ray/session_2019-05-29_23-35-49_842129_4407/logs/monitor.err <==
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/monitor.py", line 376, in <module>
    redis_password=args.redis_password)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/monitor.py", line 54, in __init__
    self.load_metrics)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 349, in __init__
    self.reload_config(errors_fatal=True)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 523, in reload_config
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 516, in reload_config
    new_config["worker_start_ray_commands"]
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 790, in hash_runtime_conf
    add_content_hashes(local_path)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 778, in add_content_hashes
    add_hash_of_file(fpath)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 764, in add_hash_of_file
    with open(fpath, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './anaconda3/pkgs/matplotlib-2.1.0-py36hba5de38_0/lib/libtcl.so'

==> /tmp/ray/session_2019-05-29_23-35-49_842129_4407/logs/monitor.out <==
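
The traceback shows the monitor failing while it hashes the contents of a mounted path: add_content_hashes receives a local path, and add_hash_of_file opens a file under it. The sketch below is not Ray's code. It is a simplified stand-in (hash_tree and demo are hypothetical names) showing that a directory walk of this kind raises FileNotFoundError as soon as it meets an entry that cannot be opened, for example a dangling symlink. Whether libtcl.so was a dangling symlink in this case is an assumption; the log does not confirm it.

```python
import hashlib
import os
import tempfile


def hash_tree(root):
    """Hash every file under root, loosely mimicking the walk in the traceback.

    Hypothetical helper for illustration, not Ray's implementation.
    """
    digest = hashlib.sha1()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            fpath = os.path.join(dirpath, name)
            # open() raises FileNotFoundError if fpath cannot be resolved.
            with open(fpath, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


def demo():
    """Return the name of the exception raised for a tree with a dangling symlink."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "ok.txt"), "w") as f:
            f.write("data")
        # Assumed failure mode: a symlink whose target does not exist.
        os.symlink(os.path.join(root, "missing.so"),
                   os.path.join(root, "libtcl.so"))
        try:
            hash_tree(root)
        except FileNotFoundError as exc:
            return type(exc).__name__
    return None
```

If the walk starts at a home directory rather than a small project directory, it touches every file there, including package caches, which may explain why an unrelated matplotlib path shows up in the error.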

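(Note that this question is a follow-up to "Workers not being returned on EC2 by ray"; I am continuing in a new question because the source of the error has now been identified more specifically.)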

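Best answer

I think the libtcl.so error message is quite misleading. The problem is that the file_mounts remote path cannot be the worker's home directory (neither ./ nor ~/ works); it must be a subdirectory. So the following succeeds: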

file_mounts: {"~/run_files": "./run_files"}
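
As a rough pre-flight check, the rule above can be sketched in Python. This is a minimal illustration, not part of Ray: remote_is_home and check_file_mounts are hypothetical helpers. It assumes the file_mounts mapping has already been loaded into a dict of remote path to local path, and that a relative remote path such as ./ resolves against the home directory, as the answer implies.

```python
import posixpath

# Normalized spellings that name the home directory itself (assumption:
# relative remote paths are resolved against the remote user's home).
HOME_ALIASES = {".", "~"}


def remote_is_home(remote_path):
    """Return True if remote_path names the home directory itself."""
    return posixpath.normpath(remote_path) in HOME_ALIASES


def check_file_mounts(file_mounts):
    """Return the remote paths in file_mounts that point at the home directory."""
    return [remote for remote in file_mounts if remote_is_home(remote)]


# The failing and the working configuration from this question.
bad = check_file_mounts({"./": "./run_files"})
good = check_file_mounts({"~/run_files": "./run_files"})
```

Here bad lists "./" as a problem and good is empty. The check only catches the ./ and ~/ spellings; an absolute path to the home directory would pass unnoticed.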

Regarding "python - Ray cluster configuration file_mounts section prevents worker nodes from starting", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56370163/
