
kubernetes - Dask keeps failing and workers die while running TPOT

Reposted. Author: 行者123. Updated: 2023-12-02 12:05:12

I am running TPOT with Dask on a Kubernetes cluster on GCP. The cluster has 24 cores and 120 GB of memory across 4 Kubernetes nodes. My Kubernetes YAML is:

apiVersion: v1
kind: Service
metadata:
  name: daskd-scheduler
  labels:
    app: daskd
    role: scheduler
spec:
  ports:
  - port: 8786
    targetPort: 8786
    name: scheduler
  - port: 8787
    targetPort: 8787
    name: bokeh
  - port: 9786
    targetPort: 9786
    name: http
  - port: 8888
    targetPort: 8888
    name: jupyter
  selector:
    app: daskd
    role: scheduler
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-scheduler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: daskd
        role: scheduler
    spec:
      containers:
      - name: scheduler
        image: uyogesh/daskml-tpot-gcpfs # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: ["/opt/conda/bin/dask-scheduler"]
        resources:
          requests:
            cpu: 1
            memory: 20000Mi # set aside some extra resources for the scheduler
        ports:
        - containerPort: 8786
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-worker
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: daskd
        role: worker
    spec:
      containers:
      - name: worker
        image: uyogesh/daskml-tpot-gcpfs # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: [
          "/bin/bash",
          "-cx",
          "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 5e9",
        ]
        resources:
          requests:
            cpu: 2
            memory: 20000Mi

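My data has 4 million rows and 77 columns. Whenever I run the TPOT classifier on it, it runs on the Dask cluster for a while and then crashes. The output log looks like: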
KilledWorker:
("('gradientboostingclassifier-fit-1c9d29ce92072868462946c12335e5dd',
0, 4)", 'tcp://10.8.1.14:35499')

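I tried increasing the number of threads per worker, as the Dask distributed documentation suggests, but the problem persists.
Some of my observations:
  • With a smaller n_jobs it takes longer to crash than with
    n_jobs = -1 (with n_jobs = 4, it ran for 20 minutes
    before crashing).
  • It does start working and produces an optimized model with less data;
    with 10,000 rows it works fine.

So my question is: what changes and modifications do I need to make to get this working? I assume it is doable, since I have heard that Dask can handle data larger than mine.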
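
For scale, the in-memory size of the dataset described above can be estimated with simple arithmetic. The sketch below is only a rough aid: it assumes every value is stored as a 64-bit float (the question does not state the dtypes) and it ignores any overhead or copies made while models are being fitted. The function name `dense_size_bytes` is made up for this illustration; the row counts, column count, and the 5e9 worker limit are taken from the question.

```python
# Figures taken from the question.
ROWS_FULL = 4_000_000            # full dataset
ROWS_SMALL = 10_000              # subset that reportedly works fine
COLUMNS = 77
WORKER_MEMORY_LIMIT = int(5e9)   # --memory-limit passed to dask-worker

# ASSUMPTION: every value is a float64 (8 bytes). The question does not say.
BYTES_PER_VALUE = 8


def dense_size_bytes(rows, columns, itemsize=BYTES_PER_VALUE):
    """Size of a dense rows x columns table, ignoring all overhead."""
    return rows * columns * itemsize


full_bytes = dense_size_bytes(ROWS_FULL, COLUMNS)
small_bytes = dense_size_bytes(ROWS_SMALL, COLUMNS)

# How many complete copies of the full table fit under one worker's limit.
copies_under_limit = WORKER_MEMORY_LIMIT // full_bytes

print(f"full dataset:  {full_bytes / 1e9:.2f} GB")
print(f"small subset:  {small_bytes / 1e6:.2f} MB")
print(f"full copies that fit under the worker limit: {copies_under_limit}")
```

Under these assumptions a single copy of the full table takes up roughly half of one worker's 5e9-byte limit, while the 10,000-row subset is only a few megabytes. Whether several copies are actually held at once depends on how TPOT and Dask schedule the fitting tasks, which this estimate does not model.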

Best answer

The best practice described on Dask's official documentation page is:

    Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. Otherwise your workers may get killed by Kubernetes as they pack into the same node and overwhelm that nodes’ available memory, leading to KilledWorker errors.



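In your case, as far as I can see, the values of these configuration parameters do not match:

The Kubernetes container memory setting of 20 GB (memory: 20000Mi) versus the dask-worker command limit of 5 GB (--memory-limit 5e9).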
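
The comparison can be made concrete with a short script. This is a minimal sketch, not a tool shipped with Dask or Kubernetes: the helper `k8s_quantity_to_bytes` and the suffix table are written for this illustration and cover only a subset of Kubernetes quantity suffixes. The four input values are copied from the worker Deployment in the question. Since the quoted guidance also mentions --nthreads, the script compares the CPU request with the thread count as well.

```python
# Values copied from the worker Deployment in the question.
container_memory = "20000Mi"    # resources.requests.memory
container_cpu = 2               # resources.requests.cpu
worker_memory_limit = "5e9"     # dask-worker --memory-limit
worker_nthreads = 8             # dask-worker --nthreads

# Subset of the binary and decimal suffixes Kubernetes accepts for memory.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def k8s_quantity_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '20000Mi' to bytes.

    Hypothetical helper for this example; handles only the suffixes above.
    """
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # no suffix: plain number of bytes


container_bytes = k8s_quantity_to_bytes(container_memory)
worker_bytes = int(float(worker_memory_limit))

memory_matches = container_bytes == worker_bytes
threads_match = container_cpu == worker_nthreads

print(f"container memory: {container_bytes / 1e9:.2f} GB")
print(f"worker limit:     {worker_bytes / 1e9:.2f} GB")
print(f"memory matches:   {memory_matches}")
print(f"threads match:    {threads_match}")
```

With the values from the question, both checks report a mismatch. Following the quoted guidance, one way to resolve it would be to set the container's memory and CPU values and the --memory-limit and --nthreads flags to the same numbers.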

Regarding "kubernetes - Dask keeps failing and workers die while running TPOT", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54417135/
