
docker - [HTCONDOR][kubernetes/k8s] : Unable to start minicondor image within k8s - condor_master not working

Reposted. Author: 行者123. Updated: 2023-12-04 13:51:23

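Later edit
The problem was caused by the PSP (Pod Security Policy). By default, my condor user was not allowed to escalate privileges. That is why it did not work: supervisord runs as root, tries to write the logs, and starts the condor collector as root instead of as another user (i.e. condor).
Description: the mini-condor base image does not start as expected in a Kubernetes pod on Rancher.
I am using: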

  • This image: https://hub.docker.com/r/htcondor/mini, in a custom namespace on Rancher (k8s)

  • PS: the image worked perfectly on:

    • a local env
    • minikube default installation

    I run it as a simple Deployment.
    When the pod starts, the default Kubernetes log shows:
    2021-09-15 09:26:36,908 INFO supervisord started with pid 1
    2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
    2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
    2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
    2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
    2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
    2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
    2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
    2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
    2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
    2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
    2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
    2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
    2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
    2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
    2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
    2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
    2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
    2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
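The restart loop above is easier to read once the exit statuses are tallied per program. The following is a minimal Python sketch, not part of the original post; `tally_exits` and `SAMPLE` are hypothetical names, and the regular expression only assumes the line layout visible in the log above. As a general hint, 127 is the status a POSIX shell returns when a command is not found, which may or may not be what happened to condor_restd here.

```python
import re
from collections import Counter

# Matches supervisord lines such as:
#   2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
EXITED = re.compile(r"INFO exited: (?P<program>\S+) \(exit status (?P<status>\d+);")


def tally_exits(log_text):
    """Count (program, exit status) pairs found in supervisord log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXITED.search(line)
        if match:
            counts[(match["program"], int(match["status"]))] += 1
    return counts


# Three lines copied from the log above, used as sample input.
SAMPLE = """\
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
"""

if __name__ == "__main__":
    for (program, status), n in sorted(tally_exits(SAMPLE).items()):
        print(f"{program}: exit status {status} seen {n} time(s)")
```

To use it on a real pod, the log text could be captured with `kubectl logs` and passed to `tally_exits` instead of `SAMPLE`.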
    Here is a short list of commands and their output:

    CMD: condor_status
    Output: CEDAR:6001:Failed to connect to <127.0.0.1:9618>

    CMD: condor_master
    Output: ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp
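    1) First attempt to fix the problem
    I decided to customize the image, but the error was the same.
    Docker image used to try to fix the permission issue:
  • Image (Dockerfile):

    FROM htcondor/mini:9.2-el7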

    RUN condor_master

    RUN chown condor:root /var/
    RUN chown condor:root /var/log
    RUN chown -R condor:root /var/log/
    RUN chown -R condor:condor /var/log/condor

    RUN chown condor:condor /var/log/condor/ProcLog
    RUN chown condor:condor /var/log/condor/MasterLog

    RUN chmod 775 -R /var/
  • Kubernetes - Rancher
  • YAML file:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: htcondor-mini--all-in-one
      namespace: grafana-exporter
    spec:
      containers:
      - image: <custom_image>
        imagePullPolicy: Always
        name: htcondor-mini--all-in-one
        resources: {}
        securityContext:
          capabilities: {}
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsConfig: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
    Here is a short list of commands and their output with the custom image:

    CMD: condor_status
    Output: CEDAR:6001:Failed to connect to <127.0.0.1:9618>

    CMD: condor_master
    Output: ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp

    CMD: ls -ld /var/
    Output: drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/

    CMD: ls -ld /var/log/
    Output: drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/

    CMD: ls -ld /var/log/condor
    Output: drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor

    CMD: ls -ld /var/log/condor/MasterLog
    Output: -rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog


    MasterLog content:
    10/07/21 11:23:21 ******************************************************
    10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
    10/07/21 11:23:21 ** /usr/sbin/condor_master
    10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
    10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
    10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
    10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
    10/07/21 11:23:21 ** PID = 7
    10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
    10/07/21 11:23:21 ******************************************************
    10/07/21 11:23:21 Using config source: /etc/condor/condor_config
    10/07/21 11:23:21 Using local config sources:
    10/07/21 11:23:21 /etc/condor/config.d/00-htcondor-9.0.config
    10/07/21 11:23:21 /etc/condor/config.d/00-minicondor
    10/07/21 11:23:21 /etc/condor/config.d/01-misc.conf
    10/07/21 11:23:21 /etc/condor/condor_config.local
    10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
    10/07/21 11:23:21 CLASSAD_CACHING is OFF
    10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
    10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
    10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
    10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
    10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
    10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
    10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
    10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
    10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
    10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
    10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
    10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
    10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
    10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
    10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
    10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
    10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
    10/07/21 11:23:23 Found /var/log/condor/.collector_address.
    10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
    10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
    10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
    10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started
    Thank you very much for reading. I hope this helps many others.

    Best answer

    Cause of the problem
    The problem was caused by the PSP (Pod Security Policy).
    By default, my condor user is not allowed to escalate privileges.
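To see which identity a process actually gets inside the pod, and whether that identity may write to the log directory named in the error message, a small check like the one below can help. It is only a sketch: `describe_access` is a hypothetical helper, and it assumes Python is available in the container (for example when run through `kubectl exec`).

```python
import os


def describe_access(path):
    """Report the ids of the current process and whether it may write to path."""
    return {
        "uid": os.getuid(),    # real user id of this process
        "euid": os.geteuid(),  # effective user id of this process
        "path": path,
        "exists": os.path.exists(path),
        # os.access performs the check with the real uid/gid
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # /var/log/condor is the directory from the "Cannot open log file" error.
    for key, value in describe_access("/var/log/condor").items():
        print(f"{key}: {value}")
```

If the reported uid is 0 while the policy forbids switching users, that would be consistent with the failure described in the question, although the check alone does not prove it.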
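    Solution
    The best solution I have found so far is to run everything as the condor user and to give the condor user the permissions it needs. To do this, you need: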

  • supervisord.conf: run supervisord as the condor user
  • supervisord.conf: keep the log files and the socket in /tmp
  • Dockerfile: change the owner of most directories to condor
  • deployment.yaml: set runAsUser to ID 64 (the condor user)

    Dockerfile:
    FROM htcondor/mini:9.2-el7

    # SET WORKDIR
    WORKDIR /home/condor/
    RUN chown condor:condor /home/condor

    # COPY SUPERVISOR
    COPY supervisord.conf /etc/supervisord.conf

    # Need to run the cmd to create all dir
    RUN condor_master

    # FIX PERMISSION ISSUES FOR RANCHER
    RUN chown -R condor:condor /var/log/ /tmp &&\
    chown -R restd:restd /home/restd &&\
    chmod 755 -R /home/restd

    supervisord.conf :
    [supervisord]
    user=condor
    nodaemon=true
    logfile = /tmp/supervisord.log
    directory = /tmp
    pidfile = /tmp/supervisord.pid
    childlogdir = /tmp

    # the next 3 sections are needed to manage daemons with supervisorctl
    [unix_http_server]
    file=/tmp/supervisord.sock
    chown=condor:condor
    chmod=0777
    user=condor

    [rpcinterface:supervisor]
    supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

    [supervisorctl]
    serverurl=unix:///tmp/supervisord.sock

    [program:condor_master]
    user=condor
    command=/usr/sbin/condor_master -f
    autostart=true
    autorestart=true
    redirect_stderr=true
    stdout_logfile = /var/log/condor_master.log
    stderr_logfile = /var/log/condor_master.error.log
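supervisorctl can only reach supervisord when `serverurl` in `[supervisorctl]` points at the same socket as `file` in `[unix_http_server]`. The sketch below checks that with Python's standard `configparser`; `socket_paths` and `SAMPLE` are hypothetical names, and supervisord uses its own parser, so this is a quick sanity check and not a full validation of the file.

```python
import configparser


def socket_paths(conf_text):
    """Return (socket path served by supervisord, socket path used by supervisorctl)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    server = parser.get("unix_http_server", "file")
    client_url = parser.get("supervisorctl", "serverurl")
    # Strip the unix:// scheme so both values are plain filesystem paths.
    prefix = "unix://"
    client = client_url[len(prefix):] if client_url.startswith(prefix) else client_url
    return server, client


# Reduced sample containing only the two sections that matter for this check.
SAMPLE = """\
[unix_http_server]
file=/tmp/supervisord.sock

[supervisorctl]
serverurl=unix:///tmp/supervisord.sock
"""

if __name__ == "__main__":
    server, client = socket_paths(SAMPLE)
    print("match" if server == client else f"mismatch: {server} != {client}")
```

Passing the text of the real supervisord.conf instead of `SAMPLE` would flag a socket file name that differs between the two sections.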
    deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    spec:
      containers:
      - image: <condor-image>
        imagePullPolicy: Always
        name: htcondor-exporter
        ports:
        - containerPort: 8080
          name: myport
          protocol: TCP
        resources: {}
        securityContext:
          capabilities: {}
          runAsNonRoot: false
          runAsUser: 64
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
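`runAsUser: 64` only has the intended effect if 64 really is the uid of the condor user in the image being deployed. One way to confirm this is to look the user up in the container's /etc/passwd. The sketch below does so in Python; `uid_from_passwd` is a hypothetical helper, and the sample entries are illustrative except for the uid 64 stated in the answer.

```python
def uid_from_passwd(passwd_text, user):
    """Return the numeric uid of user from passwd-format text, or None if absent."""
    for line in passwd_text.splitlines():
        # passwd format: name:password:uid:gid:gecos:home:shell
        fields = line.split(":")
        if len(fields) >= 3 and fields[0] == user:
            return int(fields[2])
    return None


# Illustrative sample; only the uid 64 for condor comes from the answer above.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "condor:x:64:64:condor:/var/lib/condor:/sbin/nologin\n"
)

if __name__ == "__main__":
    print(uid_from_passwd(SAMPLE, "condor"))
```

Inside the container, the contents of /etc/passwd would be passed in place of `SAMPLE`; on a live system the standard `pwd.getpwnam` function gives the same answer.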

    Regarding docker - [HTCONDOR][kubernetes/k8s] : Unable to start minicondor image within k8s - condor_master not working, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69190926/
