
amazon-web-services - ECS tasks in a cluster inside a private subnet stay in PROVISIONING status

Reposted. Author: 行者123. Updated: 2023-12-04 15:19:36

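We want to build an ECS cluster with the following characteristics:

  • It must run inside a VPC, so we need the awsvpc network mode.
  • It must use GPU instances, so we cannot use Fargate.
  • It must provision instances dynamically, so we need a capacity provider.
  • It will run tasks (batch jobs) that are triggered directly through the AWS ECS API, so we need only a task definition, not a service.
  • The tasks must be able to reach S3 (the Internet), so according to the AWS documentation the instances must be placed in a private subnet ( a reference to docs ).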

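We have already read this post on Stack Overflow, which says that we need a private subnet whose route table points to a NAT gateway configured in a public subnet, and that the public subnet must route to an Internet gateway. We already have this configuration. We have also configured an S3 VPC endpoint in the route table.
Below you can see some relevant parts of the cluster configuration in Terraform (for simplicity, I include only the relevant parts):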

    # Launch template
    resource "aws_launch_template" "train-launch-template" {
      # The original was missing the "$" before {var.project_name}
      name_prefix   = "${var.project_name}-launch-template-${var.env}"
      image_id      = "ami-01f62a207c1d180d2"
      instance_type = "m5.large"
      key_name      = "XXXXXX"

      iam_instance_profile {
        name = aws_iam_instance_profile.ecs-instance-profile.name
      }

      user_data = base64encode(data.template_file.user_data.rendered)

      # Instances get no public IP
      network_interfaces {
        associate_public_ip_address = false
        security_groups             = [aws_security_group.ecs_service.id]
      }
    }


    # Task definition
    resource "aws_ecs_task_definition" "task" {
      family                   = "${var.project_name}-${var.env}-train-task"
      execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
      task_role_arn            = aws_iam_role.ecs_train_task_role.arn
      requires_compatibilities = ["EC2"]
      cpu                      = var.ecs_cpu
      network_mode             = "awsvpc"
      memory                   = var.ecs_memory
      container_definitions    = data.template_file.app_definition.rendered

      tags = {
        Stage   = var.env_tag
        Project = var.project_name_tag
      }
    }


    # Cluster
    resource "aws_ecs_cluster" "cluster" {
      name               = "${var.project_name}-${var.env}-train-ecs-cluster"
      capacity_providers = [aws_ecs_capacity_provider.train-capacity-provider.name]

      default_capacity_provider_strategy {
        capacity_provider = aws_ecs_capacity_provider.train-capacity-provider.name
      }

      tags = {
        Project = var.project_name_tag
        Stage   = var.env_tag
      }
    }
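We have also configured all the roles that the instances and tasks need to access the required resources (S3, ECR, ECS).
The AMI is an ECS-optimized image (currently the latest version published in eu-west-1).
In the launch template, we removed the instance's public IP because of the explanation in this link.
We arrived at this configuration while trying to make it work, but we keep hitting the same problem: when a task is triggered, the capacity provider launches an instance, but the task is never placed on the container instance and stays in PROVISIONING status indefinitely.
With the same configuration but with the instances in a public subnet, the task is placed on the container instance but, as warned in the first link, the task has no Internet access.
We need some hint or trail to follow. Thanks in advance.
Update: as requested, I have added the rest of the autoscaling configuration: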
    resource "aws_autoscaling_group" "train-autoscaling" {
      availability_zones    = ["eu-west-1b"]
      desired_capacity      = 0
      max_size              = 10
      min_size              = 0
      protect_from_scale_in = true

      launch_template {
        id      = aws_launch_template.train-launch-template.id
        version = "$Latest"
      }

      tags = [
        {
          key                 = "Project"
          value               = var.project_name_tag
          propagate_at_launch = true
        },
        {
          key                 = "Stage"
          value               = var.env_tag
          propagate_at_launch = true
        }
      ]
    }

    resource "aws_ecs_capacity_provider" "train-capacity-provider" {
      name = "${var.project_name}-${var.env}-train-capacity-provider"

      auto_scaling_group_provider {
        auto_scaling_group_arn         = aws_autoscaling_group.train-autoscaling.arn
        managed_termination_protection = "ENABLED"

        managed_scaling {
          status                    = "ENABLED"
          target_capacity           = 100
          maximum_scaling_step_size = 1
          minimum_scaling_step_size = 1
        }
      }
    }

    data "template_file" "user_data" {
      template = file("${path.module}/user_data.sh")

      vars = {
        cluster_name = "${var.project_name}-${var.env}-train-ecs-cluster"
      }
    }
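Update 2 (AWS console information):
Container instances running: (screenshot)
Container instance details: (screenshot)
Pending task: (screenshot)
Pending task details: (screenshot)
Update 3:
After 30 minutes the task stopped, and this is the message shown (the task failed to start): (screenshot)
Update 4:
Logs from the container instance.
    ecs-agent.log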
    level=info time=2020-08-28T11:09:21Z msg="Loading configuration" module=agent.go
    level=info time=2020-08-28T11:09:21Z msg="Amazon ECS agent Version: 1.44.1, Commit: 1f05fbf0" module=agent.go
    level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
    level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
    level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=docker_image_manager.go
    level=info time=2020-08-28T11:09:21Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
    level=info time=2020-08-28T11:09:21Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
    level=info time=2020-08-28T11:09:21Z msg="Event stream ContainerChange start listening..." module=eventstream.go
    level=info time=2020-08-28T11:09:21Z msg="Loading state!" module=state_manager.go
    level=info time=2020-08-28T11:09:23Z msg="Registering Instance with ECS" module=agent.go
    level=info time=2020-08-28T11:09:23Z msg="Remaining mem: 7680" module=client.go
    level=info time=2020-08-28T11:09:23Z msg="Registered container instance with cluster!" module=client.go
    level=info time=2020-08-28T11:09:23Z msg="Registration completed successfully. I am running as 'arn:aws:ecs:eu-west-1:XXXXXXXXXXXXXXXX:container-instance/foqum-read-dev-train-ecs-cluster/95559f936f8d44de9373595009fcd588' in cluster 'foqum-read-dev-train-ecs-cluster'" module=agent.go
    level=info time=2020-08-28T11:09:23Z msg="Beginning Polling for updates" module=agent.go
    level=info time=2020-08-28T11:09:23Z msg="Initializing stats engine" module=engine.go
    level=info time=2020-08-28T11:09:23Z msg="Event stream DeregisterContainerInstance start listening..." module=eventstream.go
    level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXXX-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXXXX%3Acontainer-instance%2FXXXXXXXX-cluster%2F95559fXXXXXXde9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
    level=info time=2020-08-28T11:09:23Z msg="NO_PROXY set:XXX.254.169.XXXX,XXXX.254.XXX.2,/var/run/docker.sock" module=client.go
    level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-a-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&clusterArn=XXXXX-ecs-cluster&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%XXXXXX%3Acontainer-instance%2FXXXXX-ecs-cluster%2F9XXXXX6f8d44de9373595009fcd588&dockerVersion=DockerVersion%3A+19.03.6-ce&sendCredentials=true&seqNum=1" module=client.go
    level=info time=2020-08-28T11:09:23Z msg="Connected to TCS endpoint" module=handler.go
    level=info time=2020-08-28T11:09:23Z msg="Connected to ACS endpoint" module=acs_handler.go
    level=info time=2020-08-28T11:20:04Z msg="TCS Websocket connection closed for a valid reason" module=handler.go
    level=info time=2020-08-28T11:20:04Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXecs-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXX3Acontainer-instance%2FZZZXXXXX-ecs-cluster%2F95XXX936f8d44de9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
    level=info time=2020-08-28T11:20:04Z msg="Connected to TCS endpoint" module=handler.go
    ecs-init.log
    2020-08-28T11:09:19Z [INFO] pre-start
    2020-08-28T11:09:20Z [INFO] start
    2020-08-28T11:09:20Z [INFO] No existing agent container to remove.
    2020-08-28T11:09:20Z [INFO] Starting Amazon Elastic Container Service Agent

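Best answer

Finally!! Mystery solved!
The problem was not in the cluster configuration. When you call run_task through the ECS API, you need to specify the subnets in which the task should run.
Our code was setting this field to one of the public subnets. That is why the task was placed when we moved the container instances to the availability zone that matched this public subnet.
After we changed this call in our code, the task is placed correctly and it has Internet access.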
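One way to spot this kind of mismatch is to compare the subnet recorded on the pending task with the subnets where the container instances actually run. The following is only a sketch: it assumes the `describe_tasks` response lists a `subnetId` detail in the task's attachments, and the response, subnet IDs, and helper name `task_subnet_ids` are hypothetical.

```python
from typing import Dict, List


def task_subnet_ids(describe_tasks_response: Dict) -> List[str]:
    """Collect every subnetId found in the attachments of the described tasks."""
    subnet_ids = []
    for task in describe_tasks_response.get("tasks", []):
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail.get("name") == "subnetId":
                    subnet_ids.append(detail["value"])
    return subnet_ids


# Hypothetical, trimmed response for a task stuck in PROVISIONING.
# In real use this would come from boto3's ecs.describe_tasks(...).
sample_response = {
    "tasks": [
        {
            "lastStatus": "PROVISIONING",
            "attachments": [
                {"details": [{"name": "subnetId", "value": "subnet-0aaa1111public"}]}
            ],
        }
    ]
}

# Hypothetical private subnets in which the container instances are launched.
instance_subnets = {"subnet-0bbb2222private"}

# Subnets requested for the task that no container instance lives in.
mismatched = [s for s in task_subnet_ids(sample_response) if s not in instance_subnets]
print(mismatched)
```

If the list is not empty, the task was requested in a subnet that the cluster's instances do not use, which matches the situation described here.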
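The corrected call might look roughly like the sketch below, which assumes boto3 is the client library. The cluster name, task definition family, subnet ID, security group ID, and both helper names are hypothetical placeholders, not values from our setup. No `launchType` is passed, on the assumption that the cluster's default capacity provider strategy should apply.

```python
from typing import Dict, List


def build_run_task_kwargs(
    cluster: str,
    task_definition: str,
    private_subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict:
    """Build run_task arguments that place the task ENI in private subnets."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # Required for awsvpc tasks: the subnets must be the private ones
        # where the container instances run, not a public subnet.
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(private_subnet_ids),
                "securityGroups": list(security_group_ids),
            }
        },
    }


def run_training_task(kwargs: Dict) -> Dict:
    """Submit the task; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without boto3

    ecs = boto3.client("ecs")
    return ecs.run_task(**kwargs)


# Hypothetical identifiers, for illustration only.
kwargs = build_run_task_kwargs(
    cluster="my-project-dev-train-ecs-cluster",
    task_definition="my-project-dev-train-task",
    private_subnet_ids=["subnet-0bbb2222private"],
    security_group_ids=["sg-0ccc3333"],
)
print(kwargs["networkConfiguration"]["awsvpcConfiguration"]["subnets"])
```

Calling `run_training_task(kwargs)` would then submit the task. Keeping the argument construction separate from the API call makes it easy to check which subnets are being requested before anything is sent to ECS.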

Regarding "amazon-web-services - ECS tasks in a cluster inside a private subnet stay in PROVISIONING status", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63621979/
