gpt4 book ai didi

apache-kafka - 卡夫卡消费者错误 : Marking coordinator dead

转载 作者:行者123 更新时间:2023-12-04 05:27:56 31 4
gpt4 key购买 nike

我在 Kafka 0.10.0.1 集群中有一个有 10 个分区的主题。我有一个产生多个消费者线程的应用程序。对于这个主题,我生成了 5 个线程。在我的应用程序日志中,我多次看到此条目

INFO :: AbstractCoordinator:600 - Marking the coordinator x.x.x.x:9092
(id:2147483646 rack: null) dead for group notifications-consumer

然后有几个条目说 (Re-)joining group notifications-consumer.之后我还看到一个警告说
Auto commit failed for group notifications-consumer: Commit cannot be completed since
the group has already rebalanced and assigned the partitions to another member. This means
that the time between subsequent calls to poll() was longer than the configured
max.poll.interval.ms, which typically implies that the poll loop is spending too much time
message processing. You can address this either by increasing the session timeout
or by reducing the maximum size of batches returned by poll() with max.poll.records.

现在我已经像这样调整了我的消费者配置
props.put("max.poll.records", 200);
props.put("heartbeat.interval.ms", 20000);
props.put("session.timeout.ms", 60000);

因此,即使在正确调整配置后,我仍然收到此错误。在重新平衡期间,我们的应用程序完全没有响应。请帮忙。

最佳答案

session.timeout.ms您只控制由于心跳导致的超时,这意味着已通过 session.timeout.ms自上次心跳以来的毫秒数,集群将您声明为死节点并触发重新平衡。

之前 KIP-62心跳是在轮询中发送的,但现在被移动到特定的后台线程,以避免在您花费的时间超过 session.timeout.ms 时被从集群中逐出。调用另一个 poll() .
将心跳分离到特定线程可以将处理与告诉集群您已启动并正在运行分离,但这引入了“活锁”情况的风险,其中进程处于事件状态但没有取得进展,因此除了使心跳独立的poll引入了新的超时,以确保消费者还活着并取得进展。
文档说明了 KIP-62 之前的实现:

As long as the consumer is sending heartbeats, it basically holds a lock on the partitions it was assigned. If the process becomes defunct in such a way that it cannot make progress but is nevertheless continuing to send heartbeats, then no other member in the group will be able to take over the partitions, which causes increasing lag. The fact that heartbeating and processing is all done in the same thread, however, guarantees that consumers must make progress to keep their assignment. Any stall which affects processing also affects heartbeats.



KIP-62 引入的变化包括:

Decoupling the processing timeout: We propose to introduce a separate locally enforced timeout for record processing and a background thread to keep the session active until this timeout expires. We call this new timeout as the "process timeout" and expose it in the consumer's configuration as max.poll.interval.ms. This config sets the maximum delay between client calls to poll()



根据您发布的日志,我认为您可能处于这种情况,您的应用程序花费的时间超过 max.poll.interval.ms (默认为 5 分钟)处理 200 条轮询记录。
如果您处于这种情况下,您只能减少更多 max.poll.records或增加 max.poll.interval.ms .

PD:
max.poll.interval.ms出现在您的日志中的配置来自(至少)kafka 0.10.1.0,所以我认为您在那里犯了一个小错误。

更新

如果我理解错了,请纠正我,但在您的上一条评论中,您说您正在创建 25 个消费者(例如,25 org.apache.kafka.clients.consumer.KafkaConsumer,如果您使用的是 Java)并将它们订阅到 N 个不同的主题,但使用相同的 group.id .
如果这是正确的,您将看到每次重新平衡 KafkaConsumer启动或停止,因为它会发送 JoinGroupLeaveGroup包含 group.id 的消息(参见相应的 kafka protocol )和 member.id ( member.id 不是主机,因此在同一进程中创建的两个使用者仍将具有不同的 ID)。请注意,这些消息不包含主题订阅信息(尽管该信息应该在代理中,但 kafka 不会将其用于重新平衡)。
所以每次集群收到一个 JoinGroupLeaveGroupgroup.id X,它将为所有具有相同 group.id 的消费者触发重新平衡。 X。

如果你用相同的 group.id 启动 25 个消费者您将看到重新平衡,直到创建最后一个消费者并且相应的重新平衡结束(如果您继续看到这一点,您可能会停止消费者)。

我有 this issue a couple months ago .

If we have two KafkaConsumer using the same group.id (running in the same process or in two different processes) and one of them is closed, it triggers a rebalance in the other KafkaConsumer even if they were subscribed to different topics. I suppose that brokers must be taking into account only the group.id for a rebalance and not the subscribed topics corresponding to the pair (group_id,member_id) of the LeaveGroupRequest but I'm wondering if this is the expected behavior or it's something that should be improved? I guess that is probably the first option to avoid a more complex rebalance in the broker and considering that the solution is very simple, i.e. just use different group ids for different KafkaConsumer that subscribe to different topics even if they are running in the same process.



When rebalance occurs we see duplicate messages coming



这是预期的行为,一个消费者消费了消息,但在提交偏移量之前触发了重新平衡并且提交失败。当重新平衡完成时,具有该主题分配的过程将再次使用该消息(直到提交成功)。

I segregated into two groups, now suddenly problem has disappeared since past 2 hours.



您在这里一针见血,但如果您不想看到任何(可避免的)重新平衡,您应该使用不同的 group.id对于每个主题。

这是一个 great talk关于不同的重新平衡方案。

关于apache-kafka - 卡夫卡消费者错误 : Marking coordinator dead,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50351464/

31 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com