Redis Sentinel and fix-slave-config: Redis node is getting set as slave of two masters when it should not be


I'm attempting to use Sentinel for failover on a large Redis fleet (12 sentinels, 500+ shards, one master and one slave per shard). I've run into a very strange problem where my sentinels repeatedly issue the +fix-slave-config command against certain Redis nodes. For what it's worth, I never noticed this happening at smaller scale.

I've noticed two specific problems:

  1. The +fix-slave-config messages, as described above
  2. sentinel.conf showing some slaves with two masters (they should only have one)

In its starting state, the fleet has a particular slave node XXX.XXX.XXX.177 with master XXX.XXX.XXX.244 (together they make up shard-172 of the fleet). Without any node outages, the slave's master switches to XXX.XXX.XXX.96 (the master of shard-188), then back, then back again. This was verified by sshing into both the slave and the master nodes and checking redis-cli INFO. All Redis nodes were started with the correct configuration. All Sentinel nodes have the correct configuration in their sentinel.conf. Every Sentinel reports exactly the same list of masters when I query them after each switch.
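
For reference, this is the kind of per-node check involved; a minimal sketch of asking one node who its master is (the field names are standard INFO replication output, the values shown are illustrative):

$ redis-cli -h XXX.XXX.XXX.177 -p 6379 INFO replication
# Replication
role:slave
master_host:XXX.XXX.XXX.244
master_port:6379
master_link_status:up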

Across my 12 sentinels, the following gets logged. A +fix-slave-config message is emitted roughly every minute:

Sentinel #8: 20096:X 22 Oct 01:41:49.793 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #1: 9832:X 22 Oct 01:42:50.795 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
Sentinel #6: 20528:X 22 Oct 01:43:52.458 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #10: 20650:X 22 Oct 01:43:52.464 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #2: 20838:X 22 Oct 01:44:53.489 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
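
For what it's worth, these events can also be watched live over a Sentinel's pub/sub interface (a sketch, assuming the default Sentinel port 26379). Each event arrives as a pmessage whose channel is the event name, +fix-slave-config included:

$ redis-cli -h <sentinel-host> -p 26379 PSUBSCRIBE '*'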

Here is the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves, when in fact it should only have one. The output looks the same whether XXX.XXX.XXX.177 is listed under shard-172 or under shard-188.

Case 1) XXX.XXX.XXX.244 is the master of XXX.XXX.XXX.177

183)  1) "name"
2) "shard-172"
3) "ip"
4) "XXX.XXX.XXX.244"
5) "port"
6) "6379"
7) "runid"
8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "14"
17) "last-ping-reply"
18) "14"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5636"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17154406"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "1"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
72) 1) "name"
2) "shard-188"
3) "ip"
4) "XXX.XXX.XXX.96"
5) "port"
6) "6379"
7) "runid"
8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "927"
17) "last-ping-reply"
18) "927"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5333"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17154312"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "2"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"

Case 2) XXX.XXX.XXX.96 is the master of XXX.XXX.XXX.177

79)  1) "name"
2) "shard-172"
3) "ip"
4) "XXX.XXX.XXX.244"
5) "port"
6) "6379"
7) "runid"
8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "1012"
17) "last-ping-reply"
18) "1012"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "1261"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17059720"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "1"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
273) 1) "name"
2) "shard-188"
3) "ip"
4) "XXX.XXX.XXX.96"
5) "port"
6) "6379"
7) "runid"
8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "886"
17) "last-ping-reply"
18) "886"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5762"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17059758"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "2"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"

The starting sentinel.conf for every sentinel is:

maxclients 20000
loglevel notice
logfile "/home/redis/logs/sentinel.log"
sentinel monitor shard-172 redis-b-172 6379 7
sentinel down-after-milliseconds shard-172 30000
sentinel failover-timeout shard-172 60000
sentinel parallel-syncs shard-172 1
....
sentinel monitor shard-188 redis-b-188 6379 7
sentinel down-after-milliseconds shard-188 30000
sentinel failover-timeout shard-188 60000
sentinel parallel-syncs shard-188 1
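
For reference, the general form of the monitor directive is:

sentinel monitor <master-name> <ip> <port> <quorum>

where the quorum (7 here) is the number of Sentinels that must agree a master is unreachable before it is marked objectively down and a failover can be started.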

Here is the sentinel.conf generated a few minutes later (it looks the same on all sentinels). Note the two slaves under shard-188:

sentinel monitor shard-172 XXX.XXX.XXX.244 6379 7
sentinel failover-timeout shard-172 60000
sentinel config-epoch shard-172 0
sentinel leader-epoch shard-172 0
sentinel known-slave shard-172 XXX.XXX.XXX.177 6379 <--- True slave of shard-172
sentinel known-sentinel shard-172 ...
...
sentinel monitor shard-188 XXX.XXX.XXX.96 6379 7
sentinel failover-timeout shard-188 60000
sentinel config-epoch shard-188 0
sentinel leader-epoch shard-188 0
sentinel known-slave shard-188 XXX.XXX.XXX.194 6379 <--- True slave of shard-188
sentinel known-slave shard-188 XXX.XXX.XXX.177 6379
sentinel known-sentinel shard-188 ...

Best Answer

This is what I call the "ant problem": you have two (or more) pods (master + slave) that have gotten mixed together. You demonstrate it yourself when you show that one of your pods has multiple slaves.

Specifically:

Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves, when in fact it should only have 1.

What you need to do is the following (a command sketch follows the list):

  1. Remove those shards (shard-188 and shard-???) from all sentinels via sentinel remove shard-NNN
  2. Take down the pods those slaves are in
  3. Configure them correctly (correct slaveof commands/config)
  4. Bring them back online
  5. Verify that each of them has only its one correct slave
  6. Add them back to the Sentinels via sentinel monitor ...
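
A minimal sketch of steps 1, 3 and 6 as commands (the <sentinel-host> placeholder and the default Sentinel port 26379 are assumptions, and shard-188 stands in for each affected shard):

$ redis-cli -h <sentinel-host> -p 26379 SENTINEL REMOVE shard-188
$ redis-cli -h XXX.XXX.XXX.177 -p 6379 SLAVEOF XXX.XXX.XXX.244 6379
$ redis-cli -h <sentinel-host> -p 26379 SENTINEL MONITOR shard-188 XXX.XXX.XXX.96 6379 7

The first and last commands must be repeated on every Sentinel; the SLAVEOF here is the runtime form of the slaveof config line and points the wandering slave back at its one true master.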

Now technically you could use the sentinel reset command, but you would be exposed to potential timing issues, so removing them from the Sentinels is the way to go. Optionally, you could leave the pods up and simply reconfigure the slaves appropriately. If you go that route, wait a few minutes and check the slave list before moving on to step 6.
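
For completeness, sentinel reset takes a glob pattern and clears Sentinel's state (known slaves and known sentinels) for every matching master, forcing rediscovery (a sketch, same port assumption as above):

$ redis-cli -h <sentinel-host> -p 26379 SENTINEL RESET shard-188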

Regarding "Redis Sentinel and fix-slave-config: Redis node is getting set as slave of two masters when it should not be", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33272150/
