gpt4 book ai didi

Elasticsearch 未分配分片 CircuitBreakingException[[parent] 数据太大

转载 作者:行者123 更新时间:2023-12-02 22:53:00 26 4
gpt4 key购买 nike

我收到警告说 elasticsearch 有 2 个未分配的分片。我进行了以下 api 调用以收集更多详细信息。

    curl -s http://localhost:9200/_cluster/allocation/explain | python -m json.tool

输出如下

    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
"can_allocate": "no",
"current_state": "unassigned",
"index": "docs_0_1603929645264",
"node_allocation_decisions": [
{
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-30T06:10:16.305Z], failed_attempts[5], delayed=false, details[failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery, failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{IP}{IP:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested:
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1972835086/1.8gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1972833976/1.8gb], new bytes reserved: [1110/1kb]]; ], allocation_status[no_attempt]]]"
}
],
"node_decision": "no",
"node_id": "1XEXS92jTK-asdfasdfasdf",
"node_name": "elasticsearch-data-2",
"transport_address": "IP1:9300"
},
{
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-30T06:10:16.305Z], failed_attempts[5], delayed=false, details[failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery, failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP1:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{IP2}{IP2:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP1:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested:
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1972835086/1.8gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1972833976/1.8gb], new bytes reserved: [1110/1kb]]; ], allocation_status[no_attempt]]]"
},
{
"decider": "same_shard",
"decision": "NO",
"explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[docs_0_1603929645264][0], node[fIaSuZsNTwODgZnt90f7kQ], [P], s[STARTED], a[id=stHnyqjLQ7OwFbaqs5vWqA]]"
}
],
"node_decision": "no",
"node_id": "fIaSuZsNTwODgZnt90f7kQ",
"node_name": "elasticsearch-data-1",
"transport_address": "IP1:9300"
},
{
"deciders": [
{
"decider": "max_retry",
"decision": "NO",
"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-30T06:10:16.305Z], failed_attempts[5], delayed=false, details[failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery, failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP1:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{Ip2}{IP2:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP1:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested:
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1972835086/1.8gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1972833976/1.8gb], new bytes reserved: [1110/1kb]]; ], allocation_status[no_attempt]]]"
}
],
"node_decision": "no",
"node_id": "o_9jyrmOSca9T12J4bY0Nw",
"node_name": "elasticsearch-data-0",
"transport_address": "IP2:9300"
}
],
"primary": false,
"shard": 0,
"unassigned_info": {
"at": "2020-10-30T06:10:16.305Z",
"details": "failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery, failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP1:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{IP2}{IP2:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP1:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested:
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1972835086/1.8gb], which is larger than the limit of [1972122419/1.8gb], real usage: [1972833976/1.8gb], new bytes reserved: [1110/1kb]]; ",
"failed_allocation_attempts": 5,
"last_allocation_status": "no_attempt",
"reason": "ALLOCATION_FAILED"
}
}

我查询了断路器配置

    curl -X GET "localhost:9200/_nodes/stats/breaker?pretty

并且可以看到3个节点(elasticsearch-data-0、elasticsearch-data-1和elasticsearch-data-2)的parent limit_size_in_byes如下。

"parent" : {
"limit_size_in_bytes" : 1972122419,
"limit_size" : "1.8gb",
"estimated_size_in_bytes" : 1648057776,
"estimated_size" : "1.5gb",
"overhead" : 1.0,
"tripped" : 139
}

我引用了这个答案 https://stackoverflow.com/a/61954408并计划增加断路器或整个 JVM 堆的内存百分比。

这是一个 k8s 环境,elasticsearch-data 被部署为一个有 3 个副本的状态集。当我对 statefulset 进行描述时,我可以看到下面定义的 ENV 变量

Containers:
elasticsearch:
Image: custom/elasticsearch-oss-s3:7.0.0
Port: 9300/TCP
Host Port: 0/TCP
Limits:
cpu: 10500m
memory: 21Gi
Requests:
cpu: 10
memory: 20Gi
Environment:
DISCOVERY_SERVICE: elasticsearch-discovery
NODE_MASTER: false
PROCESSORS: 11 (limits.cpu)
ES_JAVA_OPTS: -Djava.net.preferIPv4Stack=true -Xms2048m -Xmx2048m

据此,堆大小似乎为 2048m

我登录到 elasticsearch-data pod 并在 elasticsearch 配置目录下看到以下文件

elasticsearch.keystore  elasticsearch.yml  jvm.options  log4j2.properties  repository-s3

elasticsearch.yml 没有任何堆配置等。它只有主节点的名称等。

下面是jvm.options文件


## JVM configuration

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g


## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly


## DNS cache policy
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10


# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow


# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

从上面看,总堆大小似乎是 1g。

但是从这个pod的stateful set中定义的env变量来看,好像是2048m。

哪个是对的?

现在,从下面的链接

Circuit breaker settings | Elasticsearch

可以使用以下设置配置父级断路器:

indices.breaker.total.use_real_memory(静态)确定父断路器是否应考虑实际内存使用情况 (true) 或仅考虑子断路器保留的内存量 (false)。默认为真。

indices.breaker.total.limit(动态)整个父断路器的起始限制。如果 indices.breaker.total.use_real_memory 为 false,则默认为 JVM 堆的 70%。如果 indices.breaker.total.use_real_memory 为真,则默认为 JVM 堆的 95%。

但是错误和我查询的断路器统计信息中的限制值是这个 - 1972122419 字节(1.8G)。这似乎不是 2048M 的 95% 或 1g。

现在,我怎样才能增加堆或 breaker parent 的内存限制,以便摆脱这个错误?

最佳答案

这里有两件事,分片分配异常和断路器异常(看起来是嵌套异常)。

请在您的集群中使用以下命令重新触发分配,因为之前的所有重试都失败了,如果您仔细注意到,您的异常消息中也会提示同样的操作。有关以下命令的更多信息在此related Github issue comment .

curl -XPOST ':9200/_cluster/reroute?retry_failed

如果还是不行,那你就得修复父级的断路器异常,你应该使用http://localhost:9200/_nodes/stats API来知道确切的你的 ES 节点的堆,并相应地增加它。

关于Elasticsearch 未分配分片 CircuitBreakingException[[parent] 数据太大,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/64606301/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com