Java: how to avoid stop-the-world garbage collection

Reposted by: bug小助手 | Updated: 2023-10-26 21:27:05

We have a market app that streams market data, and it gets a spike of data every day.
Because of the allocations during the spike, the VM ends up doing a stop-the-world (STW) garbage collection.
From the usage graphs, it looks like all the allocation during the spike goes directly to Old Gen. Are there any G1GC parameters that can change this behavior? My thinking is that by keeping these objects in Eden space we would be able to avoid the STW GC.

I tried -XX:MaxTenuringThreshold=15, but it does not seem to have any effect.

We are using Java 8 with G1GC.

Graph: total heap usage

Graph: G1 Eden space. During the spike, usage goes down.

Graph: G1 Old Gen. It looks like all allocation during the spike goes directly to Old Gen.

Comments

Does this answer your question? Does Java Garbage Collect always has to "Stop-the-World"?

No, that does not answer my question, @Pino. I know that there are cases when STW GCs are necessary. The question is whether, in my particular usage scenario, there are any parameters that I can use to avoid the STW GC.

Think about it this way: if there was a parameter which can avoid stop-the-world pauses even when we don’t know anything about their cause, like in your example, what do you think would be the reason that this parameter is not on by default?

Oh no, I am not looking for a magical parameter that would avoid stop-the-world pauses. I have explained what is happening before the pause, and I am asking if there are any tunables that would help in this situation.

You didn’t explain what’s happening, you just described the symptoms we can see in these pictures. You also provided the conjecture that the symptoms imply that allocations went directly to the Old Gen, which is not backed by anything. The graph would look the same if just all young objects were promoted to old. But regardless of whether the conjecture is correct or not, it still doesn’t explain why this happens and without knowing why it happens, there is no way to fix the issue. Except for a --do-not-make-a-spike-a-day option which does not exist.

Recommended answer

The problem is likely that your application is allocating a very large amount of long-lived objects in a short amount of time. As other commenters have mentioned, you cannot stop all STW pauses, but given your large heap & the fact that you're posting here, it probably means you hit a very long STW pause - maybe even a Full GC - during the spike. Enabling detailed logs with -verbose:gc -XX:+PrintGCDetails would give us better details to help.
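
If restarting with extra logging flags is inconvenient, the JVM's standard management API exposes similar cumulative counters at runtime. The following is only a minimal sketch and not part of the original answer: the class name `GcSummary` is hypothetical, and the collector names it reports (for example "G1 Young Generation" and "G1 Old Generation" under G1) depend on the JVM version and the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original post): summarizes the JVM's
// cumulative GC counters via the standard java.lang.management API.
final class GcSummary {

    // Returns one line per collector known to this JVM.
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means "undefined".
            lines.add(String.format("%s: count=%d, totalTimeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize()) {
            System.out.println(line);
        }
    }
}
```

Calling `summarize()` shortly before and after the daily spike and comparing the old-generation collector's count is a rough way to check whether a Full GC likely happened. The GC log remains the more detailed source.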

Since you didn't provide GC logs, it's difficult to say exactly, but there are some observations from your plots that can help us figure it out:

The Eden space usage gets very small, relative to the steady state.

Under normal conditions in G1, a "Young GC" is by far the most common GC, and it includes an STW pause that should ideally be short. Most of the points on your plots probably correspond to Young GCs: when a GC event happens, the GC log prints the region sizes, and a log analysis or plotting tool can process them. In a Young GC, G1 starts at the reference roots of your application, follows the reference graph deeper and deeper to find all "live" objects, and copies them into "Survivor" space.

Roughly speaking, because the length of a Young GC STW pause is proportional to the amount of live objects, G1 will shrink Eden to try to ensure that future pauses meet the latency target (-XX:MaxGCPauseMillis, 200 ms by default).

The Old Gen usage increases rapidly during the traffic spike, by approximately 16GB. The increase is nearly monotonic but contains a very small decrease midway. Setting MaxTenuringThreshold to 15 didn't improve things (with G1 on Java 8, 15 is already the default; 6 is the default that applies with CMS, so the flag by itself changed nothing).

Since your application continues to allocate lots of memory quickly, and the Eden space is now smaller, it fills up fast. The time between when an object is allocated and when a Young GC processes it is now even shorter, meaning those objects have less time to die. On average, this means a higher fraction will need to be copied to Survivor. By default, the Survivor space is only a small fraction of the size of Eden (as far as I know, roughly 1/8 of the young generation with the default -XX:SurvivorRatio=8). When the fraction of live Eden objects is high, existing Survivor objects get prematurely promoted to Old to free up space. Your application may allocate so much that objects are promoted directly from Eden to Old. This explains why you see a rapid increase in the size of Old Gen. And since objects don't stay in Eden or Survivor long enough to reach even a modest tenuring age, raising MaxTenuringThreshold did nothing.

Normally, G1 tries to collect Old concurrently with a series of Mixed collections. The small dip during the spike suggests at least one Mixed collection may have happened, but it freed up almost no space. Since there are a few more points on the plot before the big drop, that may mean no further Mixed collections ran or they ran but freed up very little space. The large drop could be a Mixed collection, but given that you're asking about how to stop bad pause behavior, it's probably a Full GC. With your heap size (>40GB) and the amount of live data in your Old gen (~26GB), a Full GC would generally be quite long.
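
The premature-promotion mechanism described above can be illustrated with some back-of-the-envelope arithmetic. This is a deliberately simplified, hypothetical model and not how G1 is actually implemented: it ignores object ages and MaxTenuringThreshold entirely, and the Eden and Survivor sizes are made-up numbers chosen only for illustration.

```java
// Simplified, hypothetical model of one Young GC (not how G1 is implemented):
// whatever is still live and does not fit into Survivor overflows into Old.
// Object ages and MaxTenuringThreshold are deliberately ignored.
final class PromotionModel {

    static long overflowToOldMb(long edenUsedMb, double liveFraction,
                                long survivorUsedMb, long survivorCapacityMb) {
        // Everything live in Eden, plus what already sits in Survivor,
        // competes for the Survivor capacity.
        long liveMb = (long) (edenUsedMb * liveFraction) + survivorUsedMb;
        return Math.max(0L, liveMb - survivorCapacityMb);
    }

    public static void main(String[] args) {
        long edenMb = 8192;         // assumed Eden size, for illustration only
        long survivorCapMb = 1024;  // assumed Survivor capacity

        // Steady state: few objects survive, so everything fits into Survivor.
        System.out.println(overflowToOldMb(edenMb, 0.125, 0, survivorCapMb)); // prints 0
        // Spike: half of Eden is still live, so most of it spills into Old.
        System.out.println(overflowToOldMb(edenMb, 0.5, 0, survivorCapMb));   // prints 3072
    }
}
```

In this toy model, the overflow repeats on every Young GC for as long as the live fraction stays high, which would be consistent with the near-monotonic growth of Old Gen in the plot.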

Suggested strategies:

A: (Avoid Full GC): If the large drop really is a Full GC, then because your Old Gen live set is already quite large, you need to either increase your heap by at least ~5GB or refactor your application to keep fewer objects in memory long term (reducing the steady-state 26GB Old size). Theoretically, if you set G1NewSizePercent to 35 (an experimental flag that also requires -XX:+UnlockExperimentalVMOptions) so that the ~16GB spike fits within Eden, the data could stay in Eden long enough to die, avoiding a long Young GC pause. This isn't likely to work well in practice, though, since a larger Eden makes your Old Gen even smaller and increases the chance of a Full GC. It also relies on the luck of Eden being almost empty exactly when the spike starts. Otherwise, large fractions of the spike will get copied to Old anyway, and the Young GC pause will be long.
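
The trade-off in strategy A can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: the 46GB heap is a guess (the question only implies a heap above 40GB), the 26GB live set and 16GB spike are read off the plots, and the class and method names are hypothetical. It also treats the whole young generation as Eden, although part of it is reserved for Survivor.

```java
// Hypothetical sizing sketch: how much Old Gen headroom is left when a given
// percentage of the heap is reserved for the young generation.
final class HeapSizingSketch {

    static long youngMb(long heapMb, int youngPercent) {
        return heapMb * youngPercent / 100; // integer MB, rounded down
    }

    static long oldHeadroomMb(long heapMb, int youngPercent, long oldLiveMb) {
        // Old capacity is what remains after young; headroom is what the
        // long-lived data does not already occupy.
        return heapMb - youngMb(heapMb, youngPercent) - oldLiveMb;
    }

    public static void main(String[] args) {
        long heapMb = 46L * 1024;    // assumption: the question only implies ">40GB"
        long oldLiveMb = 26L * 1024; // steady-state Old Gen live set from the plot
        long spikeMb = 16L * 1024;   // approximate size of the daily spike
        int youngPercent = 35;       // value discussed in strategy A

        long young = youngMb(heapMb, youngPercent);
        System.out.println("young gen MB:       " + young);
        System.out.println("spike fits in young: " + (spikeMb <= young));
        System.out.println("old headroom MB:    " + oldHeadroomMb(heapMb, youngPercent, oldLiveMb));
    }
}
```

With these assumed numbers the spike barely fits into the young generation, while Old Gen is left with only about 4GB of headroom. That is the reason the answer considers this tuning fragile.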

B: (Application refactor): This is the best approach, if it's possible to do. G1 is designed under the assumption that most objects in Eden are dead at the time of a Young GC, and that's not true for your application. The large amount of data you're receiving seems to stay live for a few minutes & then get collected. Perhaps your code reads the entire incoming dataset into memory, and then only after it's fully in-memory, copies it to a database or does some other aggregation, before discarding it. Refactoring to process the data incrementally would mean the chunks die quickly in Eden & little promotion happens. This is the ideal operating condition for G1 and would eliminate the Full GC.
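
As a rough illustration of strategy B, the sketch below processes an incoming stream in small chunks instead of materializing the whole dataset first. Everything here is an assumption made for illustration: the actual structure of the application is unknown, and the names `IncrementalProcessor` and `processInChunks`, the chunk size, and the summing sink are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: hand data to a sink chunk by chunk so that each chunk
// becomes unreachable (and can die in Eden) soon after it has been processed.
final class IncrementalProcessor {

    // Returns the number of chunks handed to the sink.
    // The sink must not retain the list, or the benefit is lost.
    static <T> int processInChunks(Iterator<T> source, int chunkSize, Consumer<List<T>> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        List<T> chunk = new ArrayList<>(chunkSize);
        while (source.hasNext()) {
            chunk.add(source.next());
            if (chunk.size() == chunkSize) {
                sink.accept(chunk);
                chunk = new ArrayList<>(chunkSize); // previous chunk is now garbage
                chunks++;
            }
        }
        if (!chunk.isEmpty()) { // flush the final, partially filled chunk
            sink.accept(chunk);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        Iterator<Integer> ticks = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).iterator();
        long[] total = {0}; // running aggregate instead of a retained dataset
        int chunks = processInChunks(ticks, 4, chunk -> {
            for (int tick : chunk) {
                total[0] += tick;
            }
        });
        System.out.println(chunks + " chunks, total=" + total[0]); // prints "3 chunks, total=55"
    }
}
```

The point of the design is that only one chunk plus the running aggregate is reachable at any time, so the amount of live data a Young GC has to copy stays small regardless of the size of the spike.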
