We have a market app that streams market data, and it gets a spike of data on a daily basis.
Due to the allocations during the spike, the VM ends up doing a stop-the-world garbage collection.
From the usage graphs, it looks like all the allocation during the spike goes directly to old gen. Are there any G1GC parameters that can be used to change this behavior? I am thinking that by keeping the objects in Eden space we'll be able to avoid the STW GC.
I tried -XX:MaxTenuringThreshold=15, but it does not seem to have an effect.
Using Java 8 with G1GC.
Total Heap Usage
G1 Eden space: during the spike the usage is going down
G1 old gen: looks like all allocation during the spike is going directly to old gen
No, that does not answer my question, @Pino. I know that there are cases when STW GCs are necessary. The question is whether, in my particular usage scenario, there are any parameters that I can use to avoid the STW GC.
Think about it this way: if there was a parameter which can avoid stop-the-world pauses even when we don’t know anything about their cause, like in your example, what do you think would be the reason that this parameter is not on by default?
Oh no, I am not looking for a magical parameter that would avoid stop-the-world pauses. I have explained what is happening before the pause, and I am asking if there are any tunables that would help in this situation.
You didn’t explain what’s happening, you just described the symptoms we can see in these pictures. You also provided the conjecture that the symptoms imply that allocations went directly to the Old Gen, which is not backed by anything. The graph would look the same if just all young objects were promoted to old. But regardless of whether the conjecture is correct or not, it still doesn’t explain why this happens and without knowing why it happens, there is no way to fix the issue. Except for a --do-not-make-a-spike-a-day option, which does not exist.
The problem is likely that your application is allocating a very large amount of long-lived objects in a short amount of time. As other commenters have mentioned, you cannot stop all STW pauses, but given your large heap & the fact that you're posting here, it probably means you hit a very long STW pause, maybe even a Full GC, during the spike. Enabling detailed logs with -verbose:gc -XX:+PrintGCDetails would give us better details to help.
Since you didn't provide GC logs, it's difficult to say exactly, but there are some observations from your plots that can help us figure it out:
The Eden space usage gets very small, relative to the steady state.
Under normal conditions in G1, a "Young GC" is by far the most common GC, and it includes a STW pause that should ideally be small. Most of the points on your plots are probably Young GCs; when a GC event happens, the GC log prints the region sizes & a log analyzer/plotter can process them. In a Young GC, G1 starts at the reference roots of your application, follows references deeper and deeper to find all "live" objects, and copies them to Survivor space (or to Old, once they have survived enough collections).
Roughly speaking, because the length of a Young GC STW pause is proportional to the amount of live objects, G1 will shrink the size of Eden to try to ensure future pauses meet the latency target.
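If GC logs aren't available yet, the JVM's own counters give a rough picture of how often each collector runs. The following is a minimal sketch using the standard java.lang.management API; the class name GcCounters is just an illustration. On Java 8 with G1 the collectors are reported as "G1 Young Generation" and "G1 Old Generation", and a rising count on the latter is a hint that Full GCs are happening.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; only the java.lang.management calls are real JDK API.
final class GcCounters {

    /** One line per collector: name, number of collections, accumulated collection time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: count=%d, timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print once; in a real service this could be logged periodically.
        snapshot().forEach(System.out::println);
    }
}
```

Taking a snapshot before and after the daily spike and comparing the two shows how many collections of each kind the spike triggered. It is not a replacement for the detailed GC log.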
The Old Gen usage increases rapidly, by approx. 16GB, during the traffic spike. It is nearly monotonic, but contains a very small decrease midway. Setting MaxTenuringThreshold to 15 didn't improve things; as far as I know, 15 is already the default (and the maximum) for G1 in Java 8, so the flag changed nothing.
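To rule out guesswork about which value the JVM is actually running with, the flag can be read back at runtime. This is a small sketch, assuming a HotSpot JVM (the com.sun.management classes are HotSpot-specific); FlagCheck is a made-up class name. Running java -XX:+PrintFlagsFinal -version with the same options gives the same information from the command line.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; assumes a HotSpot JVM.
final class FlagCheck {

    /** Looks up a -XX option by name; throws IllegalArgumentException if the JVM has no such option. */
    static VMOption option(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name);
    }

    public static void main(String[] args) {
        VMOption o = option("MaxTenuringThreshold");
        // The origin tells whether the value is the built-in default or was set on the command line.
        System.out.println(o.getName() + " = " + o.getValue() + " (origin: " + o.getOrigin() + ")");
    }
}
```

If the value is the same with and without the flag on the command line, the flag cannot be the cause of any change in behavior.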
Since your application continues to allocate lots of memory quickly, and the Eden space is now smaller, it fills up fast. The time between when an object is allocated & when a Young GC processes it is now even shorter, meaning those objects have less time to die. On average, this means a higher fraction will need to be copied to Survivor. By default, Survivor space is only a small fraction of the size of Eden (governed by -XX:SurvivorRatio). When the fraction of live Eden objects is high, existing Survivor objects will get prematurely promoted to Old to free up space. Your application may allocate so much that objects promote directly from Young -> Old. This explains why you see a rapid increase in the size of Old Gen. And since objects don't stay in Young or Survivor long enough to reach even the default MaxTenuringThreshold, it also explains why setting it to 15 did nothing.
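The effect of a shrinking Eden can be seen with a back-of-the-envelope calculation: at a constant allocation rate, the time between Young GCs is roughly Eden size divided by allocation rate. The sketch below uses made-up numbers (500 MB/s, 8 GB vs. 1 GB Eden); none of them come from your plots.

```java
// Hypothetical illustration; the allocation rate and Eden sizes are assumed values.
final class YoungGcInterval {

    /** Seconds the application needs to fill Eden at a constant allocation rate. */
    static double secondsToFillEden(long edenBytes, long allocBytesPerSecond) {
        return (double) edenBytes / allocBytesPerSecond;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long rate = 500 * mb; // assumed allocation rate during the spike

        // A large Eden gives objects more time to die before the next Young GC ...
        System.out.printf("Eden 8192 MB: %.1f s between Young GCs%n",
                secondsToFillEden(8192 * mb, rate));
        // ... while a shrunken Eden is collected far more often, so more objects are still live.
        System.out.printf("Eden 1024 MB: %.1f s between Young GCs%n",
                secondsToFillEden(1024 * mb, rate));
    }
}
```

Any object that lives longer than that interval has to be copied at least once, and with little Survivor space available it ends up in Old.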
Normally, G1 reclaims Old incrementally, using a concurrent marking cycle followed by a series of Mixed collections. The small dip during the spike suggests at least one Mixed collection may have happened, but it freed up almost no space. Since there are a few more points on the plot before the big drop, that may mean no further Mixed collections ran, or they ran but freed up very little space. The large drop could be a Mixed collection, but given that you're asking about how to stop bad pause behavior, it's probably a Full GC. With your heap size (>40GB) and the amount of live data in your Old gen (~26GB), a Full GC would generally be quite long.
Suggested strategies:
A: (Avoid Full GC): If the large drop really is a Full GC, then because your Old gen live set is already quite large, you need to either increase your heap by at least ~5GB, or refactor your application to keep fewer objects in memory long term (reduce the steady-state 26GB Old size). Theoretically, if you set G1NewSizePercent to 35 (an experimental option in Java 8, so it also needs -XX:+UnlockExperimentalVMOptions) so that the ~16GB spike fits within Eden, the data could stay in Eden long enough to die, avoiding a long Young GC pause. This isn't likely to work well in practice, though, since a larger Eden leaves less room for Old and increases the chance of a Full GC. It also relies on the luck of Eden being almost empty exactly when the spike starts. Otherwise, large fractions of the spike will get copied to Old anyway & the Young GC pause will be large.
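The trade-off in option A can be made concrete with simple arithmetic. This sketch treats the young generation as a fixed percentage of the heap, which is a simplification (G1 resizes it dynamically, and Survivor space also comes out of the young share). The 48GB heap is an assumed figure, since the plots only show ">40GB"; the 26GB live set is the estimate from above.

```java
// Hypothetical sizing helper; the heap size is an assumption, not a measured value.
final class HeapSplit {

    /** Young generation size if it takes a fixed percentage of the heap. */
    static double youngGb(double heapGb, int youngPercent) {
        return heapGb * youngPercent / 100.0;
    }

    /** Space left in Old after subtracting the young share and the long-lived live set. */
    static double oldHeadroomGb(double heapGb, int youngPercent, double oldLiveGb) {
        return heapGb - youngGb(heapGb, youngPercent) - oldLiveGb;
    }

    public static void main(String[] args) {
        double heapGb = 48.0;    // assumed heap size
        double oldLiveGb = 26.0; // steady-state Old live set estimated from the plots

        for (int pct : new int[] {5, 35}) {
            System.out.printf("young=%d%%: young=%.1f GB, Old headroom=%.1f GB%n",
                    pct, youngGb(heapGb, pct), oldHeadroomGb(heapGb, pct, oldLiveGb));
        }
    }
}
```

The larger the young share, the less headroom is left in Old for anything that does get promoted, which is why this option only makes sense together with a bigger heap.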
B: (Application refactor): This is the best approach, if it's possible to do. G1 is designed under the assumption that most objects in Eden are dead at the time of a Young GC, and that's not true for your application. The large amount of data you're receiving seems to stay live for a few minutes & then get collected. Perhaps your code reads the entire incoming dataset into memory, and then only after it's fully in-memory, copies it to a database or does some other aggregation, before discarding it. Refactoring to process the data incrementally would mean the chunks die quickly in Eden & little promotion happens. This is the ideal operating condition for G1 and would eliminate the Full GC.
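As a rough illustration of option B, the sketch below consumes a stream in small batches and hands each batch to a sink (a database writer, an aggregator, etc.) before reading more. Everything here is hypothetical: the class, the method name and the sink are placeholders, since I don't know what your ingestion code looks like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of incremental processing; not based on the actual application code.
final class IncrementalIngest {

    /**
     * Reads the source in batches of at most batchSize items and passes each batch to the sink.
     * Returns the number of batches delivered.
     */
    static <T> int consumeInBatches(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                // Start a fresh list; the old batch becomes garbage unless the sink kept a reference.
                batch = new ArrayList<>(batchSize);
                batches++;
            }
        }
        if (!batch.isEmpty()) { // deliver the final, partially filled batch
            sink.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Iterator<String> ticks = Arrays.asList("t1", "t2", "t3", "t4", "t5").iterator();
        int n = consumeInBatches(ticks, 2, b -> System.out.println("flushing " + b));
        System.out.println(n + " batches");
    }
}
```

The point is that only one batch is reachable at any time, so most of the spike's data is already dead by the next Young GC. This only helps if the sink does not hold on to the batches itself.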