cms gc日志分析

CMS 垃圾收集器解释

Posted by gambol on March 16, 2016

CMS gc日志分析


本文适合稍微有一些jvm基础的人看

JVM内存结构

CMS,全称Concurrent Low Pause Collector,是jdk1.4后期版本开始引入的新gc算法,在jdk5和jdk6中得到了进一步改进,也是我们线上正在使用的垃圾收集器.

介绍cms之前, 先简单介绍一下JVM的内存结构. 下面这个图不错. 画出来了所有JVM里的内存块.

jvm-memory-segments1.jpg-17.6kB

如图所示, 对于分代管理的垃圾收集器, HEAP分成三块.

  1. Eden: 存放最新进来的数据. 也叫Young
  2. Survivor: 一般分成s0, 和 s1, ygc时,将Eden的数据收集到s0或者s1.
  3. Old: N次ygc之后,还存活的对象, 会挪到Old区. Old区又叫Tenured

method area(方法区)其实不属于heap(堆). 在JDK8之前, 方法区和堆放在一起, 并且在FGC时,一起进行收集(虽然基本收集不出什么垃圾来)

在JDK8时, 方法区抽离出来单独放在了一个metaspace区. 不再和heap怼在一起了.

cms步骤

阶段 说明 是否会stw
Initial Mark 单线程运行, 寻找离类加载器最近的live对象
Concurrent Marking 根据Initial Mark的结果的对象, 遍历Old区,找到所有的live对象.
Remark 重新检查Concurrent Mark里增加或者删除的对象
Concurrent Sweep 找到mark过程中所有的标记为 dead的对象, 进行清除
Resetting 重置对象的状态

cms gc的触发条件

  1. 程序调用System.gc() 不过线上一般会在启动参数里加上DisableExplicitGC禁止这种做死行为
  2. old区剩余空间达到阈值 可以通过在启动参数里指定 CMSInitiatingOccupancyFraction来控制. 默认情况下, jdk5里的CMSInitiatingOccupancyFraction 为 68%, jdk6/7里为92%
  3. 永久代剩余空间达到阈值 忘了这个事情吧.
  4. YGC时提示: “promotion failed” 这是因为YGC时,S区域放不下这些对象,而此时Old区也放不下. 此时不一定是Old区剩余空间不足,也可能是晋升的对象需要连续的空间,而Old区连续的空间不足导致的。这个时候,也会触发cms gc
  5. concurrent mode failure 这是因为CMS GC的时候同时YGC进行着,导致CMS完成前老年代空间被占满(full promotion guarantee failure),这个错误会导致Full GC,导致GC时间变长,可以调小CMSInitiatingOccupancyFraction或调小新生代, 扩大Old区可用区域

promotion failed的解法

CMS回收的碎片化问题,基本是个无解的问题。常见的处理方式主要有:

  1. 减少应用产生对象的数量:降低CMS碎片化的产生的可能。
  2. 减少应用产生对象的大小:大对象容易触发老年代不足。比如当前剩余80M,如果对象是60M,那么可以放下;如果对象是100M,就放不小。
  3. 减少应用对象的生命周期:尽量在年轻代回收掉。
  4. 扩大年老代的大小:这种情况可以缓解,但是只是扩大了空间,延长了发生的时间。但是这种模式也可能有一定问题,由于堆变得更大了,可能造成整理堆的时间较长,外围停顿时间更长。
  5. 不用CMS, 用G1

cms参数以及含义

常见的CMS相关参数是:

  1. -XX:ParallelGCThreads = 3 YGC的线程数.默认为 (cpu <= 8) ? cpu : 3 + ((cpu * 5) / 8)
  2. -XX:ParallelCMSThreads = 1, CMS回收的线程数
  3. -XX:CMSFullGCsBeforeCompaction = 1, 在N次CMS gc后, 对old区进行压缩, 减少内存碎片
  4. -XX:CMSInitiatingOccupancyFraction=70. 出发CMS gc的内存使用比例阈值. 不过在使用过程中, 一般都在比这个小的阈值时,开始进行cms gc
  5. -XX:+UseCMSInitiatingOccupancyOnly. 如果指定了这个选项, 就不会出现上面的这个现象
  6. -XX:CMSScavengeBeforeRemark=true 在remark之前,促发一次ygc,减少remark所需要的时间

线上使用的参数是:

-Xms3072m  最小heap大小
-Xmx3072m  最大heap大小
-XX:NewSize=675m   eden + Survivor 总共大小
-XX:PermSize=256m -XX:MaxPermSize=256M    perm大小
-server  以服务器模式运行(其实默认就是这个)
-XX:+DisableExplicitGC
-verbose:gc
-XX:+PrintGCDateStamp
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime 
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGC -XX:+UseConcMarkSweepGC
-XX:CMSFullGCsBeforeCompaction=1 (每次CMS都整理碎片)
-XX:+PrintSafepointStatistics
-XX:SurvivorRatio=8  (eden:s0:s1 = 8:1:1)
-XX:+TieredCompilation -XX:CICompilerCount=4 (分层编译, 最大有4个线程来编译, 成native code)

cms日志实例

initial-mark

2016-03-16T18:05:49.929+0800: 2613.504: [GC [1 CMS-initial-mark: 1227897K(2454528K)] 1261140K(3076608K), 0.0314290 secs] [Times: user=0.02 sys=0.01, real=0.03 secs]

CMS进入第一个阶段, initial-mark. 当时的情况是:old区总共大小为2454528K, 使用的大小为1227897K. 整个heap大小为3076608K, 使用了1261140K. 执行完成,总共花了0.03s

额,这里有一个问题是, 为什么使用比例才50%时就触发了cms呢?原因我也不知道.

concurrent-mark

2016-03-16T18:05:49.961+0800: 2613.536: [CMS-concurrent-mark-start]
2016-03-16T18:05:50.572+0800: 2614.147: [CMS-concurrent-mark: 0.611/0.611 secs] [Times: user=1.35 sys=0.26, real=0.61 secs]

进行concurrent-mark阶段.扫描live对象

concurrent-preclean

2016-03-16T18:05:50.572+0800: 2614.147: [CMS-concurrent-preclean-start]
2016-03-16T18:05:50.584+0800: 2614.159: [CMS-concurrent-preclean: 0.011/0.012 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]

并发预清理, 花了0.01sec

concurrent-abortable-preclean

JDK5之后, CMS 增加一个并发可中止预清理(concurrent abortable preclean)阶段. concurrent abortable preclean段,运行在concurrent-prelcean 和remark之间,直到获得所期望的eden区占用率。

在这个过程中,一般也会伴随着一个ygc,来清理eden区. 这个过程中不会stop the world

2016-03-16T18:05:50.584+0800: 2614.159: [CMS-concurrent-abortable-preclean-start]
2016-03-16T18:05:50.858+0800: 2614.433: Application time: 0.8973190 seconds
2016-03-16T18:05:50.860+0800: 2614.435: Total time for which application threads were stopped: 0.0021890 seconds
2016-03-16T18:05:51.001+0800: 2614.577: Application time: 0.1411400 seconds
2016-03-16T18:05:51.014+0800: 2614.589: Total time for which application threads were stopped: 0.0121590 seconds
2016-03-16T18:05:51.298+0800: 2614.873: Application time: 0.2846280 seconds
2016-03-16T18:05:51.309+0800: 2614.884: Total time for which application threads were stopped: 0.0106730 seconds
2016-03-16T18:05:51.329+0800: 2614.904: Application time: 0.0198960 seconds
2016-03-16T18:05:51.331+0800: 2614.906: Total time for which application threads were stopped: 0.0023600 seconds
2016-03-16T18:05:52.655+0800: 2616.230: Application time: 1.3236740 seconds
2016-03-16T18:05:52.659+0800: 2616.234: Total time for which application threads were stopped: 0.0042470 seconds
2016-03-16T18:05:53.646+0800: 2617.221: Application time: 0.9869170 seconds
2016-03-16T18:05:53.648+0800: 2617.223: [GC2016-03-16T18:05:53.648+0800: 2617.224: [ParNew: 578034K->20882K(622080K), 0.0305740 secs] 1805932K->1250183K(3076608K), 0.0316520 secs] [Times: user=0.10 sys=0.00, real=0.03 secs]
2016-03-16T18:05:53.680+0800: 2617.255: Total time for which application threads were stopped: 0.0342680 seconds
2016-03-16T18:05:53.775+0800: 2617.350: Application time: 0.0943570 seconds
2016-03-16T18:05:53.777+0800: 2617.352: Total time for which application threads were stopped: 0.0021280 seconds
2016-03-16T18:05:53.940+0800: 2617.515: Application time: 0.1635230 seconds
2016-03-16T18:05:53.949+0800: 2617.524: Total time for which application threads were stopped: 0.0087670 seconds
2016-03-16T18:05:53.955+0800: 2617.530: Application time: 0.0059520 seconds
2016-03-16T18:05:53.960+0800: 2617.535: Total time for which application threads were stopped: 0.0046460 seconds
2016-03-16T18:05:53.964+0800: 2617.539: Application time: 0.0045690 seconds
2016-03-16T18:05:53.967+0800: 2617.542: Total time for which application threads were stopped: 0.0026540 seconds
2016-03-16T18:05:53.967+0800: 2617.542: Application time: 0.0001580 seconds
2016-03-16T18:05:53.968+0800: 2617.544: Total time for which application threads were stopped: 0.0014360 seconds
2016-03-16T18:05:54.001+0800: 2617.576: Application time: 0.0328520 seconds
2016-03-16T18:05:54.014+0800: 2617.589: Total time for which application threads were stopped: 0.0129880 seconds
2016-03-16T18:05:55.057+0800: 2618.632: [CMS-concurrent-abortable-preclean: 4.346/4.473 secs] [Times: user=9.41 sys=1.56, real=4.48 secs]

remark

2016-03-16T18:05:55.057+0800: 2618.633: Application time: 1.0430630 seconds
2016-03-16T18:05:55.059+0800: 2618.634: [GC[YG occupancy: 300145 K (622080 K)]2016-03-16T18:05:55.059+0800: 2618.634: [Rescan (parallel) , 0.0618970 secs]2016-03-16T18:05:55.121+0800: 2618.696: [weak refs processing, 0.0299250 secs]2016-03-16T18:05:55.151+0800: 2618.726: [scrub string table, 0.0029210 secs] [1 CMS-remark: 1229300K(2454528K)] 1529446K(3076608K), 0.0956850 secs] [Times: user=0.28 sys=0.00, real=0.09 secs]
2016-03-16T18:05:55.155+0800: 2618.731: Total time for which application threads were stopped: 0.0980400 seconds

这个阶段会暂停应用, 多线程对内存重新进行标记. 暂停时间较长

concurrent-sweep

2016-03-16T18:05:55.156+0800: 2618.731: [CMS-concurrent-sweep-start]
2016-03-16T18:05:55.250+0800: 2618.825: Application time: 0.0946000 seconds
2016-03-16T18:05:55.252+0800: 2618.827: Total time for which application threads were stopped: 0.0021380 seconds
2016-03-16T18:05:56.252+0800: 2619.827: Application time: 1.0001840 seconds
2016-03-16T18:05:56.257+0800: 2619.832: Total time for which application threads were stopped: 0.0048040 seconds
2016-03-16T18:05:56.291+0800: 2619.866: [CMS-concurrent-sweep: 1.131/1.135 secs] [Times: user=2.46 sys=0.42, real=1.14 secs]

concorrent-reset

2016-03-16T18:05:56.291+0800: 2619.866: [CMS-concurrent-reset-start]
2016-03-16T18:05:56.302+0800: 2619.877: [CMS-concurrent-reset: 0.011/0.011 secs] [Times: user=0.03 sys=0.02, real=0.01 secs]
2016-03-16T18:05:56.621+0800: 2620.196: Application time: 0.3636370 seconds

参考

  1. jvm参数举例
  2. cms gc日志解释
  3. jvm内存管理基础
  4. cms gc日志解释2
  5. 一次jvm调优的经过
  6. JVM内存结构
  7. 一次CMS GC问题排查过程(理解原理+读懂GC日志)