去年十月份有位朋友从微信找到我,说他的程序内存要炸掉了。。。截图如下:

时间有点久,图片都被清理了,不过有点讽刺的是,自己的程序本身就是做监控的,结果自己出了问题,太尴尬了🤣🤣🤣
这个是甄别内存问题的第一步,通过 !address -summary 和 !eeheap -gc 两个命令基本就可以断定。
- 0:000> !address -summary
- Mapping file section regions...
- Mapping module regions...
- Mapping PEB regions...
- Mapping TEB and stack regions...
- Mapping heap regions...
- Mapping page heap regions...
- Mapping other regions...
- Mapping stack trace database regions...
- Mapping activation context regions...
-
- --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- Free 237 7ffc`1e222000 ( 127.985 TB) 99.99%
594 3`b9b20000 ( 14.901 GB) 95.96% 0.01% - Heap 370 0`12a2a000 ( 298.164 MB) 1.88% 0.00%
- Image 1248 0`0ee5a000 ( 238.352 MB) 1.50% 0.00%
- Stack 315 0`06780000 ( 103.500 MB) 0.65% 0.00%
- Other 13 0`001d7000 ( 1.840 MB) 0.01% 0.00%
- TEB 105 0`000d2000 ( 840.000 kB) 0.01% 0.00%
- PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
-
- --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_PRIVATE 1178 3`ce03d000 ( 15.219 GB) 98.00% 0.01%
- MEM_IMAGE 1409 0`0f6fd000 ( 246.988 MB) 1.55% 0.00%
- MEM_MAPPED 59 0`04694000 ( 70.578 MB) 0.44% 0.00%
-
- --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
- MEM_FREE 237 7ffc`1e222000 ( 127.985 TB) 99.99%
- MEM_COMMIT 2326 3`c7543000 ( 15.115 GB) 97.33% 0.01%
- MEM_RESERVE 320 0`1a88b000 ( 424.543 MB) 2.67% 0.00%
-
- 0:000> !eeheap -gc
- Number of GC Heaps: 1
- generation 0 starts at 0x0000009902B57670
- generation 1 starts at 0x0000009902A56810
- generation 2 starts at 0x00000095318C1000
- ephemeral segment allocation context: (0x0000009902D724A8, 0x0000009902D724C0)
- segment begin allocated committed allocated size committed size
- ...
- 00000098FFBE0000 00000098FFBE1000 0000009902D724A8 0000009902D7D000 0x31914a8(51975336) 0x319c000(52019200)
- Large object heap starts at 0x00000095418C1000
- segment begin allocated committed allocated size committed size
- 00000095418C0000 00000095418C1000 00000095475D8D98 00000095475D9000 0x5d17d98(97615256) 0x5d18000(97615872)
- Total Allocated Size: Size: 0x398e6cbc8 (15450164168) bytes.
- Total Committed Size: Size: 0x398e7b000 (15450222592) bytes.
- ------------------------------
- GC Allocated Heap Size: Size: 0x398e6cbc8 (15450164168) bytes.
- GC Committed Heap Size: Size: 0x398e7b000 (15450222592) bytes.
-
从输出信息看,好家伙。。。进程提交内存是 15G, 托管堆差不多也是 15G,这就说明当前是相对简单的 托管内存泄漏。
要想知道到底是什么在泄漏,可以先到托管堆上看看有没有什么异常的对象,使用 !dumpheap -stat 命令。
- 0:000> !dumpheap -stat
- Statistics:
- MT Count TotalSize Class Name
- ...
- 00007ff8815d0f88 7260233 290409320 System.Collections.ArrayList
- 00007ff8815e6830 7313696 326240826 System.String
- 000000952fbbd2b0 12141398 509369998 Free
- 00007ff880685cf0 7254983 928637824 System.Diagnostics.ProcessInfo
- 00007ff88065f7d0 7256845 2031916600 System.Diagnostics.Process
- 00007ff8815e6ea8 7391338 2230744600 System.Object[]
- 00007ff88068fa70 186800748 8966435904 System.Diagnostics.ThreadInfo
-
从卦象上来看,真的太奇怪了,如果大家了解 Process 类,应该知道 ProcessInfo 和 ThreadInfo 都是挂在 Process 下的,而且 ThreadInfo 对象高达 1.8亿,真的太🐂👃了,看样子程序是在不断的做 Process.Start 操作吧。
接下来要探究的问题是 ThreadInfo 到底正在被谁持有???可以挑几个看看它们的 !gcroot, 首尾法查了几个,都是没有引用根,如下所示:
- 0:000> !gcroot 0000009531e8f760
- Found 0 unique roots (run '!GCRoot -all' to see all roots).
- 0:000> !gcroot 0000009531e8f670
- Found 0 unique roots (run '!GCRoot -all' to see all roots).
- 0:000> !gcroot 0000009531e8f378
- Found 0 unique roots (run '!GCRoot -all' to see all roots).
-
既然对象没有引用根,为什么 GC 不回收它呢?这里就需要谈经验了,在我之前分析的很多关于内存泄漏 的dump中,我都是从 生产端 找问题,貌似还没有出现一个从 消费端 找问题的案例,参考模型如下:

生产者既然没问题,那消费端能有什么问题呢?大家可以想一想,托管堆的消费端起码有两个角色。
GC
Finalizer 线程
GC 肯定是不会出问题的,那就只能怀疑 Finalizer 线程出了什么问题,可以用 !t 命令把所有线程调出来看看。
- 0:000> !t
- ThreadCount: 9566
- UnstartedThread: 0
- BackgroundThread: 88
- PendingThread: 0
- DeadThread: 9471
- Hosted Runtime: no
- Lock
- ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
- 0 1 40e18 000000952fbdfe50 26020 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 2 2 41ce8 000000952fc0e4d0 2b220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Finalizer)
- 4 4 41cb4 000000954c324970 27220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 5 5 41cb8 000000954c36d1e0 2027220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 6 6 41c58 000000954c33f070 2027220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 7 7 41c38 000000954c33f840 27220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 8 8 41e0c 000000954c333580 27220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 9 9 41e2c 000000954c354440 27220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- 10 10 41f24 000000954c355840 27220 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 STA
- ...
- XXXX 9446 0 000000957a233410 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9447 0 0000009579f83e30 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9450 0 000000957a46dcf0 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9449 0 000000957967c6e0 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9448 0 000000957aee0010 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9452 0 00000095796824a0 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
- XXXX 9451 0 000000957af05df0 8039820 Preemptive 0000000000000000:0000000000000000 000000952fbd2e40 0 MTA (Threadpool Completion Port)
-
这不看不知道,一看吓一跳,当前进程有 9566 个,死线程高达 9471 个 ,以过往经验,这小子不用线程池,用 new Thread 咯, 🐂👃🦆 ,吐槽结束,再看下 Finalizer 线程正在做什么,使用 ~2s & !dumpstack
- 0:002> ~2s
- ntdll!NtWaitForSingleObject+0xa:
- 00007ff8`8e220c8a c3 ret
- 0:002> !dumpstack
- OS Thread Id: 0x41ce8 (2)
- Current frame: ntdll!NtWaitForSingleObject+0xa
- Child-SP RetAddr Caller, Callee
- 0000009549e4e120 00007ff88b591118 KERNELBASE!WaitForSingleObjectEx+0x94, calling ntdll!NtWaitForSingleObject
- 0000009549e4e1c0 00007ff88da3e334 combase!MTAThreadWaitForCall+0x54 [d:\9147\com\combase\dcomrem\channelb.cxx:5657], calling KERNELBASE!WaitForSingleObject
- 0000009549e4e210 00007ff88d8fe089 combase!MTAThreadDispatchCrossApartmentCall+0x75 [d:\9147\com\combase\dcomrem\chancont.cxx:193], calling combase!MTAThreadWaitForCall [d:\9147\com\combase\dcomrem\channelb.cxx:5619]
- 0000009549e4e240 00007ff88da3e13d combase!CRpcChannelBuffer::SendReceive2+0x64b [d:\9147\com\combase\dcomrem\channelb.cxx:4796], calling combase!MTAThreadDispatchCrossApartmentCall [d:\9147\com\combase\dcomrem\chancont.cxx:156]
- 0000009549e4e2b0 00007ff88e1bb6f7 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
- ...
- 0000009549e4f5d0 00007ff8827d79cd clr!ManagedThreadBase_DispatchOuter+0x75, calling clr!ManagedThreadBase_DispatchMiddle
- 0000009549e4f5e0 00007ff8828601af clr!EEConfig::GetConfigDWORD_DontUse_+0x3b, calling clr!EEConfig::GetConfiguration_DontUse_
- 0000009549e4f660 00007ff8828574fa clr!FinalizerThread::FinalizerThreadStart+0x10a, calling clr!ManagedThreadBase_DispatchOuter
- 0000009549e4f6a0 00007ff8827d55b9 clr!EEHeapFreeInProcessHeap+0x45, calling kernel32!HeapFreeStub
- 0000009549e4f700 00007ff882882e8f clr!Thread::intermediateThreadProc+0x86
- 0000009549e4f780 00007ff882882e6f clr!Thread::intermediateThreadProc+0x66, calling clr!_chkstk
- 0000009549e4f7c0 00007ff88dcd13d2 kernel32!BaseThreadInitThunk+0x22
- 0000009549e4f7f0 00007ff88e2003c4 ntdll!RtlUserThreadStart+0x34
-
从堆栈信息看,原来是终结器线程卡死了,从 MTAThreadDispatchCrossApartmentCall 方法看,貌似是 MTA 向 STA 做了一个调用,到这里有经验的朋友应该知道,这和 com 组件有关系了,也就是说 Finalizer 线程无法释放由 STA 线程创建的 COM 对象,那到底是哪一个线程创建了没有被合理释放的 COM 组件呢?
说实话,这个对COM组件不了解的话,很难找出答案,但天无绝人之路,当我回头看 线程列表 的时候,发现居然有 38 个 STA线程,截图如下:

这里面肯定有问题,接下来抽一个线程看看调用栈如何。
- 0:004> !clrstack
- OS Thread Id: 0x41cb4 (4)
- Child SP IP Call Site
- 000000954da1ee38 00007ff88e220c8a [HelperMethodFrame: 000000954da1ee38] System.Threading.Thread.SleepInternal(Int32)
- 000000954da1ef30 00007ff88138c20a *** WARNING: Unable to verify checksum for mscorlib.ni.dll
- System.Threading.Thread.Sleep(Int32)
- 000000954da1ef60 00007ff82322437f xxx.CFileLogTask.DoWork()
- 000000954da1f160 00007ff8232234d6 xxx.CTask.InitStart()
- 000000954da1f240 00007ff8813c31d3 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
- 000000954da1f310 00007ff8813c3064 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
- 000000954da1f340 00007ff8813c3032 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
- 000000954da1f390 00007ff8813bc812 System.Threading.ThreadHelper.ThreadStart()
- 000000954da1f5e8 00007ff8827d6bb3 [GCFrame: 000000954da1f5e8]
- 000000954da1f938 00007ff8827d6bb3 [DebuggerU2MCatchHandlerFrame: 000000954da1f938]
-
接下来反编译下 xxx.CFileLogTask.DoWork() 方法看看它是如何被 Thread 执行的。

到这里终于水落石出,罪魁祸首在 CurrThread.SetApartmentState(ApartmentState.STA); 这一句上,我也不知道为啥开个 Thread 还要给个 STA。。。
本次事故主要是因为在 STA 线程上用到了 COM 组件,导致让 MTA 模型的 Finalizer 线程去释放时被卡死,而这个Thread又没有用 Application.Run 启动消息循环,STA也是 Sleep 状态,我个人感觉两者无法通讯,给到朋友的建议是去掉 Thread 的 STA。
其实这里有一个很好的点就是,当内存暴涨,不一定是 生产端 的问题,也有可能是 消费端 。