• 8.稳定性专题


    1. anr原理

    • android消息机制
    • 拆炸弹,埋炸弹
      • inpu超时5s
      • 广播超时 前台15s 后台60s
      • 服务超时
    • 原因
      • 应用
        • 函数阻塞:死循环,主线程io、 处理大数据
        • 锁:主线程等待子线程锁,
        • 内存紧张 :系统分配给一个应用的内存是有上限的,长期处于内存紧张,会导致频繁内存交换,进而导致应用的一些操作超时
      • 系统
        • CPU被抢占:一般来说,前台在玩游戏,可能会导致你的后台广播被抢占CPU
        • 系统服务无法及时响应:比如获取系统联系人等,系统的服务都是Binder机制,服务能力也是有限的,有可能系统服务长时间不响应导致ANR
        • 其他应用占用的大量内存

    怎么着手去处理呢?

    2 ANR in

    • mainlog 关键字 ANR in
    //anr 发生时间
    06-16 16:16:28.590  1853  2073 E ActivityManager: ANR in com.android.camera (com.android.camera/.Camera)
    // anr 进程pid  如果为0,找个进程就被lmk杀死或者出现了出现了crash
    06-16 16:16:28.590  1853  2073 E ActivityManager: PID: 27661
    //原因是input timeout
    06-16 16:16:28.590  1853  2073 E ActivityManager: Reason: Input dispatching timed out (com.android.camera/com.android.camera.Camera, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 24.  Wait queue head age: 5511.1ms.)
    //cpu负载  Linux系统中  uptime可以得到一段时间负载
    //Load后面的三个数字的意思分别是1、5、15 分钟内正在使用和等待使用CPU 的活动进程的平均数
    06-16 16:16:28.590  1853  2073 E ActivityManager: Load: 16.25 / 29.48 / 38.33
    //负载信息抓取在ANR发生之后的8058ms
    06-16 16:16:28.590  1853  2073 E ActivityManager: CPU usage from 0ms to 8058ms later:
    //cpu占用率
    06-16 16:16:28.590  1853  2073 E ActivityManager:   58% 291/mediaserver: 51% user + 6.7% kernel / faults: 2457 minor 4 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   27% 317/mm-qcamera-daemon: 21% user + 5.8% kernel / faults: 15965 minor
    ....
    06-16 16:16:28.590  1853  2073 E ActivityManager:   4.4% 277/surfaceflinger: 1.8% user + 2.6% kernel / faults: 14 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   4% 203/mmcqd/0: 0% user + 4% kernel
    ....
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 29330/ksoftirqd/1: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager: 96% TOTAL: 56% user + 29% kernel + 6.3% iowait + 4.1% softirq
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    2.1 cpu

    最后一行名词解释:

    • user:用户态,kernel:内核态
    • faults:内存缺页,minor——轻微的,major——重度,需要从磁盘拿数据
    • iowait:IO使用(等待)占比
    • rq:硬中断,softirq:软中断

    注意:

    • iowait占比很高,意味着有很大可能,是io耗时导致ANR,具体进一步查看有没有进程faults major比较多。
    • 单进程CPU的使用率并不是以100%为上限,而是有几个核,就有百分之几百,如8核上限为800%。
    • ActivityManager: Load: 16.25 / 29.48 / 38.33:正在使用和等待使用CPU 的活动进程的平均数
    • 平均任务负载数和CPU占用率并没有必然联系,可以作为参考信息。可以通过分析各进程的CPU时间占用率,来判断是否为某些进程长期占用CPU导致该进程无法获取到足够的CPU处理时间,而导致发生ANR

    看ams源码就知道,打印套路是

    ActivityManager: Load: [Load1] / [Load5] / [Load15]
    
    ActivityManager: CPU usage from [上次采样与现在的时间差] ms to [当前采样与现在的时间差] ms ago/later:
    
    ActivityManager: [总的CPU时间占用率]% [PID]/[Process Name]: [用户态CPU时间占用率]% user + [kenelCPU时间占用率]% kernel + [IO等待CPU时间占用率]% iowait + [硬中断CPU时间占用率]% irq + [软中断CPU时间占用率]% softirq / faults: [次要页错误/主要页错误]minor/major
    
    ActivityManager: .....
    
    ActivityManager: [CPU时间占用率合计统计]
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    主要看cpu top进程
    top进程:

    • kswapd0 cpu占用率偏高,系统内存偏低,不停kill进程并发生内存交换,系统整体运行会缓慢
    • logd CPU占用率偏高,也会引起系统卡顿和ANR,因为各个进程输出LOG的操作被阻塞从而执行的极为缓慢。
    • Vold占用CPU过高,会引起系统卡顿和ANR,请负责存储的同学先调查
    • qcom.sensor CPU占用率过高,会引起卡顿,请系统同学调查
    • 应用自身CPU占用率较高,高概率应用自身问题
    • 系统CPU占用率不高,但主线程在等待一个锁,高概率应用自身问题
    • 应用处于D状态,发生ANR,如果最后的操作是refriger,那么是应用被冻结了,正常情况下是功耗优化引起的。

    2.2 内存信息

    trace文件中的打印:

    • Total number of allocations 476778  进程创建到现在一共创建了多少对象
    • Total bytes allocated 52MB 进程创建到现在一共申请了多少内存
    • Total bytes freed 52MB   进程创建到现在一共释放了多少内存
    • Free memory 777KB    不扩展堆的情况下可用的内存
    • Free memory until GC 777KB  GC前的可用内存
    • Free memory until OOME 383MB  OOM之前的可用内存
    • Total memory 当前总内存(已用+可用)
    • Max memory 384MB 进程最多能申请的内存

    除了trace文件中有内存信息,普通的eventlog日志中,也有内存信息(不一定打印)

    04-02 22:00:08.195 1531 1544 I am_meminfo: [350937088,41086976,492830720,427937792,291887104]

    • Cached
    • Free,
    • Zram,
    • Kernel,Native

    Cached+Free的内存代表着当前整个手机的可用内存,如果值很小,意味着处于内存紧张状态。一般低内存的判定阈值为:4G 内存手机以下阀值:350MB,以上阀值则为:450MB

    ps:如果ANR时间点前后,日志里有打印onTrimMemory,也可以作为内存紧张的一个参考判断

    3 trace文件

    • main log可以得到anr基本信息,要得到阻塞的地方,还得靠trace文件,一般在/data/anr目录下,可在trace文件中搜索主线程堆栈
    • 堆栈信息是最重要的一个信息,展示了ANR发生的进程当前所有线程的状态。
    ----- pid 27661 at 2017-06-16 16:16:20 -----
    Cmd line: com.android.camera
    "main" prio=5 tid=1 Waiting
     | group="main" sCount=1 dsCount=0 obj=0x75a4b5c8 self=0xb4cf6500
     | sysTid=27661 nice=-10 cgrp=default sched=0/0 handle=0xb6f6cb34
     | state=S schedstat=( 11242036155 8689191757 38520 ) utm=895 stm=229 core=0 HZ=100
     | stack=0xbe4ea000-0xbe4ec000 stackSize=8MB
     | held mutexes=
     at java.lang.Object.wait!(Native method)
     -waiting on <0x09e6a059> (a java.lang.Object)
     at java.lang.Thread.parkFor$(Thread.java:1220)
     -locked <0x09e6a059> (a java.lang.Object)
     at sun.misc.Unsafe.park(Unsafe.java:299)
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:810)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:970)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1278)
     at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:203)// block点
     at android.app.SharedPreferencesImpl$EditorImpl$1.run(SharedPreferencesImpl.java:366)
     at android.app.QueuedWork.waitToFinish(QueuedWork.java:88)
     at android.app.ActivityThread.handleStopActivity(ActivityThread.java:3605)
     at android.app.ActivityThread.access$1300(ActivityThread.java:153)
     at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1399)
     at android.os.Handler.dispatchMessage(Handler.java:102)
     at android.os.Looper.loop(Looper.java:154)
     at android.app.ActivityThread.main(ActivityThread.java:5528)
     at java.lang.reflect.Method.invoke!(Native method)
     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:740)
     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:630)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    字段含义
    tid=1线程号
    sysTid=27661主线程的线程号和进程号相同
    Waiting线程状态,其中state也是线程状态,如果state=D代表底层被blocked了。
    nicenice值越小,则优先级越高。因为是主线程此处nice=-10, 可以看到优先级很高了
    schedstat括号中的3个数字,依次是Running, Runable, Switch,Running时间。Running时间:CPU运行的时间,单位ns。 Runable时间:RQ队列的等待时间,单位ns。 Switch次数:CPU调度切换次数
    utm该线程在用户态所执行的时间,单位是jiffies
    stm该线程在内核态所执行的时间,单位是jiffies
    sCount此线程被挂起的次数
    dsCount线程被调试器挂起的次数,当一个进程被调试后,sCount会重置为0,调试完毕后sCount会根据是否被正常挂起增长,但是dsCount不会被重置为0,所以dsCount也可以用来判断这个线程是否被调试过
    self线程本身的地址

    线程状态?

    状态说明
    THREAD_ZOMBIE0TERMINATED
    THREAD_RUNNING1RUNNABLE or running now
    THREAD_TIMED_WAIT2TIMED_WAITING in Object.wait()
    THREAD_MONITOR3BLOCKED on a monitor
    THREAD_INITIALIZING5allocated not yet running
    THREAD_STARTING6started not yet on thread list
    THREAD_NATIVE7off in a JNI native method
    THREAD_VMWAIT8waiting on a VM resource
    THREAD_SUSPENDED9suspended usually by GC or debugger

    4 线程状态

    • java线程状态
      ![在这里插入图片描述](https://img-blog.csdnimg.cn/870c5458ddea449990a85ffa1da1c897.png#pic_center
      在这里插入图片描述但我们经常看到线程状态native,这是怎么回事呢,其实native是cpp代码的定义的线程状态 对应关系表

    在这里插入图片描述

    • main线程处于 BLOCK、WAITING、TIMEWAITING状态,那基本上是函数阻塞导致ANR
    • main线程无异常,则应该排查CPU负载和内存环境

    5 典型案例分析

    有了 234 节的日志分析,可以了解怎么着手分析日志了。

    5.1 主线程无卡顿,正常堆栈

    “main” prio=5 tid=1 Native
    | group=“main” sCount=1 dsCount=0 flags=1 obj=0x74b38080 self=0x7ad9014c00
    | sysTid=23081 nice=0 cgrp=default sched=0/0 handle=0x7b5fdc5548
    | state=S schedstat=( 284838633 166738594 505 ) utm=21 stm=7 core=1 HZ=100
    | stack=0x7fc95da000-0x7fc95dc000 stackSize=8MB
    | held mutexes=
    kernel: __switch_to+0xb0/0xbc
    kernel: SyS_epoll_wait+0x288/0x364
    kernel: SyS_epoll_pwait+0xb0/0x124
    kernel: cpu_switch_to+0x38c/0x2258
    native: #00 pc 000000000007cd8c /system/lib64/libc.so (__epoll_pwait+8)
    native: #01 pc 0000000000014d48 /system/lib64/libutils.so (android::Looper::pollInner(int)+148)
    native: #02 pc 0000000000014c18 /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
    native: #03 pc 00000000001275f4 /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
    at android.os.MessageQueue.nativePollOnce(Native method)
    at android.os.MessageQueue.next(MessageQueue.java:330)
    at android.os.Looper.loop(Looper.java:169)
    at android.app.ActivityThread.main(ActivityThread.java:7073)
    at java.lang.reflect.Method.invoke(Native method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:536)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:876)

    • android_os_MessageQueue_nativePollOnce 很正常的空闲堆栈(android 消息机制),表明主线程正在等待新的消息
    • 可能有两个原因
      • 该ANR是CPU抢占或内存紧张等其他因素引起,可以去分析cpu 内存情况
      • 这份ANR日志抓取的时候,主线程已经恢复正常 ,可以关注抓取日志的时间和ANR发生的时间是否相隔过久,时间过久这个堆栈就没有分析意义了

    5.2 主线程耗时

    “main” prio=5 tid=1 Runnable
    | group=“main” sCount=0 dsCount=0 flags=0 obj=0x72deb848 self=0x7748c10800
    | sysTid=8968 nice=-10 cgrp=default sched=0/0 handle=0x77cfa75ed0
    | state=R schedstat=( 24783612979 48520902 756 ) utm=2473 stm=5 core=5 HZ=100
    | stack=0x7fce68b000-0x7fce68d000 stackSize=8192KB
    | held mutexes= “mutator lock”(shared held)
    at com.example.test.MainActivity$onCreate$2.onClick(MainActivity.kt:20)——关键行!!!
    at android.view.View.performClick(View.java:7187)
    at android.view.View.performClickInternal(View.java:7164)
    at android.view.View.access 3500 ( V i e w . j a v a : 813 ) a t a n d r o i d . v i e w . V i e w 3500(View.java:813) at android.view.View 3500(View.java:813)atandroid.view.ViewPerformClick.run(View.java:27640)
    at android.os.Handler.handleCallback(Handler.java:883)
    at android.os.Handler.dispatchMessage(Handler.java:100)
    at android.os.Looper.loop(Looper.java:230)
    at android.app.ActivityThread.main(ActivityThread.java:7725)
    at java.lang.reflect.Method.invoke(Native method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:526)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1034)

    主线程正处于执行状态,看堆栈信息可知不是处于空闲状态,发生ANR是因为一处click监听函数里执行了耗时操作

    5.3 主线程被锁阻塞 (主线程等待锁)

    "main" prio=5 tid=1 Blocked
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x72deb848 self=0x7748c10800
      | sysTid=22838 nice=-10 cgrp=default sched=0/0 handle=0x77cfa75ed0
      | state=S schedstat=( 390366023 28399376 279 ) utm=34 stm=5 core=1 HZ=100
      | stack=0x7fce68b000-0x7fce68d000 stackSize=8192KB
      | held mutexes=
      at com.example.test.MainActivity$onCreate$1.onClick(MainActivity.kt:15)
      - waiting to lock <0x01aed1da> (a java.lang.Object) held by thread 3 ———关键行!!!  等待tid=3的锁 0x01aed1da
      at android.view.View.performClick(View.java:7187)
      at android.view.View.performClickInternal(View.java:7164)
      at android.view.View.access$3500(View.java:813)
      at android.view.View$PerformClick.run(View.java:27640)
      at android.os.Handler.handleCallback(Handler.java:883)
      at android.os.Handler.dispatchMessage(Handler.java:100)
      at android.os.Looper.loop(Looper.java:230)
      at android.app.ActivityThread.main(ActivityThread.java:7725)
      at java.lang.reflect.Method.invoke(Native method)
      at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:526)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1034)
       
      ........省略N行.....
       
      "WQW TEST" prio=5 tid=3 TimeWating
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x12c44230 self=0x772f0ec000
      | sysTid=22938 nice=0 cgrp=default sched=0/0 handle=0x77391fbd50
      | state=S schedstat=( 274896 0 1 ) utm=0 stm=0 core=1 HZ=100
      | stack=0x77390f9000-0x77390fb000 stackSize=1039KB
      | held mutexes=
      at java.lang.Thread.sleep(Native method)
      - sleeping on <0x043831a6> (a java.lang.Object)
      at java.lang.Thread.sleep(Thread.java:440)
      - locked <0x043831a6> (a java.lang.Object)
      at java.lang.Thread.sleep(Thread.java:356)
      at com.example.test.MainActivity$onCreate$2$thread$1.run(MainActivity.kt:22)
      - locked <0x01aed1da> (a java.lang.Object)————————————————————关键行!!!
      at java.lang.Thread.run(Thread.java:919)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36

    其中等待的锁是<0x01aed1da>,这个锁的持有者是线程 3。进一步搜索 “tid=3” 找到线程3, 发现它正在TimeWating。

    那么ANR的原因找到了:线程3持有了一把锁,并且自身长时间不释放,主线程等待这把锁发生超时。在线上环境中,常见因锁而ANR的场景是SharePreference写入。

    5.4 cpu被抢占

    CPU usage from 0ms to 10625ms later (2020-03-09 14:38:31.633 to 2020-03-09 14:38:42.257):
    543% 2045/com.alibaba.android.rimet: 54% user + 89% kernel / faults: 4608 minor 1 major ————关键行!!!
    99% 674/android.hardware.camera.provider@2.4-service: 81% user + 18% kernel / faults: 403 minor
    24% 32589/com.wang.test: 22% user + 1.4% kernel / faults: 7432 minor 1 major
    …省略N行…

    钉钉的进程,占据CPU高达543%,抢占了大部分CPU资源,因而导致发生ANR。

    5.5 内存紧张

    CPU和堆栈都很正常(不贴出来了),仍旧发生ANR,考虑是内存紧张。

    从CPU第一行信息可以发现,ANR的时间点是2020-10-31 22:38:58.468—CPU usage from 0ms to 21752ms later (2020-10-31 22:38:58.468 to 2020-10-31 22:39:20.220)

    接着去系统日志里搜索am_meminfo, 这个没有搜索到。再次搜索onTrimMemory,果然发现了很多条记录;

    10-31 22:37:19.749 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
    10-31 22:37:33.458 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
    10-31 22:38:00.153 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
    10-31 22:38:58.731 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
    10-31 22:39:02.816 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
    
    • 1
    • 2
    • 3
    • 4
    • 5

    可以看出,在发生ANR的时间点前后,内存都处于紧张状态,level等级是80,查看Android API 文档;

    /**
        * Level for {@link #onTrimMemory(int)}: the process is nearing the end
        * of the background LRU list, and if more memory isn't found soon it will
        * be killed.
        */
       static final int TRIM_MEMORY_COMPLETE = 80;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    可知80这个等级是很严重的,应用马上就要被杀死,被杀死的这个应用从名字可以看出来是桌面,连桌面都快要被杀死,那普通应用能好到哪里去呢?

    一般来说,发生内存紧张,会导致多个应用发生ANR,所以在日志中如果发现有多个应用一起ANR了,可以初步判定,此ANR与你的应用无关

    5.6 系统服务超时

    系统服务超时一般会包含BinderProxy.transactNative关键字,请看如下日志:

    "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x727851e8 self=0x78d7060e00
      | sysTid=4894 nice=0 cgrp=default sched=0/0 handle=0x795cc1e9a8
      | state=S schedstat=( 8292806752 1621087524 7167 ) utm=707 stm=122 core=5 HZ=100
      | stack=0x7febb64000-0x7febb66000 stackSize=8MB
      | held mutexes=
      kernel: __switch_to+0x90/0xc4
      kernel: binder_thread_read+0xbd8/0x144c
      kernel: binder_ioctl_write_read.constprop.58+0x20c/0x348
      kernel: binder_ioctl+0x5d4/0x88c
      kernel: do_vfs_ioctl+0xb8/0xb1c
      kernel: SyS_ioctl+0x84/0x98
      kernel: cpu_switch_to+0x34c/0x22c0
      native: #00 pc 000000000007a2ac  /system/lib64/libc.so (__ioctl+4)
      native: #01 pc 00000000000276ec  /system/lib64/libc.so (ioctl+132)
      native: #02 pc 00000000000557d4  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+252)
      native: #03 pc 0000000000056494  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
      native: #04 pc 00000000000562d0  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+216)
      native: #05 pc 000000000004ce1c  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
      native: #06 pc 00000000001281c8  /system/lib64/libandroid_runtime.so (???)
      native: #07 pc 0000000000947ed4  /system/framework/arm64/boot-framework.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+196)
      at android.os.BinderProxy.transactNative(Native method) ————————————————关键行!!!
      at android.os.BinderProxy.transact(Binder.java:804)
      at android.net.IConnectivityManager$Stub$Proxy.getActiveNetworkInfo(IConnectivityManager.java:1204)—关键行!
      at android.net.ConnectivityManager.getActiveNetworkInfo(ConnectivityManager.java:800)
      at com.xiaomi.NetworkUtils.getNetworkInfo(NetworkUtils.java:2)
      at com.xiaomi.frameworkbase.utils.NetworkUtils.getNetWorkType(NetworkUtils.java:1)
      at com.xiaomi.frameworkbase.utils.NetworkUtils.isWifiConnected(NetworkUtils.java:1
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28

    从堆栈可以看出获取网络信息发生了ANR:getActiveNetworkInfo。

    前文有讲过:系统的服务都是Binder机制(16个线程),服务能力也是有限的,有可能系统服务长时间不响应导致ANR。如果其他应用占用了所有Binder线程,那么当前应用只能等待

    可进一步搜索:blockUntilThreadAvailable关键字:

    at android.os.Binder.blockUntilThreadAvailable(Native method)
    
    • 1

    如果有发现某个线程的堆栈,包含此字样,可进一步看其堆栈,确定是调用了什么系统服务。此类ANR也是属于系统环境的问题,如果某类型机器上频繁发生此问题,应用层可以考虑规避策略

    5.7对端block

    首先需要去看的就是traces.txt 中ANR进程的主线程的栈。
    以settting的这次ANR为例:

    "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500
      | sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00
      | state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100
      | stack=0xbe283000-0xbe285000 stackSize=8MB
      | held mutexes=
      native: #00 pc 000410ac  /system/lib/libc.so (__ioctl+8)
      native: #01 pc 000477e5  /system/lib/libc.so (ioctl+14)
      native: #02 pc 0001e7c5  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
      native: #03 pc 0001ee17  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38)
      native: #04 pc 0001efcd  /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124)
      native: #05 pc 00019fb7  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
      native: #06 pc 00086de9  /system/lib/libandroid_runtime.so (???)
      native: #07 pc 00d94629  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
      at android.os.BinderProxy.transactNative(Native method)
      at android.os.BinderProxy.transact(Binder.java:503)
      //此处block
      at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410)
      at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174)
      at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57)
      at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361)
      at android.app.Fragment.performCreate(Fragment.java:2202)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942)
      at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148)
      at android.app.BackStackRecord.run(BackStackRecord.java:793)
      at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535)
      at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562)
      at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084)
      at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657)
      at android.app.Activity.performCreate(Activity.java:6251)
      at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110)
      at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370)
      at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477)
      at android.app.ActivityThread.-wrap11(ActivityThread.java:-1)
      at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:148)
      at android.app.ActivityThread.main(ActivityThread.java:5438)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652)
      主线程被对端block,而对端是在systemserver中的NetworkPolicyManager。
      接下来继续去查NetworkPolicyManager为何会被block。
      通常情况下,可以搜索getNetworkPolicies,一般的对端的函数并不会修改函数名:
      看到我们的对端:
      "Binder_4" prio=5 tid=56 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200
      | sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930
      | state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100
      | stack=0x9fc04000-0x9fc06000 stackSize=1014KB
      | held mutexes=
      at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696)
      - waiting to lock <0x07439315> (a java.lang.Object) held by thread 35
      at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145)
      at android.os.Binder.execTransact(Binder.java:453
      被Tid=35的人拿住了一把锁(0x07439315),那么继续看tid=35是谁,有两种方法:
      1. 搜索tid=35
      2. 搜索0x07439315,找到  - locked <0x07439315> (a java.lang.Object)
      
      "NetworkPolicy" prio=5 tid=35 TimedWaiting
      | group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700
      | sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930
      | state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100
      | stack=0xa0e31000-0xa0e33000 stackSize=1038KB
      | held mutexes=
      at java.lang.Object.wait!(Native method)
      - waiting on <0x02580c1b> (a java.lang.Object)
      at java.lang.Thread.parkFor$(Thread.java:1220)
      - locked <0x02580c1b> (a java.lang.Object)
      at sun.misc.Unsafe.park(Unsafe.java:299)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
      at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634)
      at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340)
      at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712)
      - locked <0x0b0f91b8> (a java.lang.Object)
      at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421)
      at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232)
      at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060)
      - locked <0x07439315> (a java.lang.Object)
      at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881)
      at android.os.Handler.handleCallback(Handler.java:739)
      at android.os.Handler.dispatchMessage(Handler.java:95)
      at android.os.Looper.loop(Looper.java:148)
      at android.os.HandlerThread.run(HandlerThread.java:61)
      
      可以看到,NetworkPolicy在通过NativeDaemonConnector和netd通信(setInterfaceQuota)
      我们结合log来看下是否有有用信息,按照之前的经验,netd在执行完cmd的时候,会打印出slow operation。
      在发生ANR的前后,查找netd相关的打印:
      06-19 15:29:00.997  1235  1270 I am_anr  : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
      06-19 15:29:05.683  1235  2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms
      06-19 15:29:05.723  1235  2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100]
      从eventslog中还可以发现,setting出现了两次连续的ANR,而上下文中都有类似上面的dvm_lock_sample NetworkPolicyManagerService相关的告警。  
      因此虽然15:28的这次ANR并没有打出有用的栈,但是我们还是可以猜测出这两次ANR的原因都是netd的cmd耗时太久导致的(在主线程的调用路径上存在必然的block)
      那么,在netd可能无法修改的情况下,我们应该如何去resolve这个问题呢。
      将可能存在block的操作放到非UI线程中去做。
      
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77
    • 78
    • 79
    • 80
    • 81
    • 82
    • 83
    • 84
    • 85
    • 86
    • 87
    • 88
    • 89
    • 90
    • 91
    • 92
    • 93
    • 94
    • 95
    • 96
    • 97
    • 98
    • 99
    • 100

    再分析一个

      打开dropbox中对应的system_app_anrxxxx:
         查看主线程的栈:
      "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
      | sysTid=25390 nice=-1 cgrp=default sched=3/0 handle=0xb6f18b34
      | state=S schedstat=( 0 0 0 ) utm=81 stm=12 core=2 HZ=100
      | stack=0xbe78b000-0xbe78d000 stackSize=8MB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/25390/stack)
      native: #00 pc 000422d0  /system/lib/libc.so (__ioctl+8)
      native: #01 pc 00047825  /system/lib/libc.so (ioctl+14)
      native: #02 pc 0001e835  /system/lib/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+132)
      native: #03 pc 0001ee93  /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+38)
      native: #04 pc 0001f049  /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
      native: #05 pc 00019fe3  /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
      native: #06 pc 0008a035  /system/lib/libandroid_runtime.so (???)
      native: #07 pc 00d78869  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
      at android.os.BinderProxy.transactNative(Native method)
      at android.os.BinderProxy.transact(Binder.java:510)
      at android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
      at android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
      at android.os.Environment.getExternalStorageDirectory(Environment.java:354)
      at com.huawei.common.utils.PathUtils.(PathUtils.java:51)
      at com.huawei.common.utils.PathUtils.getWorkspacePath(PathUtils.java:80)
      at com.huawei.common.components.log.Logger.(Logger.java:37)
      at com.huawei.common.components.log.Logger.i(Logger.java:162)
      at com.huawei.hwvplayer.data.db.DbProvider.attachInfo(DbProvider.java:89)
      at android.app.ActivityThread.installProvider(ActivityThread.java:5279)
      at android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
      at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
      at android.app.ActivityThread.access$1600(ActivityThread.java:165)
      at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:188)
      at android.app.ActivityThread.main(ActivityThread.java:5578)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
      
      主线程尝试去调用MountService的 getVolumeList接口,可能没有返回。
      查看system_server中相关的,搜索getVolumeList
      发现systemserver中有三个Binder线程和主线程被block,我们的对端是哪个Binder线程暂时无法确认,但这并不影响我们继续分析,因为他们被blockd的路径是一致的:
      
      
    "main" prio=5 tid=1 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
      | sysTid=22735 nice=-2 cgrp=default sched=0/0 handle=0xb6f18b34
      | state=S schedstat=( 0 0 0 ) utm=432 stm=85 core=1 HZ=100
      | stack=0xbe78b000-0xbe78d000 stackSize=8MB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
      at android.os.storage.StorageManager.getVolumeList(StorageManager.java:858)
      at android.os.storage.StorageManager.getPrimaryVolume(StorageManager.java:906)
      at com.android.server.usb.UsbDeviceManager.systemReady(UsbDeviceManager.java:327)
      at com.android.server.usb.UsbService.systemReady(UsbService.java:181)
      at com.android.server.usb.UsbService$Lifecycle.onBootPhase(UsbService.java:78)
      at com.android.server.SystemServiceManager.startBootPhase(SystemServiceManager.java:135)
      at com.android.server.SystemServer$3.run(SystemServer.java:1489)
      at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:12417)
      at com.android.server.am.HwActivityManagerService.systemReady(HwActivityManagerService.java:960)
      at com.android.server.SystemServer.startOtherServices(SystemServer.java:1485)
      at com.android.server.SystemServer.run(SystemServer.java:381)
      at com.android.server.SystemServer.main(SystemServer.java:272)
      at java.lang.reflect.Method.invoke!(Native method)
      at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
      
      "Binder_8" prio=5 tid=76 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x13bd60a0 self=0x9c1abe00
      | sysTid=25191 nice=-1 cgrp=default sched=0/0 handle=0x97158930
      | state=S schedstat=( 0 0 0 ) utm=9 stm=7 core=2 HZ=100
      | stack=0x9705c000-0x9705e000 stackSize=1014KB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
      at android.os.Binder.execTransact(Binder.java:453)
      
      "Binder_2" prio=5 tid=8 Blocked
      | group="main" sCount=1 dsCount=0 obj=0x12cac0a0 self=0xaebf0300
      | sysTid=22761 nice=-1 cgrp=default sched=0/0 handle=0xaef7d930
      | state=S schedstat=( 0 0 0 ) utm=41 stm=25 core=0 HZ=100
      | stack=0xaee81000-0xaee83000 stackSize=1014KB
      | held mutexes=
      at com.android.server.MountService.getVolumeList(MountService.java:2759)
      - waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
      at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
      at android.os.Binder.execTransact(Binder.java:453)
       
      他们均是被tid=40的人block,按照上面的方法搜索tid=40或者 0x0eeb54f1得到block的人:
      "MountService" prio=5 tid=40 TimedWaiting
      | group="main" sCount=1 dsCount=0 obj=0x132c1160 self=0x9ce57400
      | sysTid=23512 nice=0 cgrp=default sched=0/0 handle=0x9a239930
      | state=S schedstat=( 0 0 0 ) utm=29 stm=2 core=2 HZ=100
      | stack=0x9a137000-0x9a139000 stackSize=1038KB
      | held mutexes=
      at java.lang.Object.wait!(Native method)
      - waiting on <0x00fea1f3> (a java.lang.Object)
      at java.lang.Thread.parkFor$(Thread.java:1235)
      - locked <0x00fea1f3> (a java.lang.Object)
      at sun.misc.Unsafe.park(Unsafe.java:299)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
      at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:777)
      at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:489)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
      at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
      at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
      at com.android.server.MountService.handleSystemReady(MountService.java:776)
      - locked <0x0eeb54f1> (a java.lang.Object)
      at com.android.server.MountService.access$500(MountService.java:152)
      at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
      at android.os.Handler.dispatchMessage(Handler.java:102)
      at android.os.Looper.loop(Looper.java:150)
      at android.os.HandlerThread.run(HandlerThread.java:61)
     
      和上面netd类似的,mountservice也是通过ndc和vold通信,这里我们需要继续查看是否vold存在异常。
      之前提到过,类似这种同步锁block的,dvm_lock_sample一定会有打印,于是先去找eventslog,不过这个是华为的log,是没有eventlog的。
      而华为实现了一个blockMonitor的功能,和dvm_lock_sample类似,当某个操作特别耗时的时候,会将其打印出来:
      在ANR的附近找到如下:
     07-19 10:17:50.739 25271 25271 W BlockMonitor: The binder calling took 55209ms.
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BlockMonitor.checkBinderTime(BlockMonitor.java:141)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BinderProxy.transact(Binder.java:511)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment.getExternalStorageDirectory(Environment.java:354)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.parsePathStrategy(FileProvider.java:583)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.getPathStrategy(FileProvider.java:534)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.attachInfo(FileProvider.java:352)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installProvider(ActivityThread.java:5279)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.access$1600(ActivityThread.java:165)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Handler.dispatchMessage(Handler.java:102)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Looper.loop(Looper.java:188)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.main(ActivityThread.java:5578)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: java.lang.reflect.Method.invoke(Native Method)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
    07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
    再加上之前的:
    07-19 10:17:50.729 22735 23512 E NativeDaemonConnector.ResponseQueue: Timeout waiting for response
    07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response to 4 volume reset
    07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response mOutputStream = android.net.LocalSocketImpl$SocketOutputStream@922644, mSocket = vold
    07-19 10:17:50.731 22735 23512 W MountService: Failed to reset vold
    07-19 10:17:50.731 22735 23512 W MountService: com.android.server.NativeDaemonTimeoutException: command '4 volume reset' failed with 'null'
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:501)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.handleSystemReady(MountService.java:776)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService.access$500(MountService.java:152)
    07-19 10:17:50.731 22735 23512 W MountService: 	at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.Handler.dispatchMessage(Handler.java:102)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.Looper.loop(Looper.java:150)
    07-19 10:17:50.731 22735 23512 W MountService: 	at android.os.HandlerThread.run(HandlerThread.java:61)
    我们有理由去推断vold的状态是不对的,但是又没有vold的栈。
    那么全局grep vold试试能不能找到线索,在kernellog中发现了vold的异常:
    07-19 14:22:24.669 <3>[10772.492156] c0 Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
    07-19 14:22:24.669 <6>[10772.492217] c0 vold            R running      0   224      1 0x00000001
    07-19 14:22:24.669 <4>[10772.492278] c0 [] (__schedule+0x38c/0x5bc) from [] (schedule_timeout+0x18/0x1e8)
    07-19 14:22:24.669 <4>[10772.492309] c0 [] (schedule_timeout+0x18/0x1e8) from [] (wait_for_common+0x11c/0x164)
    07-19 14:22:24.669 <4>[10772.492309] c0 [] (wait_for_common+0x11c/0x164) from [] (mmc_wait_for_req+0xb4/0xe4)
    07-19 14:22:24.669 <4>[10772.492339] c0 [] (mmc_wait_for_req+0xb4/0xe4) from [] (mmc_wait_for_cmd+0x64/0x74)
    07-19 14:22:24.669 <4>[10772.492370] c0 [] (mmc_wait_for_cmd+0x64/0x74) from [] (mmc_send_status+0x6c/0x8c)
    07-19 14:22:24.670 <4>[10772.492400] c0 [] (mmc_send_status+0x6c/0x8c) from [] (sd_send_status+0x14/0x44)
    07-19 14:22:24.670 <4>[10772.492431] c0 [] (sd_send_status+0x14/0x44) from [] (mmc_lock_unlock_by_buf+0xac/0x168)
    07-19 14:22:24.670 <4>[10772.492431] c0 [] (mmc_lock_unlock_by_buf+0xac/0x168) from [] (mmc_lockable_store+0x594/0x75c)
    07-19 14:22:24.670 <4>[10772.492461] c0 [] (mmc_lockable_store+0x594/0x75c) from [] (dev_attr_store+0x18/0x24)
    07-19 14:22:32.070 <4>[10772.492492] c0 [] (dev_attr_store+0x18/0x24) from [] (sysfs_write_file+0x104/0x148)
    07-19 14:22:32.070 <4>[10772.492522] c0 [] (sysfs_write_file+0x104/0x148) from [] (vfs_write+0xd0/0x180)
    07-19 14:22:32.070 <4>[10772.492553] c0 [] (vfs_write+0xd0/0x180) from [] (SyS_write+0x38/0x68)
    07-19 14:22:32.071 <4>[10772.492583] c0 [] (SyS_write+0x38/0x68) from [] (ret_fast_syscall+0x0/0x30)
    vold一直在这个操作中没有退出来,所以不能响应客户端的请求,从而导致了ANR。
    这个问题需要mmc的同事进一步去分析,目前怀疑是SD卡发生了错误。
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77
    • 78
    • 79
    • 80
    • 81
    • 82
    • 83
    • 84
    • 85
    • 86
    • 87
    • 88
    • 89
    • 90
    • 91
    • 92
    • 93
    • 94
    • 95
    • 96
    • 97
    • 98
    • 99
    • 100
    • 101
    • 102
    • 103
    • 104
    • 105
    • 106
    • 107
    • 108
    • 109
    • 110
    • 111
    • 112
    • 113
    • 114
    • 115
    • 116
    • 117
    • 118
    • 119
    • 120
    • 121
    • 122
    • 123
    • 124
    • 125
    • 126
    • 127
    • 128
    • 129
    • 130
    • 131
    • 132
    • 133
    • 134
    • 135
    • 136
    • 137
    • 138
    • 139
    • 140
    • 141
    • 142
    • 143
    • 144
    • 145
    • 146
    • 147
    • 148
    • 149
    • 150
    • 151
    • 152
    • 153
    • 154
    • 155
    • 156
    • 157
    • 158
    • 159
    • 160
    • 161
    • 162
    • 163
    • 164
    • 165
    • 166
    • 167
    • 168
    • 169
    • 170
    • 171
    • 172
    • 173
    • 174
    • 175
    • 176
    • 177
    • 178
    • 179
    • 180

    5.8系统资源紧张造成的anr

    通常情况下,一般这种问题的表现的可能形式是

    1. traces.txt中的主线程的栈在一个非常common的操作中,如new一个变量,读取某个文件等
    2. ANR的trace中IOW相当高(IOW+CPU sys+usr=100)
    3. Kswapd非常活跃
    4. Low Ram 配置
    5. kernel log中 D状态的用户进程比较多,且都block在内存相关,且内存比较紧张.
    6. 连续出现ANR,slog和ams一直在dumptrace,造成比较大的IO压力
    7. 前台正在玩一个很大的游戏
    8. 某个应用正在安装做dex2oat.

    出现这种情况下,我们需要做的是:

    1. 针对性的调整lmk参数,减少后台的数量
    2. 检查是否有不应该常驻的进程
    3. 是否需要修改伪前台的adj,减少常驻内存
    4. 是否需要关闭dex2oat

    6 手机厂商增添的日志

    6.1 binder_sample

    1. 功能说明: 监控每个进程的主线程的binder transaction的耗时情况, 当超过阈值时,则输出相应的目标调用信息,默认1000ms打开。
    2. log格式: 52004 binder_sample (descriptor|3),(method_num|1|5),(time|1|3),(blocking_package|3),(sample_percent|1|6)
    3. log实例: 2754 2754 I binder_sample: [android.app.IActivityManager,35,2900,android.process.media,5]

    可以 看出

    1. 主线程2754;
    2. 执行android.app.IActivityManager接口
    3. 所对应方法code =35(即STOP_SERVICE_TRANSACTION),
    4. 所花费时间为2900ms.
    5. 该block所在package为 android.process.media,最后一个参数是sample比例(没有太大价值)

    6.2 dvm_lock_sample

    1. 功能说明: 当某个线程等待lock的时间blocked超过阈值,则输出当前的持锁状态
    2. log格式: 20003 dvm_lock_sample (process|3),(main|1|5),(thread|3),(time|1|3),(file|3),(line|1|5),(ownerfile|3),(ownerline|1|5),(sample_percent|1|6)
    3. log实例 dvm_lock_sample: [system_server,1,Binder_9,1500,ActivityManagerService.java,6403,-,1448,0]

    意思是:system_server: Binder_9,执行到ActivityManagerService.java的6403行代码,一直在等待AMS锁, 而该锁所同一文件的1448行代码所持有, 从而导致Binder_9线程被阻塞1500ms.

    6.3 binder starved

    • 功能说明: 当system_server等进程的线程池使用完, 无空闲线程时, 则binder通信都处于饥饿状态, 则饥饿状态超过一定阈值则输出信息;
    • 云控参数: persist.sys.binder.starvation (默认值16ms)
    • log实例 1232 1232 "binder thread pool (16 threads) starved for 100 ms"
    • 意思是 system_server进程的 线程池已满的持续长达100ms

    7 总结

    1. trace中的线程状态
      1. main线程处于 BLOCK、WAITING、TIMEWAITING状态,那基本上是函数阻塞导致ANR
      2. main线程无异常,则应该排查CPU负载和内存环境
  • 相关阅读:
    《HTML表单》
    大数据必学Java基础(十):标识符和关键字
    可信执行环境(Tee)入门综述
    【国科大卜算】Moving Tables
    Day 53 前端开发 jQuery
    html静态网站基于HTML+CSS+JavaScript上海美食介绍网站网页设计与实现共计5个页面
    数据开源 | Magic Data中英文座舱功能点泛化语料
    LeetCode[946]验证栈序列
    PCI bar 解析
    Haproxy
  • 原文地址:https://blog.csdn.net/gangjindianzi/article/details/134067807