• S3 Client: Tracking Down "Timeout waiting for connection from pool"


    Contents

    Background

    Troubleshooting process

    Preliminary work

    Collecting metrics

    Resolving the problem

    Summary


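    Background

    While processing files stored on Ceph, the S3 client failed with the error named in the title: Timeout waiting for connection from pool.

    Troubleshooting process

    A search on GitHub turned up a related issue. In that issue the connection pool was exhausted because Ceph was being read in parallel. Our logic is serial, so that explanation did not apply. The issue did, however, suggest looking at the connection pool attributes on a monitoring dashboard to help with diagnosis. Since we have no CloudWatch dependency locally, the only option was to observe those attributes on our own monitoring dashboard.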

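    Preliminary work

    The relevant documentation points to the metrics class

    com.amazonaws.util.AWSRequestMetricsFullSupport

    Metrics are recorded in the following method

    com.amazonaws.http.AmazonHttpClient.RequestExecutor#executeOneRequest

    and are handed over to the collector by the following method

    com.amazonaws.AmazonWebServiceClient#endClientExecution(com.amazonaws.util.AWSRequestMetrics, com.amazonaws.Request, com.amazonaws.Response, boolean)

    The connection pool information is written into the metrics counters by the following method

    com.amazonaws.http.AmazonHttpClient.RequestExecutor#captureConnectionPoolMetrics

    Taking getObjectRequest as an example, the sequence diagram is shown below.

    The classes marked in red in the diagram can carry custom logic, for example collecting metrics and uploading them to Prometheus for troubleshooting.

    Whether metrics are enabled is decided by the following method. If the request or the client has a custom metric collector, AWSRequestMetricsFullSupport is used as the metrics implementation; otherwise the default AWSRequestMetrics is used.

    com.amazonaws.AmazonWebServiceClient#createExecutionContext(com.amazonaws.AmazonWebServiceRequest, com.amazonaws.internal.auth.SignerProvider)

    The collector is looked up by the following method, with priority request > client > SDK. The default implementation does nothing (RequestMetricCollector#NONE).

    com.amazonaws.AmazonWebServiceClient#findRequestMetricCollector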
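The lookup order described above can be modeled in a few lines. This is a simplified sketch based only on the description in this article, not the SDK's source code: `Collector`, `NONE`, `find` and `metricsEnabled` are hypothetical names standing in for `RequestMetricCollector`, `RequestMetricCollector#NONE`, `findRequestMetricCollector` and the check made when the execution context is created.

```java
/** Simplified model of the collector lookup; every name here is illustrative, not an SDK class. */
final class CollectorLookupSketch {

    /** Stand-in for RequestMetricCollector. */
    interface Collector {
        String name();

        default boolean isEnabled() {
            return true;
        }
    }

    /** Stand-in for the do-nothing default (RequestMetricCollector#NONE). */
    static final Collector NONE = new Collector() {
        @Override
        public String name() {
            return "NONE";
        }

        @Override
        public boolean isEnabled() {
            return false;
        }
    };

    static Collector named(String name) {
        return () -> name;
    }

    /** Request-level collector wins, then client-level, then the SDK-wide default. */
    static Collector find(Collector requestLevel, Collector clientLevel, Collector sdkLevel) {
        if (requestLevel != null) {
            return requestLevel;
        }
        if (clientLevel != null) {
            return clientLevel;
        }
        return sdkLevel != null ? sdkLevel : NONE;
    }

    /** Full metrics are only recorded when the resolved collector is enabled. */
    static boolean metricsEnabled(Collector requestLevel, Collector clientLevel, Collector sdkLevel) {
        return find(requestLevel, clientLevel, sdkLevel).isEnabled();
    }

    public static void main(String[] args) {
        Collector perRequest = named("request");
        Collector perClient = named("client");
        System.out.println(find(perRequest, perClient, null).name()); // request
        System.out.println(find(null, perClient, null).name());       // client
        System.out.println(metricsEnabled(null, null, null));         // false
    }
}
```

The practical consequence is that registering a collector at either level is enough to switch on the detailed metrics for the affected requests.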

    Collecting metrics

    As shown above, the collector logic can be added once, for all requests, when building the client

    s3Client = AmazonS3ClientBuilder.standard()
            .withMetricsCollector(new RequestMetricCollector() {
                @Override
                public void collectMetrics(Request<?> request, Response<?> response) {
                    // Inspect whatever the SDK recorded for this request.
                    AWSRequestMetrics awsRequestMetrics = request.getAWSRequestMetrics();
                    TimingInfo timingInfo = awsRequestMetrics.getTimingInfo();
                    Map<String, Number> allCounters = timingInfo.getAllCounters();
                    Map<String, List<TimingInfo>> subMeasurementsByName = timingInfo.getSubMeasurementsByName();
                    System.out.println("finish");
                }
            })
            .build();

    Metrics can also be collected for one specific request

    client.getObject(new GetObjectRequest(bucketName, fileName)
            .withRequestMetricCollector(new RequestMetricCollector() {
                @Override
                public void collectMetrics(Request<?> request, Response<?> response) {
                    // Per-request collection logic goes here.
                }
            }));

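    This exposes the following metrics; see the detailed explanation of each.

    Collect and report whichever of them the business needs.

    Below is a demo collector.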

    import static com.amazonaws.util.AWSRequestMetrics.Field.HttpClientPoolAvailableCount;
    import static com.amazonaws.util.AWSRequestMetrics.Field.HttpClientPoolLeasedCount;
    import static com.amazonaws.util.AWSRequestMetrics.Field.HttpClientPoolPendingCount;

    import com.amazonaws.AmazonWebServiceClient;
    import com.amazonaws.AmazonWebServiceRequest;
    import com.amazonaws.Request;
    import com.amazonaws.Response;
    import com.amazonaws.http.AmazonHttpClient;
    import com.amazonaws.http.apache.client.impl.ConnectionManagerAwareHttpClient;
    import com.amazonaws.metrics.RequestMetricCollector;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.util.AWSRequestMetrics;
    import com.amazonaws.util.TimingInfo;
    import io.prometheus.client.Counter;
    import io.prometheus.client.Gauge;
    import java.lang.reflect.Field;
    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Set;
    import java.util.stream.Collectors;
    import lombok.extern.slf4j.Slf4j;
    import org.apache.http.pool.ConnPoolControl;
    import org.apache.http.pool.PoolStats;
    // SpringContextUtil is a project-specific helper; import it from wherever it lives in your code base.

    @Slf4j
    public class S3MetricCollector extends RequestMetricCollector {
        // Sub-measurements (timings) that are exported per request type.
        public static final Set<String> DEFAULT_TIME_INFO = new HashSet<>();

        static {
            DEFAULT_TIME_INFO.add("HttpRequestTime");
            DEFAULT_TIME_INFO.add("HttpClientReceiveResponseTime");
        }

        private static volatile String INSTANCE_NAME = null;
        private final Gauge httpGauge;
        private final Gauge connectionGauge;
        private final Counter counter;
        private static final S3MetricCollector INSTANCE = new S3MetricCollector();
        // Request types seen on the current thread since the last finish() call.
        private static final ThreadLocal<Set<String>> TYPE_SET = ThreadLocal.withInitial(HashSet::new);

        private S3MetricCollector() {
            httpGauge = Gauge.build().name("s3_client_http_info")
                    .labelNames("machine_id", "env", "counter")
                    .help("Ceph client troubleshooting").register();
            connectionGauge = Gauge.build().name("s3_client_connection_info")
                    .labelNames("machine_id", "env", "properties", "type")
                    .help("Connection timings").register();
            counter = Counter.build("s3_client_request", "S3 client response status")
                    .labelNames("machine_id", "env", "status", "type").register();
        }

        public static S3MetricCollector getINSTANCE() {
            return INSTANCE;
        }

        public static void finish(AmazonS3 s3Client) {
            getINSTANCE().doFinish(s3Client);
        }

        private void doFinish(AmazonS3 s3Client) {
            String env = getEnv();
            // Use the getter so the name is initialized even if collectMetrics has not run yet.
            String instanceName = getInstanceName();
            try {
                // Reach the pooled Apache HTTP client through reflection to read the live pool stats.
                Field field = AmazonWebServiceClient.class.getDeclaredField("client");
                field.setAccessible(true);
                AmazonHttpClient amazonHttpClient = (AmazonHttpClient) field.get(s3Client);
                Field httpClient = AmazonHttpClient.class.getDeclaredField("httpClient");
                httpClient.setAccessible(true);
                ConnectionManagerAwareHttpClient o = (ConnectionManagerAwareHttpClient) httpClient.get(amazonHttpClient);
                ConnPoolControl<?> httpClientConnectionManager = (ConnPoolControl<?>) o.getHttpClientConnectionManager();
                PoolStats totalStats = httpClientConnectionManager.getTotalStats();
                httpGauge.labels(instanceName, env, HttpClientPoolAvailableCount.name()).set(totalStats.getAvailable());
                httpGauge.labels(instanceName, env, HttpClientPoolLeasedCount.name()).set(totalStats.getLeased());
                httpGauge.labels(instanceName, env, HttpClientPoolPendingCount.name()).set(totalStats.getPending());
                // Reset the timing gauges for the request types handled on this thread.
                for (String s : DEFAULT_TIME_INFO) {
                    for (String type : TYPE_SET.get()) {
                        connectionGauge.labels(instanceName, env, s, type).set(0.0);
                    }
                }
            } catch (IllegalAccessException | NoSuchFieldException e) {
                throw new RuntimeException(e);
            } finally {
                TYPE_SET.remove();
            }
        }

        @Override
        public void collectMetrics(Request<?> request, Response<?> response) {
            AWSRequestMetrics awsRequestMetrics = request.getAWSRequestMetrics();
            TimingInfo timingInfo = awsRequestMetrics.getTimingInfo();
            Map<String, Number> allCounters = timingInfo.getAllCounters();
            String env = getEnv();
            String instanceName = getInstanceName();
            String type = getValue(awsRequestMetrics.getProperty(AWSRequestMetrics.Field.RequestType));
            for (Map.Entry<String, Number> entry : allCounters.entrySet()) {
                httpGauge.labels(instanceName, env, entry.getKey()).set(entry.getValue().doubleValue());
            }
            // withInitial() guarantees a non-null set for every thread.
            TYPE_SET.get().add(type);
            for (String s : DEFAULT_TIME_INFO) {
                TimingInfo subMeasurement = timingInfo.getSubMeasurement(s);
                // A missing or unfinished measurement is reported as 0.
                double millis = Optional.ofNullable(subMeasurement)
                        .map(TimingInfo::getTimeTakenMillisIfKnown)
                        .orElse(0.0);
                connectionGauge.labels(instanceName, env, s, type).set(millis);
            }
            // The response can be null when the request failed before an HTTP response arrived.
            String status = (response == null || response.getHttpResponse() == null)
                    ? "unknown"
                    : String.valueOf(response.getHttpResponse().getStatusCode());
            counter.labels(instanceName, env, status, type).inc();
        }

        private static String getEnv() {
            String env = "local";
            try {
                String active = SpringContextUtil.getProperties("spring.profiles.active");
                if (active != null) {
                    env = active;
                }
            } catch (Exception e) {
                // Spring context not available (for example in a local run); keep the default.
            }
            return env;
        }

        public static void setInfo(AmazonWebServiceRequest serviceRequest) {
            serviceRequest.withRequestMetricCollector(S3MetricCollector.getINSTANCE());
        }

        public String getValue(List<Object> values) {
            if (values == null) {
                return "";
            }
            return values.stream().map(Object::toString).collect(Collectors.joining(","));
        }

        // Lazily resolves "host:port" once, using double-checked locking on the volatile field.
        protected static String getInstanceName() {
            if (INSTANCE_NAME == null) {
                synchronized (S3MetricCollector.class) {
                    if (INSTANCE_NAME == null) {
                        String port = SpringContextUtil.getProperties("server.port");
                        try {
                            InetAddress addr = InetAddress.getLocalHost();
                            INSTANCE_NAME = addr.getHostName() + ":" + port;
                        } catch (UnknownHostException e) {
                            log.error("get host error", e);
                            INSTANCE_NAME = "unknown:" + port;
                        }
                    }
                }
            }
            return INSTANCE_NAME;
        }
    }
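    After these metrics were collected, reported to Prometheus and viewed in Grafana, it was confirmed that too many connections were in use and had filled the connection pool (default maximum: 50), so later attempts to obtain a connection timed out. An HTTP connection is obtained in the following method

    org.apache.http.pool.AbstractConnPool#getPoolEntryBlocking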
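To make the failure mode concrete, here is a toy model of a bounded pool with a blocking, time-limited lease. It is only a sketch of the general behaviour, not how `AbstractConnPool` is implemented: `BoundedPoolSketch` and its methods are hypothetical, and a `Semaphore` stands in for the pool's bookkeeping. It shows that even strictly serial code exhausts the pool if leases are never returned.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy bounded pool: leases that are never released eventually starve later callers. */
final class BoundedPoolSketch {
    private final Semaphore permits;
    private final int maxTotal;

    BoundedPoolSketch(int maxTotal) {
        this.maxTotal = maxTotal;
        this.permits = new Semaphore(maxTotal, true);
    }

    /** Waits up to timeoutMillis for a free slot, then gives up with a timeout. */
    void lease(long timeoutMillis) throws TimeoutException, InterruptedException {
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("Timeout waiting for connection from pool");
        }
    }

    void release() {
        permits.release();
    }

    /** Number of slots currently handed out. */
    int leased() {
        return maxTotal - permits.availablePermits();
    }

    public static void main(String[] args) throws Exception {
        BoundedPoolSketch pool = new BoundedPoolSketch(50); // 50 mirrors the default mentioned above
        for (int i = 0; i < 50; i++) {
            pool.lease(10); // serial "requests" that never give their connection back
        }
        try {
            pool.lease(10); // the 51st request finds no free slot
        } catch (TimeoutException e) {
            System.out.println(e.getMessage() + " (leased=" + pool.leased() + ")");
        }
    }
}
```

A leased count that only ever climbs, as in this loop, is the signature to look for on the dashboard.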

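    Resolving the problem

    The dashboard showed the number of leased connections growing steadily, so the next step was to find out why so many connections were being held. The causes known so far are:

    1. Concurrency is high and the connection pool is genuinely exhausted

    2. Some connections are never released

    This was clearly the second case. We located the point in time at which the leased connection count started to climb and checked the application logs for that period. They showed an exception that was not being handled, which left the stream unclosed and the connection permanently occupied.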
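The article does not show the faulty code, so the following is only an illustration of this class of bug and its usual fix. `FakeObject` is a hypothetical stand-in for an object whose open stream holds a pooled connection; in SDK v1, `S3Object` implements `Closeable`, so the same try-with-resources pattern applies to the result of `getObject`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Shows how an unhandled exception can skip close() and how try-with-resources prevents it. */
final class LeaseCloseSketch {
    /** Counts objects that are currently open, like the pool's "leased" gauge. */
    static final AtomicInteger LEASED = new AtomicInteger();

    /** Hypothetical stand-in for an S3 object whose stream holds a pooled connection. */
    static final class FakeObject implements Closeable {
        FakeObject() {
            LEASED.incrementAndGet();
        }

        void read(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated processing failure");
            }
        }

        @Override
        public void close() {
            LEASED.decrementAndGet();
        }
    }

    /** Leaky: if read() throws, close() is skipped and the connection is never returned. */
    static void processLeaky(boolean fail) throws IOException {
        FakeObject obj = new FakeObject();
        obj.read(fail);
        obj.close();
    }

    /** Safe: try-with-resources closes the object on both the normal and the exceptional path. */
    static void processSafe(boolean fail) throws IOException {
        try (FakeObject obj = new FakeObject()) {
            obj.read(fail);
        }
    }

    public static void main(String[] args) {
        try {
            processLeaky(true);
        } catch (IOException e) {
            System.out.println("after leaky failure, leased=" + LEASED.get()); // 1
        }
        try {
            processSafe(true);
        } catch (IOException e) {
            // Still 1: the safe path released its own object, only the earlier leak remains.
            System.out.println("after safe failure, leased=" + LEASED.get());
        }
    }
}
```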

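    As the chart below shows, the problem no longer occurred after the fix.

    Summary

    The S3 client provides a fairly rich set of metrics to help with troubleshooting. By reporting custom metrics, pinpointing the time at which the anomaly started and then checking the logs for that period, the problem was solved with relatively little effort. The same metrics can also drive alerts, for example on request growth rate, leased connection count or slow operations.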
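As one possible shape for such alerts, the checks below work on sampled values of the leased connection count. This is a hedged sketch, not something taken from the original setup: the class, method names and thresholds are all hypothetical, and in practice the same conditions would more likely be written as alerting rules in the monitoring system.

```java
/** Hypothetical alert checks over sampled pool metrics; thresholds are illustrative only. */
final class PoolAlertSketch {

    /** True when the latest sample uses at least the given fraction of the pool. */
    static boolean nearExhaustion(int leased, int maxTotal, double ratio) {
        return maxTotal > 0 && leased >= maxTotal * ratio;
    }

    /** True when the leased count never drops across the samples and ends higher than it started. */
    static boolean looksLikeLeak(int[] leasedSamples) {
        if (leasedSamples.length < 2) {
            return false;
        }
        for (int i = 1; i < leasedSamples.length; i++) {
            if (leasedSamples[i] < leasedSamples[i - 1]) {
                return false;
            }
        }
        return leasedSamples[leasedSamples.length - 1] > leasedSamples[0];
    }

    public static void main(String[] args) {
        System.out.println(nearExhaustion(45, 50, 0.8));                 // true
        System.out.println(looksLikeLeak(new int[] {3, 7, 7, 12, 20}));  // true
        System.out.println(looksLikeLeak(new int[] {3, 7, 2, 5, 1}));    // false
    }
}
```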

    Original article: https://blog.csdn.net/weixin_37703281/article/details/127815753