• VictoriaMetrics cannot discover a scrape target


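    Problem description

    I recently deployed a service in a new environment. It exposes its metrics at :10299/metrics, and its scrape configuration is as follows (the name fields have been changed):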

    apiVersion: v1
    items:
    - apiVersion: operator.victoriametrics.com/v1beta1
      kind: VMServiceScrape
      metadata:
        labels:
          app_id: audit
        name: audit
        namespace: default
      spec:
        endpoints:
        - path: /metrics
          targetPort: 10299
        namespaceSelector:
          matchNames:
          - default
        selector:
          matchLabels:
            app_id: audit
    

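    However, checking its status in vmagent showed that vmagent could not discover the target.

    General troubleshooting steps

    1. Make sure the service itself works, i.e. the metrics can be fetched from ${podIp}:10299/metrics.
    2. Make sure the vmservicescrape --> service --> endpoints chain is intact, i.e. the configured selector fields match the intended resources.
    3. Make sure the vmservicescrape is well formed. Note: a malformed vmservicescrape resource may prevent vmagent from loading its configuration; step 5 will reveal this.
    4. Make sure vmagent is allowed to discover targets in that namespace.
    5. Trigger a reload in the vmagent UI and check the vmagent logs for related errors.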
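
    Step 2 boils down to label matching. The following is a minimal sketch of matchLabels semantics, not code from the operator; the function name matchLabels and the Service labels are made up for illustration.

```go
package main

import "fmt"

// matchLabels reports whether every key/value pair in selector is present in
// labels, which is how a Kubernetes matchLabels selector behaves. An empty
// selector matches everything.
func matchLabels(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app_id": "audit"}

	// Hypothetical Service labels, for illustration only.
	matching := map[string]string{"app_id": "audit", "tier": "backend"}
	other := map[string]string{"app_id": "billing"}

	fmt.Println(matchLabels(selector, matching)) // true
	fmt.Println(matchLabels(selector, other))    // false
}
```

    If the selector of the vmservicescrape does not match the Service labels in this sense, the chain is broken at its first link.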

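    None of these checks resolved the problem. Stranger still, the target did not appear in vmagent's api/v1/targets at all, which means vmagent never discovered the service, i.e. the vmservicescrape configuration had no effect. The configuration that vmagent generated from the vmservicescrape above (it is concatenated with the static configuration) is shown below; it uses kubernetes_sd_configs to discover targets: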

    - job_name: serviceScrape/default/audit/0
      metrics_path: /metrics
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_id]
        regex: audit
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "10299"
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        target_label: node
        regex: Node;(.*)
        replacement: ${1}
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        target_label: pod
        regex: Pod;(.*)
        replacement: ${1}
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_service_name]
        target_label: job
        replacement: ${1}
      - target_label: endpoint
        replacement: "8080"
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          own_namespace: false
          names:
          - default
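
    The two keep rules at the top of relabel_configs decide whether a discovered target survives. The sketch below shows how such a rule could be evaluated. It assumes Prometheus-style relabeling semantics (the regex is anchored at both ends and a missing label counts as an empty string); the function keep and the two label sets are hypothetical, not taken from vmagent.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keep reports whether a target with the given labels passes a "keep" rule.
// Assumption: the regex is anchored on both ends and a missing source label
// is treated as an empty string, as in Prometheus-style relabeling.
func keep(labels map[string]string, sourceLabels []string, separator, regex string) bool {
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name])
	}
	re := regexp.MustCompile("^(?:" + regex + ")$")
	return re.MatchString(strings.Join(values, separator))
}

func main() {
	portLabel := []string{"__meta_kubernetes_pod_container_port_number"}

	// Hypothetical target discovered from a container port 8080.
	port8080 := map[string]string{
		"__meta_kubernetes_service_label_app_id":      "audit",
		"__meta_kubernetes_pod_container_port_number": "8080",
	}
	// Hypothetical target discovered from a container port 10299.
	port10299 := map[string]string{
		"__meta_kubernetes_service_label_app_id":      "audit",
		"__meta_kubernetes_pod_container_port_number": "10299",
	}

	fmt.Println(keep(port8080, portLabel, ";", "10299"))  // false
	fmt.Println(keep(port10299, portLabel, ";", "10299")) // true
}
```

    Under these assumptions a target is dropped unless its __meta_kubernetes_pod_container_port_number label is exactly 10299, so where that label comes from matters.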
    

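    Code analysis

    Since the configuration looked correct, the only option left was to look at how VictoriaMetrics implements kubernetes_sd_configs and find out where things went wrong. In the VictoriaMetrics source, the target URL is assembled as follows: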

    scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)
    

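    Where:

    • schemeRelabeled: defaults to http
    • metricsPathRelabeled: the metrics_path field of the generated configuration
    • optionalQuestion and paramsStr: not configured here, so they can be ignored

    The key field is addressRelabeled, which comes from a label named "__address__":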

    func mergeLabels(swc *scrapeWorkConfig, target string, extraLabels, metaLabels map[string]string) []prompbmarshal.Label {
    	...
    	m["job"] = swc.jobName
    	m["__address__"] = target
    	m["__scheme__"] = swc.scheme
    	m["__metrics_path__"] = swc.metricsPath
    	m["__scrape_interval__"] = swc.scrapeInterval.String()
    	m["__scrape_timeout__"] = swc.scrapeTimeout.String()
    	...
    }
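
    Putting the two snippets together, the scrape URL is derived from the "__scheme__", "__address__" and "__metrics_path__" labels. A simplified sketch of that assembly follows; buildScrapeURL and the pod IP are invented for illustration, and the handling of optionalQuestion is an assumption rather than the exact VictoriaMetrics logic.

```go
package main

import "fmt"

// buildScrapeURL mimics the fmt.Sprintf call quoted above. The function name
// and the label map are illustrative and not part of VictoriaMetrics.
func buildScrapeURL(labels map[string]string, params string) string {
	// Assumption: a "?" is only needed when query parameters are present.
	optionalQuestion := ""
	if params != "" {
		optionalQuestion = "?"
	}
	return fmt.Sprintf("%s://%s%s%s%s",
		labels["__scheme__"], labels["__address__"], labels["__metrics_path__"],
		optionalQuestion, params)
}

func main() {
	labels := map[string]string{
		"__scheme__":       "http",
		"__address__":      "10.0.0.12:10299", // hypothetical pod IP
		"__metrics_path__": "/metrics",
	}
	fmt.Println(buildScrapeURL(labels, "")) // http://10.0.0.12:10299/metrics
}
```

    Everything except "__address__" is fixed by the generated configuration, which is why the rest of the analysis follows that one label.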
    

    Tracing further, the label is obtained through sc.KubernetesSDConfigs[i].MustStart. As the name KubernetesSDConfigs suggests, this is the part that handles the kubernetes_sd_configs mechanism:

    func (sc *ScrapeConfig) mustStart(baseDir string) {
    	swosFunc := func(metaLabels map[string]string) interface{} {
    		target := metaLabels["__address__"]
    		sw, err := sc.swc.getScrapeWork(target, nil, metaLabels)
    		if err != nil {
    			logger.Errorf("cannot create kubernetes_sd_config target %q for job_name %q: %s", target, sc.swc.jobName, err)
    			return nil
    		}
    		return sw
    	}
    	for i := range sc.KubernetesSDConfigs {
    		sc.KubernetesSDConfigs[i].MustStart(baseDir, swosFunc)
    	}
    }
    

    Going further down to see what this "__address__" field actually is, the call chain is:

    MustStart --> cfg.aw.mustStart --> aw.gw.startWatchersForRole --> uw.reloadScrapeWorksForAPIWatchersLocked --> o.getTargetLabels

    The last function, getTargetLabels, is an interface method:

    type object interface {
    	key() string
    
    	// getTargetLabels must be called under gw.mu lock.
    	getTargetLabels(gw *groupWatcher) []map[string]string
    }
    

    Each role of kubernetes_sd_configs has its own implementation of getTargetLabels. The service above uses the endpoints role, whose implementation is as follows:

    func (eps *Endpoints) getTargetLabels(gw *groupWatcher) []map[string]string {
    	var svc *Service
    	if o := gw.getObjectByRoleLocked("service", eps.Metadata.Namespace, eps.Metadata.Name); o != nil {
    		svc = o.(*Service)
    	}
    	podPortsSeen := make(map[*Pod][]int)
    	var ms []map[string]string
    	for _, ess := range eps.Subsets {
    		for _, epp := range ess.Ports {
    			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.Addresses, epp, svc, "true")
    			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.NotReadyAddresses, epp, svc, "false")
    		}
    	}
    	// See https://kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
    	// and https://github.com/kubernetes/kubernetes/pull/99975
    	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
    	case "truncated":
    		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
    	case "warning":
    		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
    	}
    
    	// Append labels for skipped ports on seen pods.
    	portSeen := func(port int, ports []int) bool {
    		for _, p := range ports {
    			if p == port {
    				return true
    			}
    		}
    		return false
    	}
    	for p, ports := range podPortsSeen {
    		for _, c := range p.Spec.Containers {
    			for _, cp := range c.Ports {
    				if portSeen(cp.ContainerPort, ports) {
    					continue
    				}
    				addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
    				m := map[string]string{
    					"__address__": addr,
    				}
    				p.appendCommonLabels(m)
    				p.appendContainerLabels(m, c, &cp)
    				if svc != nil {
    					svc.appendCommonLabels(m)
    				}
    				ms = append(ms, m)
    			}
    		}
    	}
    	return ms
    }
    

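    As the code shows, "__address__" is simply p.Status.PodIP joined with cp.ContainerPort, where p is the Kubernetes pod data structure. This requires that:

    1. The pod is in the Running state and has been assigned a PodIP.
    2. The port that exposes the metrics target is declared in p.Spec.Containers[].ports[].ContainerPort.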
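
    The two requirements can be illustrated with a small sketch that derives "__address__" values from a pod. The pod type, the addressesFromPod function and the IP address are hypothetical stand-ins; the early return for an empty IP reflects requirement 1 and is an assumption of this sketch, not something shown in the code above.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// pod is a trimmed-down, hypothetical stand-in for the Kubernetes pod fields
// read by the loop above.
type pod struct {
	podIP          string
	containerPorts []int
}

// addressesFromPod returns one "__address__" value per declared container port.
func addressesFromPod(p pod) []string {
	if p.podIP == "" {
		return nil // no IP assigned yet, so there is nothing to scrape
	}
	addrs := make([]string, 0, len(p.containerPorts))
	for _, port := range p.containerPorts {
		addrs = append(addrs, net.JoinHostPort(p.podIP, strconv.Itoa(port)))
	}
	return addrs
}

func main() {
	// A pod that declares only port 8080 never yields an address for 10299.
	fmt.Println(addressesFromPod(pod{podIP: "10.0.0.12", containerPorts: []int{8080}}))
	// Once 10299 is declared as well, the metrics address appears.
	fmt.Println(addressesFromPod(pod{podIP: "10.0.0.12", containerPorts: []int{8080, 10299}}))
}
```

    A port the application listens on but does not declare in the pod spec is invisible to this kind of discovery.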

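    Resolution

    Based on this analysis, I checked the deployment in the environment and found that it declared only port 8080 and not the metrics port 10299. Adding the missing port declaration solved the problem. The deployment as found: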

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app_id: audit
      name: audit
      namespace: default
    spec:
      ...
      template:
        metadata:
          ...
        spec:
          containers:
          - env:
            - name: APP_ID
              value: audit
            ports:
            - containerPort: 8080
              protocol: TCP
              ...
    

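    Summary

    kubernetes_sd_configs works by list-watching the Kubernetes objects for the configured role and assembling each target's __address__ from them. It also exposes additional meta labels, such as: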

    • __meta_kubernetes_endpoint_hostname: Hostname of the endpoint.
    • __meta_kubernetes_endpoint_node_name: Name of the node hosting the endpoint.
    • __meta_kubernetes_endpoint_ready: Set to true or false for the endpoint's ready state.
    • __meta_kubernetes_endpoint_port_name: Name of the endpoint port.
    • __meta_kubernetes_endpoint_port_protocol: Protocol of the endpoint port.
    • __meta_kubernetes_endpoint_address_target_kind: Kind of the endpoint address target.
    • __meta_kubernetes_endpoint_address_target_name: Name of the endpoint address target.
  • Original article: https://www.cnblogs.com/charlieroro/p/16245402.html