Building a Monitoring and Alerting Platform for a Hadoop Ecosystem Cluster


    Table of Contents

    1. Deploying the Prometheus environment

    1.1 Download the installation packages

    1.2 Extract and install

    1.3 Edit the configuration files

    1.3.1 hadoop-env.sh

    1.3.2 prometheus_config.yml

    1.3.3 zkServer.sh

    1.3.4 prometheus_zookeeper.yaml

    1.3.5 alertmanager.yml

    1.3.6 prometheus.yml

    1.3.7 config.yml

    1.3.8 template.tmpl

    1.3.9 Alert rules

    1.3.10 /etc/profile

    1.3.11 prometheus_spark.yml

    1.3.12 spark-env.sh

    1.3.13 hive

    1.3.14 prometheus_metastore.yaml and prometheus_hs2.yaml

    1.4 Create the systemd services

    1.4.1 Create the prometheus user

    1.4.2 alertmanager.service

    1.4.3 prometheus.service

    1.4.4 node_exporter.service

    1.4.5 pushgateway.service

    1.4.6 grafana.service

    1.5 Start the services

    2. Supplementary notes

    2.1 Alert rule and Grafana dashboard downloads

    2.2 Process watchdog scripts

    2.2.1 prometheus-webhook-dingtalk

    2.2.2 hive

    2.2.3 Log rotation

    2.3 Handy commands

    2.4 References

    2.5 Intranet deployment

    2.5.1 nginx (public network)

    2.5.2 config

    2.5.3 /etc/hosts


    Hadoop cluster layout: see the "Hadoop YARN HA cluster installation and deployment" tutorial (Stars.Sky, CSDN blog).

    Spark cluster layout: see the "Spark-3.2.4 high-availability cluster installation and deployment" tutorial (Stars.Sky, CSDN blog).

    IP               Hostname   Roles
    192.168.170.136  hadoop01   namenode datanode resourcemanager nodemanager JournalNode DFSZKFailoverController QuorumPeerMain spark hive
    192.168.170.137  hadoop02   namenode datanode resourcemanager nodemanager JournalNode DFSZKFailoverController QuorumPeerMain spark
    192.168.170.138  hadoop03   datanode nodemanager JournalNode QuorumPeerMain spark

    1. Deploying the Prometheus environment

    1.1 Download the installation packages

    • prometheus, alertmanager, pushgateway, node_exporter: https://prometheus.io/download/

    • prometheus-webhook-dingtalk: https://github.com/timonwong/prometheus-webhook-dingtalk/tree/main

    • grafana: https://grafana.com/grafana/download

    • jmx_exporter: https://github.com/prometheus/jmx_exporter

    1.2 Extract and install

            Create a /monitor directory, extract every tar.gz package downloaded above into it, and rename each extracted directory to a short, version-free name (prometheus, alertmanager, pushgateway, node_exporter, grafana, prometheus-webhook-dingtalk); the jmx_prometheus_javaagent jar also goes straight into /monitor.
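
    As a sketch, assuming the archives were downloaded to /root and carry the usual upstream file names (all version numbers below are placeholders for whatever you actually downloaded), this step can look like:

    mkdir -p /monitor && cd /root
    # Extract each archive into /monitor, then drop the version suffix from the directory name.
    # Version numbers here are placeholders, not taken from the original article.
    tar -zxf prometheus-2.46.0.linux-amd64.tar.gz -C /monitor
    mv /monitor/prometheus-2.46.0.linux-amd64 /monitor/prometheus
    tar -zxf alertmanager-0.26.0.linux-amd64.tar.gz -C /monitor
    mv /monitor/alertmanager-0.26.0.linux-amd64 /monitor/alertmanager
    tar -zxf pushgateway-1.6.2.linux-amd64.tar.gz -C /monitor
    mv /monitor/pushgateway-1.6.2.linux-amd64 /monitor/pushgateway
    tar -zxf node_exporter-1.6.1.linux-amd64.tar.gz -C /monitor
    mv /monitor/node_exporter-1.6.1.linux-amd64 /monitor/node_exporter
    tar -zxf grafana-10.0.3.linux-amd64.tar.gz -C /monitor
    mv /monitor/grafana-10.0.3 /monitor/grafana
    tar -zxf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /monitor
    mv /monitor/prometheus-webhook-dingtalk-2.1.0.linux-amd64 /monitor/prometheus-webhook-dingtalk
    # The jmx_exporter agent jar is used as-is, no extraction needed
    cp jmx_prometheus_javaagent-0.19.0.jar /monitor/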

    1.3 Edit the configuration files

    1.3.1 hadoop-env.sh

    After editing, scp this file to every Hadoop node!

    [root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop/
    [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4/etc/hadoop]# vim hadoop-env.sh
    # Attach a jmx_exporter javaagent (one port per daemon) to every Hadoop process;
    # the grep guard keeps the agent from being appended twice if the file is sourced again.
    if ! grep -q jmx_prometheus_javaagent <<< "$HDFS_NAMENODE_OPTS"; then
      HDFS_NAMENODE_OPTS="$HDFS_NAMENODE_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30002:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$HDFS_DATANODE_OPTS"; then
      HDFS_DATANODE_OPTS="$HDFS_DATANODE_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30003:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$YARN_RESOURCEMANAGER_OPTS"; then
      YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30004:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$YARN_NODEMANAGER_OPTS"; then
      YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30005:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$HDFS_JOURNALNODE_OPTS"; then
      HDFS_JOURNALNODE_OPTS="$HDFS_JOURNALNODE_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30006:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$HDFS_ZKFC_OPTS"; then
      HDFS_ZKFC_OPTS="$HDFS_ZKFC_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30007:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$HDFS_HTTPFS_OPTS"; then
      HDFS_HTTPFS_OPTS="$HDFS_HTTPFS_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30008:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$YARN_PROXYSERVER_OPTS"; then
      YARN_PROXYSERVER_OPTS="$YARN_PROXYSERVER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30009:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
    if ! grep -q jmx_prometheus_javaagent <<< "$MAPRED_HISTORYSERVER_OPTS"; then
      MAPRED_HISTORYSERVER_OPTS="$MAPRED_HISTORYSERVER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30010:/bigdata/hadoop/server/hadoop-3.2.4/prometheus_config.yml"
    fi
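
    Once hadoop-env.sh has been copied out and the HDFS/YARN daemons restarted, each agent should serve metrics over plain HTTP on its assigned port. A quick spot check from any node (ports as mapped above; not every port exists on every host):

    for port in 30002 30003 30004 30005 30006 30007; do
        echo "== hadoop01:$port =="
        # Each jmx_exporter agent answers on /metrics
        curl -s "http://hadoop01:$port/metrics" | head -n 3
    done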

    1.3.2 prometheus_config.yml

    After editing, scp this file to every Hadoop node!

    [root@hadoop01 ~]# cd /bigdata/hadoop/server/hadoop-3.2.4/
    [root@hadoop01 /bigdata/hadoop/server/hadoop-3.2.4]# vim prometheus_config.yml
    rules:
      - pattern: ".*"

    1.3.3 zkServer.sh

    After editing, scp this file to every ZooKeeper node!

    [root@hadoop01 ~]# cd /bigdata/hadoop/zookeeper/zookeeper-3.7.1/bin/
    [root@hadoop01 /bigdata/hadoop/zookeeper/zookeeper-3.7.1/bin]# vim zkServer.sh
    if [ "x$JMXLOCALONLY" = "x" ]
    then
        JMXLOCALONLY=false
    fi
    # Attach the jmx_exporter javaagent to the ZooKeeper JVM on port 30011
    JMX_DIR="/monitor"
    JVMFLAGS="$JVMFLAGS -javaagent:$JMX_DIR/jmx_prometheus_javaagent-0.19.0.jar=30011:/bigdata/hadoop/zookeeper/zookeeper-3.7.1/prometheus_zookeeper.yaml"
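
    Once each node's ZooKeeper has been restarted with the new JVMFLAGS, the agent should answer on port 30011, for example:

    /bigdata/hadoop/zookeeper/zookeeper-3.7.1/bin/zkServer.sh restart
    # The rules in prometheus_zookeeper.yaml (next section) prefix every metric with zookeeper_
    curl -s http://hadoop01:30011/metrics | grep -m 5 '^zookeeper_'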

    1.3.4 prometheus_zookeeper.yaml

    After editing, scp this file to every ZooKeeper node!

    [root@hadoop01 ~]# cd /bigdata/hadoop/zookeeper/zookeeper-3.7.1/
    [root@hadoop01 /bigdata/hadoop/zookeeper/zookeeper-3.7.1]# vim prometheus_zookeeper.yaml
    rules:
      # replicated ZooKeeper
      - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+)><>(\\w+)"
        name: "zookeeper_$2"
        type: GAUGE
      - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+)><>(\\w+)"
        name: "zookeeper_$3"
        type: GAUGE
        labels:
          replicaId: "$2"
      - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(Packets\\w+)"
        name: "zookeeper_$4"
        type: COUNTER
        labels:
          replicaId: "$2"
          memberType: "$3"
      - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+)><>(\\w+)"
        name: "zookeeper_$4"
        type: GAUGE
        labels:
          replicaId: "$2"
          memberType: "$3"
      - pattern: "org.apache.ZooKeeperService<name0=ReplicatedServer_id(\\d+), name1=replica.(\\d+), name2=(\\w+), name3=(\\w+)><>(\\w+)"
        name: "zookeeper_$4_$5"
        type: GAUGE
        labels:
          replicaId: "$2"
          memberType: "$3"
      # standalone ZooKeeper
      - pattern: "org.apache.ZooKeeperService<name0=StandaloneServer_port(\\d+)><>(\\w+)"
        type: GAUGE
        name: "zookeeper_$2"
      - pattern: "org.apache.ZooKeeperService<name0=StandaloneServer_port(\\d+), name1=InMemoryDataTree><>(\\w+)"
        type: GAUGE
        name: "zookeeper_$2"

    1.3.5 alertmanager.yml 

    [root@hadoop01 ~]# cd /monitor/alertmanager/
    [root@hadoop01 /monitor/alertmanager]# ls
    alertmanager  alertmanager.yml  amtool  data  LICENSE  NOTICE
    [root@hadoop01 /monitor/alertmanager]# vim alertmanager.yml
    global:
      resolve_timeout: 5m

    templates:
      - '/monitor/prometheus-webhook-dingtalk/contrib/templates/legacy/*.tmpl'

    route:
      group_by: ['job', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'webhook1'

    receivers:
      - name: 'webhook1'
        webhook_configs:
          - url: 'http://192.168.170.136:8060/dingtalk/webhook1/send'
            send_resolved: true
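
    Before (re)starting Alertmanager, the bundled amtool can validate the file:

    cd /monitor/alertmanager
    ./amtool check-config alertmanager.yml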

    1.3.6 prometheus.yml 

    [root@hadoop01 ~]# cd /monitor/prometheus
    [root@hadoop01 /monitor/prometheus]# ls
    console_libraries  consoles  data  LICENSE  NOTICE  prometheus  prometheus.yml  promtool  rule
    [root@hadoop01 /monitor/prometheus]# vim prometheus.yml
    # my global config
    global:
      scrape_interval: 30s      # Set the scrape interval to every 30 seconds. Default is every 1 minute.
      evaluation_interval: 30s  # Evaluate rules every 30 seconds. The default is every 1 minute.

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['192.168.170.136:9093']

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rule/*.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      - job_name: "prometheus"
        scrape_interval: 30s
        static_configs:
          - targets: ["hadoop01:9090"]
      # zookeeper cluster
      - job_name: "zookeeper"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30011', 'hadoop02:30011', 'hadoop03:30011']
      # pushgateway
      - job_name: "pushgateway"
        scrape_interval: 30s
        static_configs:
          - targets: ["hadoop01:9091"]
      # node_exporter
      - job_name: "node_exporter"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:9100', 'hadoop02:9100', 'hadoop03:9100']
      - job_name: "namenode"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30002', 'hadoop02:30002']
          # labels:
          #   instance: namenode server
      - job_name: "datanode"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30003', 'hadoop02:30003', 'hadoop03:30003']
      - job_name: "resourcemanager"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30004', 'hadoop02:30004']
      - job_name: "nodemanager"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30005', 'hadoop02:30005', 'hadoop03:30005']
      - job_name: "journalnode"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30006', 'hadoop02:30006', 'hadoop03:30006']
      - job_name: "zkfc"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30007', 'hadoop02:30007']
      - job_name: "jobhistoryserver"
        scrape_interval: 30s
        static_configs:
          - targets: ["hadoop01:30010"]
      - job_name: "spark_master"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30012', 'hadoop02:30012']
      - job_name: "spark_worker"
        scrape_interval: 30s
        static_configs:
          - targets: ['hadoop01:30013', 'hadoop02:30013', 'hadoop03:30013']
      - job_name: "hive_metastore"
        scrape_interval: 30s
        static_configs:
          - targets: ["hadoop01:30014"]
      - job_name: "hive_hs2"
        scrape_interval: 30s
        static_configs:
          - targets: ["hadoop01:30015"]
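
    Once Prometheus is up (sections 1.4 and 1.5), its targets API is a quick way to confirm that every scrape job above is healthy. A sketch, assuming jq is installed:

    # Print job, instance and health for every active target
    curl -s http://hadoop01:9090/api/v1/targets \
      | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.labels.instance)\t\(.health)"' \
      | sort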

    1.3.7 config.yml 

    [root@hadoop01 ~]# cd /monitor/prometheus-webhook-dingtalk/
    [root@hadoop01 /monitor/prometheus-webhook-dingtalk]# ls
    config.example.yml  config.yml  contrib  LICENSE  nohup.out  prometheus-webhook-dingtalk
    [root@hadoop01 /monitor/prometheus-webhook-dingtalk]# vim config.yml
    ## Request timeout
    # timeout: 5s

    ## Uncomment following line in order to write template from scratch (be careful!)
    # no_builtin_template: true

    ## Customizable templates path
    templates:
      - /monitor/prometheus-webhook-dingtalk/contrib/templates/legacy/template.tmpl

    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    # default_message:
    #   title: '{{ template "legacy.title" . }}'
    #   text: '{{ template "legacy.content" . }}'

    ## Targets, previously was known as "profiles"
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=0d6c5dc25fa3f79cf2f83c92705fe4594dcxxx
        # secret for signature
        secret: SECecdbfff858ab8f3195dc34b7e225fee9341bc9xxx
        message:
          title: '{{ template "ops.title" . }}'
          text: '{{ template "ops.content" . }}'

    1.3.8 template.tmpl 

    [root@hadoop01 ~]# cd /monitor/prometheus-webhook-dingtalk/contrib/templates/legacy/
    [root@hadoop01 /monitor/prometheus-webhook-dingtalk/contrib/templates/legacy]# vim template.tmpl
    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    {{ end }}
    {{ define "__alert_list" }}{{ range . }}
    ---
    **Alert**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Host**: {{ .Labels.instance }}
    **Description**: {{ .Annotations.description }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    {{ define "__resolved_list" }}{{ range . }}
    ---
    **Alert**: {{ .Labels.alertname }}
    **Severity**: {{ .Labels.severity }}
    **Host**: {{ .Labels.instance }}
    **Fired at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    {{ define "ops.title" }}
    {{ template "__subject" . }}
    {{ end }}
    {{ define "ops.content" }}
    {{ if gt (len .Alerts.Firing) 0 }}
    **==== {{ .Alerts.Firing | len }} alert(s) firing ====**
    {{ template "__alert_list" .Alerts.Firing }}
    ---
    {{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}
    **==== {{ .Alerts.Resolved | len }} alert(s) resolved ====**
    {{ template "__resolved_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    {{ define "ops.link.title" }}{{ template "ops.title" . }}{{ end }}
    {{ define "ops.link.content" }}{{ template "ops.content" . }}{{ end }}
    {{ template "ops.title" . }}
    {{ template "ops.content" . }}

    1.3.9 Alert rules

    The rule files themselves can be downloaded via the link in section 2.1 below; a minimal illustrative example follows.
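
    For orientation only, the sketch below writes a small rule file into the rule/ directory that prometheus.yml already loads; the alert name, threshold and labels are made up here, not taken from the downloadable files:

    cat > /monitor/prometheus/rule/example.yml <<'EOF'
    groups:
      - name: example
        rules:
          # Fire when a node_exporter target has been unreachable for 1 minute
          - alert: HostDown
            expr: up{job="node_exporter"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              description: "{{ $labels.instance }} has been unreachable for 1 minute."
    EOF
    # Validate the new rule file
    /monitor/prometheus/promtool check rules /monitor/prometheus/rule/example.yml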

    1.3.10 /etc/profile

    After editing, scp this file to every Hadoop node!

    # JDK 1.8
    JAVA_HOME=/usr/java/jdk1.8.0_381
    PATH=$PATH:$JAVA_HOME/bin
    CLASSPATH=.:$JAVA_HOME/lib
    export JAVA_HOME PATH CLASSPATH

    # hadoop
    export HADOOP_HOME=/bigdata/hadoop/server/hadoop-3.2.4/
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    # spark
    export SPARK_HOME=/bigdata/spark-3.2.4
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PYSPARK_PYTHON=/usr/local/anaconda3/envs/pyspark/bin/python3.10
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
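
    After editing, distribute the file and re-source it, e.g.:

    for host in hadoop02 hadoop03; do
        scp /etc/profile $host:/etc/profile
    done
    source /etc/profile    # run on each node, or simply log in again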

    1.3.11 prometheus_spark.yml

    After editing, scp this file to every Spark node!

    (base) [root@hadoop01 ~]# cd /bigdata/spark-3.2.4/
    (base) [root@hadoop01 /bigdata/spark-3.2.4]# vim prometheus_spark.yml
    rules:
      # These come from the master
      # Example: master.aliveWorkers
      - pattern: "metrics<name=master\\.(.*)><>Value"
        name: spark_master_$1
      # These come from the worker
      # Example: worker.coresFree
      - pattern: "metrics<name=worker\\.(.*)><>Value"
        name: spark_worker_$1
      # These come from the application driver
      # Example: app-20160809000059-0000.driver.DAGScheduler.stage.failedStages
      - pattern: "metrics<name=(.*)\\.driver\\.(DAGScheduler|BlockManager|jvm)\\.(.*)><>Value"
        name: spark_driver_$2_$3
        type: GAUGE
        labels:
          app_id: "$1"
      # These come from the application driver
      # Emulate timers for DAGScheduler like messageProcessingTime
      - pattern: "metrics<name=(.*)\\.driver\\.DAGScheduler\\.(.*)><>Count"
        name: spark_driver_DAGScheduler_$2_total
        type: COUNTER
        labels:
          app_id: "$1"
      - pattern: "metrics<name=(.*)\\.driver\\.HiveExternalCatalog\\.(.*)><>Count"
        name: spark_driver_HiveExternalCatalog_$2_total
        type: COUNTER
        labels:
          app_id: "$1"
      # These come from the application driver
      # Emulate histograms for CodeGenerator
      - pattern: "metrics<name=(.*)\\.driver\\.CodeGenerator\\.(.*)><>Count"
        name: spark_driver_CodeGenerator_$2_total
        type: COUNTER
        labels:
          app_id: "$1"
      # These come from the application driver
      # Emulate timer (keep only count attribute) plus counters for LiveListenerBus
      - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Count"
        name: spark_driver_LiveListenerBus_$2_total
        type: COUNTER
        labels:
          app_id: "$1"
      # Get Gauge type metrics for LiveListenerBus
      - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Value"
        name: spark_driver_LiveListenerBus_$2
        type: GAUGE
        labels:
          app_id: "$1"
      # These come from the application driver if it's a streaming application
      # Example: app-20160809000059-0000.driver.com.example.ClassName.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay
      - pattern: "metrics<name=(.*)\\.driver\\.(.*)\\.StreamingMetrics\\.streaming\\.(.*)><>Value"
        name: spark_driver_streaming_$3
        labels:
          app_id: "$1"
          app_name: "$2"
      # These come from the application driver if it's a structured streaming application
      # Example: app-20160809000059-0000.driver.spark.streaming.QueryName.inputRate-total
      - pattern: "metrics<name=(.*)\\.driver\\.spark\\.streaming\\.(.*)\\.(.*)><>Value"
        name: spark_driver_structured_streaming_$3
        labels:
          app_id: "$1"
          query_name: "$2"
      # These come from the application executors
      # Examples:
      #   app-20160809000059-0000.0.executor.threadpool.activeTasks (value)
      #   app-20160809000059-0000.0.executor.JvmGCtime (counter)
      # filesystem metrics are declared as gauge metrics, but are actually counters
      - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.filesystem\\.(.*)><>Value"
        name: spark_executor_filesystem_$3_total
        type: COUNTER
        labels:
          app_id: "$1"
          executor_id: "$2"
      - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Value"
        name: spark_executor_$3
        type: GAUGE
        labels:
          app_id: "$1"
          executor_id: "$2"
      - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Count"
        name: spark_executor_$3_total
        type: COUNTER
        labels:
          app_id: "$1"
          executor_id: "$2"
      - pattern: "metrics<name=(.*)\\.(.*)\\.ExecutorMetrics\\.(.*)><>Value"
        name: spark_executor_$3
        type: GAUGE
        labels:
          app_id: "$1"
          executor_id: "$2"
      # These come from the application executors
      # Example: app-20160809000059-0000.0.jvm.threadpool.activeTasks
      - pattern: "metrics<name=(.*)\\.([0-9]+)\\.(jvm|NettyBlockTransfer)\\.(.*)><>Value"
        name: spark_executor_$3_$4
        type: GAUGE
        labels:
          app_id: "$1"
          executor_id: "$2"
      - pattern: "metrics<name=(.*)\\.([0-9]+)\\.HiveExternalCatalog\\.(.*)><>Count"
        name: spark_executor_HiveExternalCatalog_$3_total
        type: COUNTER
        labels:
          app_id: "$1"
          executor_id: "$2"
      # These come from the application driver
      # Emulate histograms for CodeGenerator
      - pattern: "metrics<name=(.*)\\.([0-9]+)\\.CodeGenerator\\.(.*)><>Count"
        name: spark_executor_CodeGenerator_$3_total
        type: COUNTER
        labels:
          app_id: "$1"
          executor_id: "$2"

    1.3.12 spark-env.sh

    After editing, scp this file to every Spark node!

    (base) [root@hadoop01 ~]# cd /bigdata/spark-3.2.4/conf/
    (base) [root@hadoop01 /bigdata/spark-3.2.4/conf]# vim spark-env.sh
    export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30012:/bigdata/spark-3.2.4/prometheus_spark.yml"
    export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30013:/bigdata/spark-3.2.4/prometheus_spark.yml"
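
    Once both files are synced and the standalone daemons restarted, the master and worker agents should answer on ports 30012 and 30013, for example:

    # Restart the Spark standalone cluster so the new opts take effect
    /bigdata/spark-3.2.4/sbin/stop-all.sh
    /bigdata/spark-3.2.4/sbin/start-all.sh
    curl -s http://hadoop01:30012/metrics | grep -m 3 '^spark_master_'
    curl -s http://hadoop01:30013/metrics | grep -m 3 '^spark_worker_'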

    1.3.13 hive 

    (base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim bin/hive
    ···
    if [[ "$SERVICE" =~ ^(help|version|orcfiledump|rcfilecat|schemaTool|cleardanglingscratchdir|metastore|beeline|llapstatus|llap)$ ]] ; then
      # For the metastore service, extend HADOOP_CLIENT_OPTS with the jmx agent
      if [[ "$SERVICE" == "metastore" ]] ; then
        export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30014:/bigdata/apache-hive-3.1.2/prometheus_metastore.yaml"
      fi
      SKIP_HBASECP=true
    fi
    ···
    if [[ "$SERVICE" =~ ^(hiveserver2|beeline|cli)$ ]] ; then
      # For the hiveserver2 service, extend HADOOP_CLIENT_OPTS with the jmx agent
      if [[ "$SERVICE" == "hiveserver2" ]] ; then
        export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -javaagent:/monitor/jmx_prometheus_javaagent-0.19.0.jar=30015:/bigdata/apache-hive-3.1.2/prometheus_hs2.yaml"
      fi
      # If process is backgrounded, don't change terminal settings
      if [[ ( ! $(ps -o stat= -p $$) =~ "+" ) && ! ( -p /dev/stdin ) && ( ! $(ps -o tty= -p $$) =~ "?" ) ]]; then
        export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Djline.terminal=jline.UnsupportedTerminal"
      fi
    fi
    ···

    1.3.14 prometheus_metastore.yaml and prometheus_hs2.yaml

    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim prometheus_metastore.yaml
    ---
    startDelaySeconds: 0
    ssl: false
    lowercaseOutputName: false
    lowercaseOutputLabelNames: false
    rules:
      - pattern: ".*"

    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim prometheus_hs2.yaml
    ---
    startDelaySeconds: 0
    ssl: false
    lowercaseOutputName: false
    lowercaseOutputLabelNames: false
    rules:
      - pattern: ".*"
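
    With both YAML files in place, restart the two Hive services and spot-check ports 30014 and 30015 (the watchdog script in section 2.2.2 starts them the same way):

    nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore   > /tmp/metastore.log 2>&1 &
    nohup /bigdata/apache-hive-3.1.2/bin/hive --service hiveserver2 > /tmp/hs2.log 2>&1 &
    sleep 30    # give both JVMs time to start
    curl -s http://hadoop01:30014/metrics | head -n 3
    curl -s http://hadoop01:30015/metrics | head -n 3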

    1.4 Create the systemd services

    1.4.1 Create the prometheus user

            Create it on every node! (Alternatively, skip creating the prometheus user and change prometheus to root in the service files below.)

    useradd -M -s /usr/sbin/nologin prometheus
    chown -R prometheus:prometheus /monitor

    1.4.2 alertmanager.service 

    [root@hadoop01 ~]# vim /usr/lib/systemd/system/alertmanager.service
    [Unit]
    Description=Alertmanager
    Documentation=https://prometheus.io/docs/alerting/alertmanager/
    After=network-online.target
    Wants=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/monitor/alertmanager/alertmanager \
      --config.file=/monitor/alertmanager/alertmanager.yml \
      --storage.path=/monitor/alertmanager/data \
      --web.listen-address=0.0.0.0:9093
    ExecReload=/bin/kill -HUP $MAINPID
    Restart=always

    [Install]
    WantedBy=multi-user.target

    1.4.3 prometheus.service 

    [root@hadoop01 ~]# vim /usr/lib/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus Server
    Documentation=https://prometheus.io/docs/introduction/overview/
    After=network-online.target

    [Service]
    Type=simple
    User=prometheus
    Group=prometheus
    WorkingDirectory=/monitor/prometheus
    ExecStart=/monitor/prometheus/prometheus \
      --web.listen-address=0.0.0.0:9090 \
      --storage.tsdb.path=/monitor/prometheus/data \
      --storage.tsdb.retention.time=30d \
      --config.file=prometheus.yml \
      --web.enable-lifecycle
    ExecReload=/bin/kill -s HUP $MAINPID
    ExecStop=/bin/kill -s QUIT $MAINPID
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    1.4.4 node_exporter.service 

    [root@hadoop01 ~]# vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=Node Exporter
    Documentation=https://github.com/prometheus/node_exporter
    After=network-online.target
    Wants=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/monitor/node_exporter/node_exporter

    [Install]
    WantedBy=multi-user.target

    1.4.5 pushgateway.service 

    [root@hadoop01 ~]# vim /usr/lib/systemd/system/pushgateway.service
    [Unit]
    Description=Pushgateway Server
    Documentation=https://github.com/prometheus/pushgateway
    After=network-online.target
    Wants=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/monitor/pushgateway/pushgateway \
      --web.listen-address=:9091 \
      --web.telemetry-path=/metrics
    Restart=always

    [Install]
    WantedBy=multi-user.target

    1.4.6 grafana.service 

    [root@hadoop01 ~]# vim /usr/lib/systemd/system/grafana.service
    [Unit]
    Description=Grafana Server
    Documentation=http://docs.grafana.org
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    User=prometheus
    Group=prometheus
    ExecStart=/monitor/grafana/bin/grafana-server \
      --config=/monitor/grafana/conf/defaults.ini \
      --homepath=/monitor/grafana
    Restart=on-failure
    RestartSec=10
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=grafana
    Environment=GRAFANA_HOME=/monitor/grafana \
                GRAFANA_USER=prometheus \
                GRAFANA_GROUP=prometheus

    [Install]
    WantedBy=multi-user.target

    1.5 Start the services

    Enable and start all of the services defined above.
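
    On hadoop01, for example (node_exporter must also be enabled and started on hadoop02 and hadoop03):

    systemctl daemon-reload
    systemctl enable --now alertmanager prometheus node_exporter pushgateway grafana
    # Spot-check one of them
    systemctl status prometheus --no-pager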

    Note: the prometheus-webhook-dingtalk service is not managed by systemd here and has to be started as follows:

    cd /monitor/prometheus-webhook-dingtalk/
    nohup ./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --config.file="/monitor/prometheus-webhook-dingtalk/config.yml" &

    2. Supplementary notes

    2.1 Alert rule and Grafana dashboard downloads

    [root@hadoop01 ~]# cd /monitor/prometheus/
    [root@hadoop01 /monitor/prometheus]# ls
    console_libraries  consoles  data  LICENSE  NOTICE  prometheus  prometheus.yml  promtool  rule
    [root@hadoop01 /monitor/prometheus]# ls rule/
    HDFS.yml  node.yml  spark_master.yml  spark_worker.yml  yarn.yml  zookeeper.yml

    Download link: "prometheus alert rule files and grafana dashboard files" (free resource, CSDN library).

    2.2 Process watchdog scripts

    2.2.1 prometheus-webhook-dingtalk

    [root@hadoop01 ~]# cd /monitor/prometheus-webhook-dingtalk/
    [root@hadoop01 /monitor/prometheus-webhook-dingtalk]# vim monitor_prometheus_webhook_dingtalk.sh
    #!/bin/bash

    # Current system time
    current_time=$(date "+%Y-%m-%d %H:%M:%S")

    # Log file path
    log_file="/monitor/prometheus-webhook-dingtalk/monitor.log"

    echo "[$current_time] Checking if prometheus-webhook-dingtalk process is running..." >> $log_file

    # Check whether the process is running
    if ! /usr/bin/pgrep -fx "/monitor/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --web.listen-address=0.0.0.0:8060 --config.file=/monitor/prometheus-webhook-dingtalk/config.yml" >> $log_file; then
        echo "[$current_time] prometheus-webhook-dingtalk process is not running. Starting it now..." >> $log_file
        # Start it in the background with an absolute path and nohup
        /usr/bin/nohup /monitor/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --config.file="/monitor/prometheus-webhook-dingtalk/config.yml" >> /monitor/prometheus-webhook-dingtalk/output.log 2>&1 &
    else
        echo "[$current_time] prometheus-webhook-dingtalk process is running." >> $log_file
    fi

    [root@hadoop01 /monitor/prometheus-webhook-dingtalk]# chmod 777 monitor_prometheus_webhook_dingtalk.sh
    [root@hadoop01 /monitor/prometheus-webhook-dingtalk]# crontab -e
    * * * * * /usr/bin/bash /monitor/prometheus-webhook-dingtalk/monitor_prometheus_webhook_dingtalk.sh

    2.2.2 hive 

    (base) [root@hadoop01 ~]# cd /bigdata/apache-hive-3.1.2/
    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# vim monitor_hive.sh
    #!/bin/bash

    # Current system time
    current_time=$(date "+%Y-%m-%d %H:%M:%S")

    # Log file paths
    log_file_metastore="/bigdata/apache-hive-3.1.2/monitor_metastore.log"
    log_file_hs2="/bigdata/apache-hive-3.1.2/monitor_hs2.log"

    echo "[$current_time] Checking if hive metastore and hs2 processes are running..."

    # Check whether the Hive Metastore is running
    echo "[$current_time] Checking if hive metastore process is running..." >> $log_file_metastore
    if ! /usr/bin/pgrep -f "hive-metastore-3.1.2.jar" >> $log_file_metastore; then
        echo "[$current_time] hive metastore process is not running. Starting it now..." >> $log_file_metastore
        # Start it in the background with an absolute path and nohup
        /usr/bin/nohup /bigdata/apache-hive-3.1.2/bin/hive --service metastore >> /bigdata/apache-hive-3.1.2/metastore_output.log 2>&1 &
        # Give the metastore some time to come up fully
        sleep 30
    else
        echo "[$current_time] hive metastore process is running." >> $log_file_metastore
    fi

    # Check whether HiveServer2 is running
    echo "[$current_time] Checking if hive hs2 process is running..." >> $log_file_hs2
    if ! /usr/bin/pgrep -f "HiveServer2" >> $log_file_hs2; then
        echo "[$current_time] hive hs2 process is not running. Starting it now..." >> $log_file_hs2
        # Start it in the background with an absolute path and nohup
        /usr/bin/nohup /bigdata/apache-hive-3.1.2/bin/hive --service hiveserver2 >> /bigdata/apache-hive-3.1.2/hs2_output.log 2>&1 &
    else
        echo "[$current_time] hive hs2 process is running." >> $log_file_hs2
    fi

    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# chmod 777 monitor_hive.sh
    (base) [root@hadoop01 /bigdata/apache-hive-3.1.2]# crontab -e
    * * * * * /usr/bin/bash /bigdata/apache-hive-3.1.2/monitor_hive.sh

    2.2.3 Log rotation

    [root@hadoop01 ~]# vim /etc/logrotate.d/prometheus-webhook-dingtalk
    /monitor/prometheus-webhook-dingtalk/monitor.log
    /bigdata/apache-hive-3.1.2/monitor_metastore.log
    /bigdata/apache-hive-3.1.2/monitor_hs2.log {
        daily
        rotate 7
        size 150M
        compress
        maxage 30
        missingok
        notifempty
        create 0644 root root
        copytruncate
    }

    # Dry-run the logrotate config
    [root@hadoop01 ~]# logrotate -d /etc/logrotate.d/prometheus-webhook-dingtalk
    # Force a rotation manually
    logrotate -f /etc/logrotate.d/prometheus-webhook-dingtalk

    2.3 Handy commands

    # Validate the prometheus config, including the alert rule files
    [root@hadoop01 ~]# cd /monitor/prometheus
    ./promtool check config prometheus.yml

    # Hot-reload the prometheus config (works because --web.enable-lifecycle is set)
    curl -X POST http://localhost:9090/-/reload

    # Send a test message to the DingTalk robot
    curl 'https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "text","text": {"content":"I am what I am, a different kind of firework"}}'

    2.4 References

    • Official Hadoop metrics: https://hadoop.apache.org/docs/r3.2.4/hadoop-project-dist/hadoop-common/Metrics.html

    • Alibaba Cloud HDFS metrics: https://help.aliyun.com/zh/emr/emr-on-ecs/user-guide/hdfs-metrics?spm=a2c4g.11186623.0.0.11ba6daalnmBWn

    • Alibaba Cloud Grafana dashboards: https://help.aliyun.com/document_detail/2326798.html?spm=a2c4g.462292.0.0.4c4c5d35uXCP6k#section-1bn-bzq-fw3

    • jmx_exporter example configs: https://github.com/prometheus/jmx_exporter/tree/main/example_configs

    • DingTalk custom robot docs: https://open.dingtalk.com/document/robots/custom-robot-access

    2.5 Intranet deployment

    2.5.1 nginx (public network)

    An nginx server with public internet access is needed to proxy the DingTalk API:

    [root@idc-master-02 ~]# cat /etc/nginx/conf.d/uat-prometheus-webhook-dingtalk.conf
    server {
        listen 30080;

        location /robot/send {
            proxy_pass https://oapi.dingtalk.com;
            proxy_set_header Host oapi.dingtalk.com;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }

    2.5.2 config 

    [root@localhost-13 ~]# cat /opt/prometheus-webhook-dingtalk/config.yml
    targets:
      webhook1:
        url: http://10.0.4.11:30080/robot/send?access_token=0d6c5dc25fa3f79cf2f83c92705fe4594dcc5b3xxx
        secret: SECecdbfff858ab8f3195dc34b7e225fee93xxx
        message:
          title: '{{ template "ops.title" . }}'
          text: '{{ template "ops.content" . }}'

    2.5.3 /etc/hosts 

    [root@localhost-13 ~]# cat /etc/hosts
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
    10.0.4.11   oapi.dingtalk.com
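
    To confirm the intranet path end to end, a test message can be posted through the nginx proxy (access_token elided as above):

    curl 'http://10.0.4.11:30080/robot/send?access_token=xxxxxxxx' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "text","text": {"content":"intranet proxy test"}}'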