• KingbaseES V8R6 Cluster Operations Case Study: repmgr standby switchover Failure


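    Case description:
    Running "repmgr standby switchover" on the standby node of a KingbaseES V8R6 cluster failed, and resolving the failure also required a "repmgr standby follow" operation. This case study documents the troubleshooting process step by step.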

    Applicable version: KingbaseES V8R6

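    Cluster nodes (taken from the "repmgr cluster show" output below):
    node101 (ID 1), 192.168.1.101, initially the standby
    node102 (ID 2), 192.168.1.102, initially the primary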

    I. Running the switchover on the standby

    1. Run the switchover

    [kingbase@node101 bin]$ ./repmgr standby switchover -h 192.168.1.102 -U esrep -d esrep
    WARNING: following problems with command line parameters detected:
    database connection parameters not required when executing UNKNOWN ACTION
    NOTICE: executing switchover on node "node101" (ID: 1)
    ERROR: local node "node101" (ID: 1) is not a downstream of demotion candidate primary "node102" (ID: 2)
    DETAIL: local node has no registered upstream node
    HINT: execute "repmgr standby register --force" to update the local node's metadata

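    2. Switchover failure message: the command aborts with the ERROR shown above, because the local node "node101" has no registered upstream node.

    3. Check the cluster node status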

    [kingbase@node101 bin]$ ./repmgr cluster show
     ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
    ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
     1  | node101 | standby |   running |          | default  | 100      | 6        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
     2  | node102 | primary | * running |          | default  | 100      | 6        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

    As shown above, the standby node's upstream is empty, so the switchover cannot be executed.
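The check above can be automated before attempting a switchover. The following is a minimal sketch, not part of the original case: it assumes the column order of the "repmgr cluster show" output shown above (Name, Role and Upstream as the 2nd, 3rd and 5th "|"-separated fields), and the function names `standbys_without_upstream` and `sample_output` are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Print the name of every standby whose Upstream column is empty.
# Reads "repmgr cluster show" output on stdin.
# Assumption: columns are ordered as in the output shown in this article.
standbys_without_upstream() {
    awk -F'|' '
        NF >= 5 {
            name = $2; role = $3; upstream = $5
            # Trim the padding around each field.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            gsub(/^[ \t]+|[ \t]+$/, "", upstream)
            if (role == "standby" && upstream == "") print name
        }'
}

# Sample input modeled on the output above (connection strings shortened).
sample_output() {
cat <<'EOF'
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
 1  | node101 | standby |   running |          | default  | 100      | 6        | host=192.168.1.101
 2  | node102 | primary | * running |          | default  | 100      | 6        | host=192.168.1.102
EOF
}

sample_output | standbys_without_upstream   # prints: node101
```

On a live node the same function could be fed with `./repmgr cluster show | standbys_without_upstream`; any name it prints is a standby that needs "repmgr standby follow" (or re-registration) before a switchover.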

    II. Configuring the standby node's upstream (repmgr standby follow)

    1. Run "repmgr standby follow"

    [kingbase@node101 bin]$ ./repmgr standby follow -h 192.168.1.102 -U esrep -d esrep
    NOTICE: attempting to find and follow current primary
    INFO: timelines are same, this server is not ahead
    DETAIL: local node lsn is 1/CE004F50, follow target lsn is 1/CE004F50
    ERROR: slot "repmgr_slot_1" already exists as an active slot
    NOTICE: STANDBY FOLLOW failed
    DETAIL: slot "repmgr_slot_1" already exists as an active slot
    # The standby's replication slot is in the active state
    test=# select * from sys_replication_slots;
       slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    ---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
     repmgr_slot_1 |        | physical  |        |          | f         | t      |       8596 | 1917 |              | 1/CE005038  |
    (1 row)

    2. Stop the database and drop the standby's replication slot

    # Shut down the database service on the standby
    [kingbase@node101 bin]$ ./sys_ctl stop -D ../data
    waiting for server to shut down.... done
    server stopped
    # Comment out the slot parameter in kingbase.auto.conf
    [kingbase@node101 bin]$ cat ../data/kingbase.auto.conf
    # Do not edit this file manually!
    # It will be overwritten by the ALTER SYSTEM command.
    enable_upper_colname = 'on'
    client_idle_timeout = '0'
    synchronous_standby_names = ''
    wal_retrieve_retry_interval = '5000'
    primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
    recovery_target_timeline = 'latest'
    # primary_slot_name = 'repmgr_slot_1'
    # Check the slot state; it is now inactive
    # (the standby is stopped at this point, so this query is presumably run on the primary, node102)
    test=# select * from sys_replication_slots;
       slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    ---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
     repmgr_slot_1 |        | physical  |        |          | f         | f      |            | 1922 |              | 1/CE005038  |
    (1 row)
    # Drop the standby's replication slot
    test=# select sys_drop_replication_slot('repmgr_slot_1');
     sys_drop_replication_slot
    ---------------------------
    (1 row)
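The listing above shows kingbase.auto.conf after the slot parameter was commented out, but not the edit itself. The following is a minimal sketch of one way to do it, under stated assumptions: the function name `comment_out_primary_slot_name` is hypothetical, the parameter is assumed to start at the beginning of its line as in the file shown, and the server is assumed to be stopped, since the file header warns that ALTER SYSTEM may overwrite manual edits. The demo works on a temporary file rather than a real data directory.

```shell
#!/bin/sh
# Comment out an active primary_slot_name line in the given config file.
# A .bak copy of the original file is kept next to it.
# Assumption: the parameter starts at column 0, as in the file shown above.
comment_out_primary_slot_name() {
    conf_file=$1
    cp "$conf_file" "$conf_file.bak" &&
    sed 's/^primary_slot_name/# &/' "$conf_file.bak" > "$conf_file"
}

# Demo on a throwaway file containing two of the lines shown above.
demo_conf=$(mktemp)
printf '%s\n' \
    "recovery_target_timeline = 'latest'" \
    "primary_slot_name = 'repmgr_slot_1'" > "$demo_conf"

comment_out_primary_slot_name "$demo_conf"
cat "$demo_conf"
```

Against the cluster in this case the call would be `comment_out_primary_slot_name ../data/kingbase.auto.conf`, run from the bin directory while the standby is stopped. Keeping the .bak copy makes it easy to restore the original setting if needed.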

    3. Start the database service and run "repmgr standby follow"

    [kingbase@node101 bin]$ ./sys_ctl start -D ../data
    waiting for server to start....2022-08-09 10:39:50.600 CST [6829] WARNING: enable_upper_colname can only be opened
    ......
    server started
    [kingbase@node101 bin]$ ./repmgr standby follow -h 192.168.1.102 -U esrep -d esrep
    NOTICE: attempting to find and follow current primary
    INFO: timelines are same, this server is not ahead
    DETAIL: local node lsn is 1/CE0052E0, follow target lsn is 1/CE0052E0
    NOTICE: setting node 1's upstream to node 2
    NOTICE: begin to stopp server at 2022-08-09 10:39:55.101228
    NOTICE: stopping server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile -w -t 90 -m fast stop"
    NOTICE: stopp server finish at 2022-08-09 10:39:55.205646
    NOTICE: begin to start server at 2022-08-09 10:39:55.205705
    NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
    NOTICE: start server finish at 2022-08-09 10:39:55.316793
    NOTICE: STANDBY FOLLOW successful
    DETAIL: standby attached to upstream node "node102" (ID: 2)
    # The cluster node status is now normal
    [kingbase@node101 bin]$ ./repmgr cluster show
     ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
    ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
     1  | node101 | standby |   running | node102  | default  | 100      | 6        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
     2  | node102 | primary | * running |          | default  | 100      | 6        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

    III. Running "repmgr standby switchover"

    [kingbase@node101 bin]$ ./repmgr standby switchover -h 192.168.1.102 -U esrep -d esrep
    WARNING: following problems with command line parameters detected:
    database connection parameters not required when executing UNKNOWN ACTION
    NOTICE: executing switchover on node "node101" (ID: 1)
    INFO: The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
    "
    INFO: pausing repmgrd on node "node101" (ID 1)
    INFO: pausing repmgrd on node "node102" (ID 2)
    NOTICE: local node "node101" (ID: 1) will be promoted to primary; current primary "node102" (ID: 2) will be demoted to standby
    NOTICE: stopping current primary node "node102" (ID: 2)
    NOTICE: issuing CHECKPOINT
    DETAIL: executing server command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile -W -m fast stop"
    INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
    INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
    NOTICE: current primary has been cleanly shut down at location 1/D0000028
    NOTICE: promoting standby to primary
    DETAIL: promoting server "node101" (ID: 1) using sys_promote()
    NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    NOTICE: STANDBY PROMOTE successful
    DETAIL: server "node101" (ID: 1) was successfully promoted to primary
    NOTICE: issuing CHECKPOINT
    INFO: local node 2 can attach to rejoin target node 1
    DETAIL: local node's recovery point: 1/D0000028; rejoin target node's fork point: 1/D00000A0
    NOTICE: setting node 2's upstream to node 1
    WARNING: unable to ping "host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
    DETAIL: PQping() returned "PQPING_NO_RESPONSE"
    NOTICE: begin to start server at 2022-08-09 10:46:36.382548
    NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
    NOTICE: start server finish at 2022-08-09 10:46:36.488870
    NOTICE: replication slot "repmgr_slot_1" deleted on node 2
    NOTICE: NODE REJOIN successful
    DETAIL: node 2 is now attached to node 1
    NOTICE: switchover was successful
    DETAIL: node "node101" is now primary and node "node102" is attached as standby
    INFO: unpausing repmgrd on node "node101" (ID 1)
    INFO: unpause node "node101" (ID 1) successfully
    INFO: unpausing repmgrd on node "node102" (ID 2)
    INFO: unpause node "node102" (ID 2) successfully
    NOTICE: STANDBY SWITCHOVER has completed successfully
    # Cluster node status after the switchover
    [kingbase@node101 bin]$ ./repmgr cluster show
     ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
    ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
     1  | node101 | primary | * running |          | default  | 100      | 7        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
     2  | node102 | standby |   running | node101  | default  | 100      | 7        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

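    As shown above, the switchover completed successfully: node101 is now the primary and node102 is attached to it as a standby.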

  • Original article: https://blog.csdn.net/lyu1026/article/details/126317759