• 故障解析丨Clone节点导致主从故障


    1.背景概述

    在一次主从复制架构中,由于主节点binlog损坏,导致从节点无法正常同步数据,只能重做从节点;因此使用MySQL 8.0.17开始提供的clone技术进行恢复,恢复后的2天都发生了主从报错数据冲突。

    通过解析binlog发现,同一时刻主从节点都在执行同一条语句,因此询问业务是否在主从节点都执行了定时任务,业务回复定时任务只在主节点执行。

    最后排查发现,克隆后的从节点的定时任务也会是开启的状态,因此同一时刻,主从节点同时执行定时任务,导致主从报错,最终将从节点的定时任务关闭后解决此问题。

    2.问题复现

    本次测试基于 GreatSQL 8.0.32-24

    greatsql> SELECT VERSION();
    +-----------+
    | VERSION() |
    +-----------+
    | 8.0.32-24 |
    +-----------+
    1 row in set (0.00 sec)
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    1.搭建一套主从架构

    2.创建event

    greatsql> create database test;
    greatsql> use test;
    greatsql> CREATE TABLE `test` (
     `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
     `now` datetime DEFAULT NULL COMMENT '时间',
     PRIMARY KEY (`id`)
    );
    
    greatsql> CREATE EVENT event_test 
    ON SCHEDULE EVERY 1 MINUTE
    ON COMPLETION PRESERVE 
    ENABLE 
    COMMENT '每隔1分钟向test表插入记录'
    DO INSERT INTO test VALUES(NULL, now());
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    3.查看event状态

    主节点,默认情况下event状态为 ENABLED

    greatsql> show events;
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
    | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status  | Originator | character_set_client | collation_connection | Database Collation |
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
    | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-10-12 17:11:14 | NULL | ENABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
    1 row in set (0.00 sec)
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    从节点,默认情况下event状态为 SLAVESIDE_DISABLED

    greatsql> show events;
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
    | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status       | Originator | character_set_client | collation_connection | Database Collation |
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
    | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-10-12 17:11:14 | NULL | SLAVESIDE_DISABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
    +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
    1 row in set (0.00 sec)
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    4.查看数据

    greatsql> select * from test.test;
    +----+---------------------+
    | id | now                 |
    +----+---------------------+
    |  1 | 2023-08-08 16:00:39 |
    |  2 | 2023-08-08 16:01:39 |
    |  3 | 2023-08-08 16:02:39 |
    +----+---------------------+
    3 rows in set (0.00 sec)
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    5.从节点进行克隆

    # 安装克隆插件,主从节点都需要

    greatsql> install plugin clone soname 'mysql_clone.so';

      # 从节点进行clone

      greatsql> set global clone_valid_donor_list='172.17.137.162:6001';
      
      greatsql> clone instance from root@'172.17.137.162':6001 identified by 'greatsql';
      • 1
      • 2

      6.重新建立主从复制

      greatsql> change master to master_user='root',master_password='greatsql',master_host='172.17.137.162',master_port=6001,master_auto_position=1;
      Query OK, 0 rows affected, 7 warnings (0.04 sec)
      
      greatsql> start slave;
      Query OK, 0 rows affected, 1 warning (0.04 sec)
      • 1
      • 2
      • 3
      • 4

      7.查看主从状态

      greatsql> show slave status\G
      *************************** 1. row ***************************
                     Slave_IO_State: Waiting for source to send event
                        Master_Host: 172.17.137.162
                        Master_User: root
                        Master_Port: 6001
                      Connect_Retry: 60
                    Master_Log_File: binlog.000001
                Read_Master_Log_Pos: 2959
                     Relay_Log_File: relaylog.000002
                      Relay_Log_Pos: 395
              Relay_Master_Log_File: binlog.000001
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB: 
                Replicate_Ignore_DB: 
                 Replicate_Do_Table: 
             Replicate_Ignore_Table: 
            Replicate_Wild_Do_Table: 
        Replicate_Wild_Ignore_Table: 
                         Last_Errno: 1062
                         Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 2307
                    Relay_Log_Space: 1242
                    Until_Condition: None
                     Until_Log_File: 
      greatsql> select * from performance_schema.replication_applier_status_by_worker limit 1\G
      *************************** 1. row ***************************
                                                 CHANNEL_NAME: 
                                                    WORKER_ID: 1
                                                    THREAD_ID: NULL
                                                SERVICE_STATE: OFF
                                            LAST_ERROR_NUMBER: 1062
                                           LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606; Could not execute Write_rows event on table test.test; Duplicate entry '5' for key 'test.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 2606
                                         LAST_ERROR_TIMESTAMP: 2023-08-08 16:03:39.033240
                                     LAST_APPLIED_TRANSACTION: 
           LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
          LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
               LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
                 LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
                                         APPLYING_TRANSACTION: e8bf88f9-2acd-11ee-a98a-00163e605c74:8
               APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
              APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
                   APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2023-08-08 16:03:39.032510
                       LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
         LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
        LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: 
      LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           APPLYING_TRANSACTION_RETRIES_COUNT: 0
             APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
            APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: 
          APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
      1 row in set (0.00 sec)
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
      • 28
      • 29
      • 30
      • 31
      • 32
      • 33
      • 34
      • 35
      • 36
      • 37
      • 38
      • 39
      • 40
      • 41
      • 42
      • 43
      • 44
      • 45
      • 46
      • 47
      • 48
      • 49
      • 50
      • 51
      • 52
      • 53

      可以看到从节点报错发生了主键冲突。

      8.查看从节点定时任务状态

      当前从节点定时任务状态为 ENABLED

      greatsql> show events;
      +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
      | Db  | Name    | Definer | Time zone | Type    | Execute at | Interval value | Interval field | Starts        | Ends | Status  | Originator | character_set_client | collation_connection | Database Collation |
      +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
      | test | event_test | root@%  | SYSTEM   | RECURRING | NULL    | 1        | MINUTE     | 2023-08-08 15:58:45 | NULL | ENABLED |      1 | utf8mb4        | utf8mb4_unicode_ci  | utf8mb4_unicode_ci |
      +------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+---------------------+----------------------+--------------------+
      1 row in set (0.00 sec)
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6

      可以看到由于从节点的定时任务也执行了,从节点写入数据,导致主键冲突。

      9.故障解决

      greatsql> alter event event_test DISABLE;
      Query OK, 0 rows affected (0.01 sec)
      • 1

      关闭从节点的定时任务event,然后跳过主键冲突的报错,最后重新启动主从复制。

      3.总结

      1.如果主库有定时任务,通过clone的方式搭建从库,在从库恢复之后需要关闭定时任务,避免主从同时执行定时任务导致主从故障。

      2.克隆时,如果捐赠节点有主从复制信息,则克隆后的接收节点也会克隆此复制信息,并在克隆完成自动重启实例后,自动启动复制;避免此问题可以在接收节点的配置文件中增加 skip-slave-start,避免节点重启后自动启动复制。


      Enjoy GreatSQL :)

      关于 GreatSQL

      GreatSQL是适用于金融级应用的国内自主开源数据库,具备高性能、高可靠、高易用性、高安全等多个核心特性,可以作为MySQL或Percona Server的可选替换,用于线上生产环境,且完全免费并兼容MySQL或Percona Server。

      相关链接: GreatSQL社区 Gitee GitHub Bilibili

      GreatSQL社区:

      image

      社区有奖建议反馈: https://greatsql.cn/thread-54-1-1.html

      社区博客有奖征稿详情: https://greatsql.cn/thread-100-1-1.html

      (对文章有疑问或者有独到见解都可以去社区官网提出或分享哦~)

      技术交流群:

      微信&QQ群:

      QQ群:533341697

      微信群:添加GreatSQL社区助手(微信号:wanlidbc )好友,待社区助手拉您进群。

    • 相关阅读:
      Springboot 跨域 session 不一致的问题
      深入浅出 多线程编程(2)
      cJson堆内存释放问题
      windows消息的处理过程 消息队列
      前端常用操作(一)
      服务器数据恢复—VMware虚拟化下误操作导致服务器崩溃的数据恢复案例
      WinccOA 学习笔记
      C语言解决约瑟夫环问题
      gson的工具类JsonUtils
      Protocol Buffers语法
    • 原文地址:https://blog.csdn.net/GreatSQL2021/article/details/134028173