• KingbbaseES V8R6集群维护案例之---集群之间数据迁移


    案例说明:
    生产环境是集群环境,测试环境是集群,现需要将生产环境的数据迁移到测试集群中运行,本文档详细介绍了从集群环境迁移数据的操作步骤,可以作为生产环境迁移数据的参考。

    适用版本: KingbaseES V8R6

    本案例数据库版本(集群使用相同的版本):

    1. test=# select version();
    2. version
    3. ----------------------------------------------------------------------------------------------------------------------
    4. KingbaseES V008R006C005B0041 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
    5. (1 row)

    生产集群节点信息:

    1. [kingbase@node1 bin]$ ./repmgr cluster show
    2. ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    3. ----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
    4. 1 | node1 | primary | * running | | default | 100 | 1 | host=192.168.1.201 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    5. 2 | node2 | standby | running | node1 | default | 100 | 1 | host=192.168.1.202 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

    测试集群节点信息:

    1. [kingbase@node101 bin]$ ./repmgr cluster show
    2. ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    3. ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
    4. 1 | node101 | primary | * running | | default | 100 | 13 | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    5. 2 | node102 | standby | running | node101 | default | 100 | 13 | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

    一、生产环境迁移数据前的准备

    1、生产环境数据信息

    1. prod=# \l
    2. List of databases
    3. Name | Owner | Encoding | Collate | Ctype | Access privileges
    4. -----------+--------+----------+----------+-------------+-------------------
    5. esrep | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    6. prod | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    7. prod1 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    8. prod2 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    9. security | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    10. template0 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 | =c/system +
    11. | | | | | system=CTc/system
    12. template1 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 | =c/system +
    13. | | | | | system=CTc/system
    14. test | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    15. (8 rows)
    16. prod=# select count(*) from t2;
    17. count
    18. --------
    19. 100000
    20. (1 row)

    2、关闭生产集群

    1. [kingbase@node1 bin]$ ./sys_monitor.sh stop
    2. 2022-06-20 16:00:46 Ready to stop all DB ...
    3. .......
    4. 2022-06-20 16:01:02 DB on "[192.168.1.202]" stop success.
    5. 2022-06-20 16:01:02 begin to stop DB on "[192.168.1.101]".
    6. waiting for server to shut down....... done
    7. server stopped
    8. 2022-06-20 16:01:06 DB on "[192.168.1.201]" stop success.
    9. 2022-06-20 16:01:06 Done.

    二、迁移生产数据到测试环境

    Tips:
    1)将生产数据迁移到集群,需要停止生产数据库服务,根据data目录数据的大小,要估算停机窗口时间。
    2)在生产数据库前,建议手工创建检查点,如果wal日志比较大,建议备份后,清理wal日志,只需要保留最近一天的日志到最近检查点后即可。
    3)需要跨主机将生产库主库data目录拷贝到集群的主备库节点,需 根据网络带宽和节点数,估算整个拷贝时间。

    1、关闭测试集群

    1. [kingbase@node101 bin]$ ./sys_monitor.sh stop
    2. 2022-06-20 16:10:46 Ready to stop all DB ...
    3. 2022-06-20 16:11:02 DB on "[192.168.1.102]" stop success.
    4. 2022-06-20 16:11:02 begin to stop DB on "[192.168.1.101]".
    5. waiting for server to shut down....... done
    6. server stopped
    7. 2022-06-20 16:11:06 DB on "[192.168.1.101]" stop success.
    8. 2022-06-20 16:11:06 Done.

    2、将测试库数据备份[kingbase@node101 kingbase]$ mv data data.bk

    3、拷贝生产集群主库data到测试集群主备库(所有节点)

    1)拷贝生产数据到测试库

    1. [kingbase@node1 kingbase]$ scp -r data node101:/home/kingbase/cluster/R6HA/kha/kingbase/
    2. [kingbase@node1 kingbase]$ scp -r data node102:/home/kingbase/cluster/R6HA/kha/kingbase/

    2)备库创建standby.signal[kingbase@node102 data]$ touch standby.signal

    3)复制测试集群的数据库配置文件:(所有节点)

    1. [kingbase@node101 data]$ cp ../data.bk/kingbase.auto.conf ./
    2. [kingbase@node101 data]$ cp ../data.bk/kingbase.conf ./
    3. [kingbase@node101 data]$ cp ../data.bk/es_rep.conf ./
    4. # 查看kingbase.auto.conf
    5. [kingbase@node101 data]$ cat kingbase.auto.conf
    6. # Do not edit this file manually!
    7. # It will be overwritten by the ALTER SYSTEM command.
    8. enable_upper_colname = 'on'
    9. wal_retrieve_retry_interval = '5000'
    10. primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
    11. recovery_target_timeline = 'latest'
    12. primary_slot_name = 'repmgr_slot_1'
    13. synchronous_standby_names = ''

    三、重新注册集群节点

    Tips:
    因为data数据中存储的是原生产集群的节点信息(esrep库),所以要根据测试库的repmgr.conf文件重新注册节点。

    1、启动主备库数据库服务

    1. [kingbase@node101 bin]$ ./sys_ctl restart -D ../data
    2. waiting for server to shut down.... done
    3. .......
    4. server started

    2、查看节点状态信息

    1. [kingbase@node101 bin]$ ./repmgr cluster show
    2. ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    3. ----+-------+---------+---------------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
    4. 1 | node1 | primary | ? unreachable | | default | 100 | ? | host=192.168.1.201 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    5. 2 | node2 | standby | - failed | node1 | default | 100 | ? | host=192.168.1.202 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    6. WARNING: following issues were detected
    7. - unable to connect to node "node1" (ID: 1)
    8. - node "node1" (ID: 1) is registered as an active primary but is unreachable
    9. - unable to connect to node "node2" (ID: 2)
    10. # 如上所示,因为repmgr.conf和esrep库的注册信息不一致,现在集群节点处于非正常状态。

    3、重新注册主备库1)注册主库

    1. [kingbase@node101 bin]$ ./repmgr primary register --force
    2. INFO: connecting to primary database...
    3. INFO: "repmgr" extension is already installed
    4. NOTICE: primary node record (ID: 1) updated
    5. [kingbase@node101 bin]$ ./repmgr cluster show
    6. ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    7. ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
    8. 1 | node101 | primary | * running | | default | 100 | 1 | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    9. 2 | node2 | standby | - failed | node101 | default | 100 | ? | host=192.168.1.202 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    10. WARNING: following issues were detected
    11. - unable to connect to node "node2" (ID: 2)

    2)注册备库

    1. [kingbase@node102 bin]$ ./repmgr standby register --force
    2. INFO: connecting to local node "node102" (ID: 2)
    3. INFO: connecting to primary database
    4. INFO: standby registration complete
    5. NOTICE: standby node "node102" (ID: 2) successfully registered
    6. [kingbase@node102 bin]$ ./repmgr cluster show
    7. ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
    8. ----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
    9. 1 | node101 | primary | * running | | default | 100 | 1 | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    10. 2 | node102 | standby | running | node101 | default | 100 | 1 | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
    11. # 如上所示,节点注册完成后,集群节点状态正常。

    四、查看流复制状态

    Tips:
    如果生产集群和测试集群的流复制复制槽名称不一致,可能需要重建复制槽。

    1. # 复制槽信息
    2. test=# select * from sys_replication_slots;
    3. slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    4. ---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
    5. repmgr_slot_2 | | physical | | | f | t | 26620 | 963 | | 0/4E000FB8 |
    6. repmgr_slot_1 | | physical | | | f | f | | | | |
    7. (2 rows)
    8. # 流复制状态信息
    9. test=# select * from sys_stat_replication;
    10. pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
    11. -------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+----
    12. 26620 | 10 | system | node102 | 192.168.1.102 | | 49454 | 2022-06-20 17:08:15.347515+08 | | streaming | 0/4E000FB8 | 0/4E000FB8 | 0/4E000FB8 | 0/4E000FB8 | | | | 1 | sync | 2022-06-20 17:11:44.277666+08
    13. (1 row)

    五、验证数据

    1、查看迁移后数据

    1. test=# \l
    2. List of databases
    3. Name | Owner | Encoding | Collate | Ctype | Access privileges
    4. -----------+--------+----------+----------+-------------+-------------------
    5. esrep | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    6. prod | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    7. prod1 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    8. prod2 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    9. security | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    10. template0 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 | =c/system +
    11. | | | | | system=CTc/system
    12. template1 | system | UTF8 | ci_x_icu | zh_CN.UTF-8 | =c/system +
    13. | | | | | system=CTc/system
    14. test | system | UTF8 | ci_x_icu | zh_CN.UTF-8 |
    15. (8 rows)
    16. prod=# select count(*) from t2;
    17. count
    18. --------
    19. 100000
    20. (1 row)

    2、重启集群

    1. [kingbase@node101 bin]$ ./sys_monitor.sh restart
    2. 2022-06-20 17:12:39 Ready to stop all DB ...
    3. .......
    4. 2022-06-20 17:13:14 repmgrd on "[192.168.1.102]" start success.
    5. ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
    6. ----+---------+---------+-----------+----------+---------+-------+---------+--------------------
    7. 1 | node101 | primary | * running | | running | 28907 | no | n/a
    8. 2 | node102 | standby | running | node101 | running | 26280 | no | 2 second(s) ago
    9. [2022-06-20 17:13:19] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/log/kbha.log"
    10. [2022-06-20 17:13:22] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/kha/kingbase/log/kbha.log"
    11. 2022-06-20 17:13:29 Done.

    六、总结

    1. 1、从集群环境迁移数据,如果需要保证数据一致,必须要将集群停库,对于生产环境,要考虑停机窗口。
    2. 2、如果需要将目标集群数据重新加载到新的集群,需要将目标集群数据做逻辑备份,但是在导入时如果有重复数据需注意处理。(如将测试数据再导入到迁移后的集群中,可能有许多数据会重复)。
    3. 3、申请停机窗口,要考虑源集群数据量的大小、主机间的网络带宽、集群节点数、集群配置时间、集群启动故障的处理时间等。
  • 相关阅读:
    无法通过ssh连接到主机:root@node1:权限被拒绝
    私域电商模式全解析:掌握这些方法,让你的生意不再难做!
    基于风险评估标准内容演变理解安全运营工作本质
    【Leetcode】745. Prefix and Suffix Search
    【病害识别】基于matlab随机森林苹果病害识别【含Matlab源码 2211期】
    音视频从入门到精通——FFmpeg结构体:AVStream分析
    时序预测 | MATLAB实现ARIMA时间序列预测(GDP预测)
    Mooctest
    亚马逊鲲鹏系统可全自动化批量操作亚马逊买家号
    银河麒麟服务器系统使用的一些问题和解决方案
  • 原文地址:https://blog.csdn.net/lyu1026/article/details/125457220