• PG14归档失败解决办法archiver failed on wal_lsn


    问题描述

    昨晚Repmgr+PG14主备主库因wal日志撑爆磁盘,删除主库过期wal文件重做备库后上午进行主备状态巡查,主库向备库发送wal文件正常,但是查主库状态时发现显示有1条归档失败的记录。
    postgres: archiver failed on 000000010000006F00000086

    • 主库:

    walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10" walsender正常
    archiver failed on 000000010000006F00000086" 归档失败

    • 备库:

    walreceiver streaming 77/9EB6A198" “” “” " walreceiver正常

    --查主库数据库状态
    [root@pgmaster ~]# systemctl status postgres
    ● postgres.service - PostgreSQL database server
    Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
    Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
    Main PID: 3710970 (postgres)
    Tasks: 53 (limit: 201967)
    Memory: 19.0G
    CGroup: /system.slice/postgres.service
    ├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
    ├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
    ├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
    ├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
    ├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
    ├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
    ├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10"  #wal 发送正常
    
    --查备库状态
    [root@pgslave ~]# systemctl status postgres
    ● postgres.service - PostgreSQL database server
    Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
    Active: active (running) since Fri 2023-10-13 00:12:19 CST; 12h ago
    Process: 1931221 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
    Main PID: 1931223 (postgres)
    Tasks: 7 (limit: 201967)
    Memory: 23.2G
    CGroup: /system.slice/postgres.service
    ├─ 1931223 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
    ├─ 1931224 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 1931225 "postgres: startup recovering 00000001000000770000009E" "" "" "" "" "" "" "" "" ""
    ├─ 1931226 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 1931227 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 1931230 "postgres: walreceiver streaming 77/9EB6A198" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""   #wal接收
    └─ 1931430 "postgres: repmgr repmgr 172.28.32.23(22956) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    
    Oct 13 00:12:17 pgslave systemd[1]: Starting PostgreSQL database server...
    Oct 13 00:12:17 pgslave pg_ctl[1931221]: waiting for server to start....
    Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] LOG:  redirecting log output to logging collector process
    Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] HINT:  Future log output will appear in directory "log".
    Oct 13 00:12:19 pgslave pg_ctl[1931221]: . done
    Oct 13 00:12:19 pgslave pg_ctl[1931221]: server started
    Oct 13 00:12:19 pgslave systemd[1]: Started PostgreSQL database server.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49

    问题分析

    1.查看数据库日志

    在这里插入图片描述

    2.查看归档配置参数

    参数配置正确,归档目录权限也正确

    postgres=# show archive_command;
                          archive_command                      
    -----------------------------------------------------------
     /usr/bin/lz4 -q -z %p /server/data/pgdb/pg_archive/%f.lz4
    (1 row)
    
    postgres=# show archive_mode;
     archive_mode 
    --------------
     on
    (1 row)
    
    --查看归档目录的权限
    [postgres@pgmaster ~]$ ls -ld /server/data/pgdb/pg_archive
    drwxr-x--- 2 postgres postgres 4214784 Oct 13 13:14 /server/data/pgdb/pg_archive
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15

    3.手动切日志

    手工归档成功,但是未解决,查看状态依然时卡住归档失败的那条wal记录那里

    --手工归档
    top_portal=# select pg_switch_wal();
     pg_switch_wal 
    ---------------
     72/51C4CFD8
    (1 row)
    
    --查主库数据库状态
    [root@pgmaster ~]# systemctl status postgres
    ● postgres.service - PostgreSQL database server
    Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
    Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
    Main PID: 3710970 (postgres)
    Tasks: 53 (limit: 201967)
    Memory: 19.0G
    CGroup: /system.slice/postgres.service
    ├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
    ├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
    ├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
    ├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
    ├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
    ├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
    ├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
    ├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10"  #wal 发送正常
    
    
    --查当前wal_lsn
    top_portal=# select pg_current_wal_lsn();
     pg_current_wal_lsn 
    --------------------
     72/52638F10
    (1 row)
    
    --查当前wal_lsn对应的wal文件
    top_portal=# select pg_walfile_name(pg_current_wal_lsn());
         pg_walfile_name      
    --------------------------
     000000010000007200000052
    (1 row)
    
    --查当前最新检查点,最新检查点之前的wal文件均可以删除
    [postgres@pgmaster ~]$ pg_controldata $PGDATA
    pg_control version number:            1300
    Catalog version number:               202107181
    Database system identifier:           7268852449124462799
    Database cluster state:               in production
    pg_control last modified:             Fri 13 Oct 2023 10:07:35 AM CST  
    Latest checkpoint location:           71/CDD2FF28
    Latest checkpoint's REDO location:    71/CDD28F18
    Latest checkpoint's REDO WAL file:    0000000100000071000000CD
    
    --查报错中的wal文件
    [postgres@pgmaster pg_wal]$ ls -l 000000010000006F00000086
    -rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
    [postgres@pgmaster pg_wal]$ find /server/data/pgdb/pg_archive -name 000000010000006F00000086*
    ls: cannot access '000000010000006F00000086': No such file or directory
    [postgres@pgmaster pg_wal]$ find /server -name 000000010000006F00000086*
    -rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64

    4.检查$PGDATA/pg_wal/archive_status/目录下文件

    [postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
    [postgres@pgmaster archive_status]$ ls -l *.ready
    ls: cannot access '*.ready': No such file or directory
    
    说明不存在需要归档但没归档的文件
    
    • 1
    • 2
    • 3
    • 4
    • 5

    该目录下,ready说明是需要归档但是没归档的,done是归档完成了的

    解决办法

    1.将归档失败的wal文件备份到/home/postgres目录下(生产环境如果磁盘空间允许切记不要rm删除,mv备份到目标位置)
    2.手工归档select pg_switch_wal();
    3.再次查看主备库状态

    --1.将归档失败的wal文件备份到/home/postgres目录下
    [postgres@pgmaster pg_wal]$ mv 000000010000006F00000086 /home/postgres/000000010000006F00000086
    [postgres@pgmaster pg_wal]$ ls -l /home/postgres/000000010000006F00000086
    -rw------- 1 postgres postgres 16777216 Oct 12 21:12 /home/postgres/000000010000006F00000086
    
    --2.手工归档
    postgres=# select pg_switch_wal();
     pg_switch_wal 
    ---------------
     73/7EF502E0
    (1 row)
    
    --3.再次查看主库状态显示正常
    [root@pgmaster data]# systemctl status postgres
    ● postgres.service - PostgreSQL database server
         Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
         Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
        Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
       Main PID: 3710970 (postgres)
          Tasks: 50 (limit: 201967)
         Memory: 26.6G
         CGroup: /system.slice/postgres.service
                 ├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
                 ├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3710995 "postgres: archiver archiving 000000010000007100000035" "" "" "" "" "" "" "" "" ""
                 ├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
                 ├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
                 ├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
                 ├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
                 ├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
                 ├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 73/7F000BD0"
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35

    补充:若$PGDATA/pg_wal/archive_status/目录下存在大量的*.ready文件

    可能的原因分析:如果数据库是突然断电,那么可能arvchive命令没有完全完成,归档目录会存在不完整的文件名称,重启数据库后,会出现归档失败的情况,这个时候,需要去归档目录删除相关归档失败文件,那么归档就会重新归档。
    还未遇到该场景的问题,暂未实验。
    参考链接

  • 相关阅读:
    python项目requirements.txt项目用到哪些库哪些版本
    练手必备!Python编程实战—23个有趣的实战项目带你快速进阶
    并发之固定运行和交替运行方案
    《Python 密码学编程》读书笔记(2)
    图片隐写,盲水印,加密logo
    衍三的硬件笔记之如何选择MOS管
    (免费分享)java基于SSM的进销存管理系统设计与实现
    pytorch lightning最简上手
    独立产品灵感周刊 DecoHack #038 - 纽约市 90 年代的街景长什么样
    大数据之Hive(三)
  • 原文地址:https://blog.csdn.net/qq961573863/article/details/133809496