• 记录近期修复Dataguard的过程


    参考文档:
    Bug 8656212 - Alert log with message "Warning: recovery process XXXX cannot use async I/O" (Doc ID 8656212.8)
    https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/V-MANAGED_STANDBY.html

    RDBMS 11.2.0.4 的主备库。
    发生网络或者存储方面的故障后,DG出现一些问题。


    问题1 
    主库RAC,备库RAC,主备库不同步,主库上不再显示LNS服务进程相关的信息。也就是主库不再发送日志到备库。

    解决方法:
    重启备库RAC,重启后,主库的alert log中提示'LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2'
    随后,继续传输到备库,备库media recover 后正常。这个比较奇怪,什么原因,不是很清楚。

    1. Beginning log switch checkpoint up to RBA [0x21db2.2.10], SCN: 6255905417984
    2. Thread 1 advanced to log sequence 138674 (LGWR switch)
    3. Current log# 1 seq# 138674 mem# 0: +DATA_ABC/prod/onlinelog/group_1.296.123456
    4. Current log# 1 seq# 138674 mem# 1: +RECO_DEF/prod/onlinelog/group_1.273.123456
    5. Mon Nov 21 09:48:08 2022
    6. ******************************************************************
    7. LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
    8. ******************************************************************
    9. LNS: Standby redo logfile selected for thread 1 sequence 138674 for destination LOG_ARCHIVE_DEST_2
    10. Mon Nov 21 09:48:20 2022
    11. Archived Log entry 500862 added for thread 1 sequence 138673 ID 0x14184331 dest 1:
    12. Mon Nov 21 09:50:08 2022
    13. FAL[server, ARC0]: FAL archive failed, see trace file.
    14. ARCH: FAL archive failed. Archiver continuing
    15. ORACLE Instance PROD1 - Archival Error. Archiver continuing.
    16. Mon Nov 21 09:51:04 2022

    问题2 
    主库RAC,备库RAC ,主备库不同步,主库上有部分日志没有发送过去,逐个手工copy到备库注册后,可以media recover。但是发现主库上切换了日志后,几乎备库接收不到。
    重启备库的MRP进程后,提示'ORA-27090: 无法为异步磁盘 I/O 保留内核资源 '等告警,如下:

    1. Mon Nov 21 10:24:32 2022
    2. Errors in file /u01/ABCDG/oracle/diag/rdbms/ABCDG/ab/trace/ab_pr0m_26593.trc:
    3. ORA-27090: 无法为异步磁盘 I/O 保留内核资源
    4. Additional information: 3
    5. Additional information: 128
    6. Additional information: 166549176
    7. Warning: recovery process PR0M cannot use async I/O
    8. Mon Nov 21 10:24:32 2022
    9. Errors in file /u01/ABCDG/oracle/diag/rdbms/ABCDG/ab/trace/ab_pr0n_26595.trc:
    10. ORA-27090: 无法为异步磁盘 I/O 保留内核资源
    11. Additional information: 3
    12. Additional information: 128
    13. Additional information: 166549176
    14. Warning: recovery process PR0N cannot use async I/O
    15. Mon Nov 21 10:24:32 2022

    继续查看对应的trc文件,如下:(可以看到KCBR: Number of read descriptors = 32)

    1. ******************* End of process map dump ************
    2. ----- Process Resource Limits -----
    3. ***************** Dumping Resource Limits(s/h) *****************
    4. core file size 0 KB/UNLIMITED
    5. data seg size UNLIMITED/UNLIMITED
    6. file size UNLIMITED/UNLIMITED
    7. pending signals 2010 KB/2010 KB
    8. max locked memory 238 GB/238 GB
    9. max memory size UNLIMITED/UNLIMITED
    10. open files 64 KB/64 KB
    11. POSIX message queues 800 KB/800 KB
    12. stack size 32 MB/UNLIMITED
    13. cpu time UNLIMITED/UNLIMITED
    14. max user processes 16 KB/16 KB
    15. virtual memory UNLIMITED/UNLIMITED
    16. file locks UNLIMITED/UNLIMITED
    17. ***************** End of Resource Limits Dump ******************
    18. ----- Java Stack -----
    19. ----- VKTM Time Drifts Circular Buffer -----
    20. Warning: recovery process PR0M cannot use async I/O
    21. KCBR: Number of read descriptors = 32
    22. *** 2022-11-21 14:10:50.276
    23. KCBR: Media recovery blocks read (SYNC) = 180651
    24. KCBR: Influx buffers flushed = 10145 times
    25. KCBR: Redo cache copies/changes = 1100946/1100943

    查看归档日志情况,发现MRP0的状态是WAIT_FOR_LOG

    1. SYS@ab>select process,status,thread#,sequence# from v$managed_standby order by 3,1
    2. 2 ;
    3. PROCESS STATUS THREAD# SEQUENCE#
    4. ------------------ ------------------------ ---------- ----------
    5. RFS IDLE 0 0
    6. RFS IDLE 0 0
    7. RFS IDLE 0 0
    8. RFS IDLE 0 0
    9. RFS IDLE 0 0
    10. RFS IDLE 0 0
    11. ARCH CLOSING 1 429441
    12. ARCH CLOSING 1 418157
    13. ARCH CLOSING 1 429440
    14. ARCH CLOSING 1 429437
    15. RFS IDLE 1 429442
    16. PROCESS STATUS THREAD# SEQUENCE#
    17. ------------------ ------------------------ ---------- ----------
    18. MRP0 WAIT_FOR_LOG 2 381566
    19. 12 rows selected.

    根据官方文档中对v$managed_standby的解释:
    WAIT_FOR_LOG - Process is waiting for the archived redo log to be completed  (进程等待归档redo完成)
    所以,就发生了归档日志一直传不过来,或者归档日志一直不应用。

    解决方法:
    根据MOS文档Bug 8656212 - Alert log with message "Warning: recovery process XXXX cannot use async I/O" (Doc ID 8656212.8)
    设置alter system set "_media_recovery_read_batch" = 64 sid ='*' scope = spfile;

    以下为MOS中的描述,从MOS中的描述,可以看到在某些情况下同步和异步IO会转换 
    Description
    Alert log can show the next message during recovery as the recovery processes
    may not use ASYNC IO:
     
    "Warning: recovery process XXXX cannot use async I/O"
     
    Note that to confirm that async I/O is being used and in the case of 
    recovery in a standby database, look for the following trace line in 
    the MRP0 trace file after managed recovery has been started and stopped 
    (the trace is printed upon completion of a recovery session):
     
    Example:
     
    Read rate (ASYNC): 11070353Kb in 123.79s => 87.33 Mb/sec
     
    If sync I/O then SYNC will be displayed instead of ASYNC.
     
    Workaround
     
    Set _media_recovery_read_batch to 64 or 100    


    设置完毕后,重启数据库,打开MRP进程,归档日志的传输和应用正常,查询v$managed_standby结果为APPLYING_LOG


    问题3 
    主库RAC,备库RAC,主库的节点1可以传输归档日志到备库,节点2开始报以下错误。节点2无法传送日志到备库。同时在主库上ping备库,是没有问题的。

    1. Error 12514 received logging on to the standby
    2. PING[ARC1]: Heartbeat failed to connect to standby 'mno'. Error is 12514.
    3. Mon Nov 21 10:20:53 2022
    4. Error 1017 received logging on to the standby
    5. ------------------------------------------------------------
    6. Check that the primary and standby are using a password file
    7. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    8. and that the SYS password is same in the password files.
    9. returning error ORA-16191
    10. ------------------------------------------------------------
    11. PING[ARC1]: Heartbeat failed to connect to standby 'mno'. Error is 16191.
    12. Mon Nov 21 10:21:54 2022
    13. Error 1017 received logging on to the standby
    14. ------------------------------------------------------------

    重启MRP进程后,也报异步IO方面的错误。

    1. Errors in file /u01/AAADG/db/diag/rdbms/mno/mno1/trace/mno1_pr0t_5157.trc:
    2. ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
    3. Additional information: 3
    4. Additional information: 128
    5. Additional information: 166549176
    6. Warning: recovery process PR0T cannot use async I/O
    7. Mon Nov 21 11:27:07 2022
    8. Errors in file /u01/AAADG/db/diag/rdbms/mno/mno1/trace/mno1_pr0u_5159.trc:
    9. ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
    10. Additional information: 3
    11. Additional information: 128
    12. Additional information: 166549176
    13. Warning: recovery process PR0U cannot use async I/O

    通过set "_media_recovery_read_batch" = 64 后解决异步IO的问题,随后启动数据库和MRP进程。alert 中提示如下:

    通过以下alert log中查看,貌似密码文件问题(比较奇怪的是,数据库运行了很久了,重启后出现这个问题)

    1. ------------------------------------------------------------
    2. Check that the primary and standby are using a password file
    3. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    4. and that the SYS password is same in the password files.
    5. returning error ORA-16191
    6. ------------------------------------------------------------
    7. Suppressing further error logging of LOG_ARCHIVE_DEST_2.
    8. Mon Nov 21 14:31:49 2022
    9. Thread 2 advanced to log sequence 1420 (LGWR switch)
    10. Current log# 4 seq# 1420 mem# 0: +DATA/prod/redo04.log
    11. Mon Nov 21 14:31:49 2022
    12. Archived Log entry 4772 added for thread 2 sequence 1419 ID 0x1da8a7e5 dest 1:
    13. Mon Nov 21 14:31:50 2022
    14. Error 1017 received logging on to the standby
    15. ------------------------------------------------------------
    16. Check that the primary and standby are using a password file
    17. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    18. and that the SYS password is same in the password files.
    19. returning error ORA-16191
    20. ------------------------------------------------------------
    21. Suppressing further error logging of LOG_ARCHIVE_DEST_2.
    22. FAL[server, ARC3]: FAL archive failed, see trace file.
    23. ARCH: FAL archive failed. Archiver continuing
    24. ORACLE Instance prod2 - Archival Error. Archiver continuing.
    25. ^C

    从arc3进程中看,貌似节点2登录备库拒绝 

    1. [oracle@AAAdb2 trace]$ ls -lrt *arc3*
    2. Initial buffer sizes: read 1024K, overflow 832K, change 805K
    3. OCISessionBegin failed -1
    4. *** 2022-11-21 00:21:56.100
    5. .. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
    6. '
    7. OCISessionBegin failed. Error -1
    8. .. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
    9. '
    10. *** 2022-11-21 00:21:57.136
    11. OCISessionBegin failed. Error -1
    12. .. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
    13. '
    14. OCISessionBegin failed. Error -1
    15. .. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
    16. '
    17. *** 2022-11-21 00:21:57.158 4329 krsh.c
    18. Error 1017 received logging on to the standby
    19. ------------------------------------------------------------
    20. Check that the primary and standby are using a password fi

    查看节点1和节点2上的密码文件,发现节点2上只有一个密码文件,随后,将节点1上的密码文件copy到节点1上。

    1. [oracle@AAAdb2 dbs]$ ls -lrt orapw*
    2. -rw-r----- 1 oracle oinstall 1536 Mar 2 2022 orapwprod2 --- 随后修改的这个文件
    3. [oracle@AAAdb2 dbs]$
    4. [oracle@AAAdb1 dbs]$ ls -lrt orapw*
    5. -rw-r----- 1 oracle oinstall 1536 Mar 2 2022 orapwprod1
    6. -rw-r----- 1 oracle oinstall 1536 Mar 3 2022 orapwprod2
    7. [oracle@AAAdb1 dbs]$

    密码文件拷贝后,alert log中立刻显示可以发送日志了

    1. Check that the primary and standby are using a password file
    2. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    3. and that the SYS password is same in the password files.
    4. returning error ORA-16191
    5. ------------------------------------------------------------
    6. Mon Nov 21 15:05:01 2022
    7. Error 1017 received logging on to the standby
    8. ------------------------------------------------------------
    9. Check that the primary and standby are using a password file
    10. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    11. and that the SYS password is same in the password files.
    12. returning error ORA-16191
    13. ------------------------------------------------------------
    14. Mon Nov 21 15:06:02 2022
    15. Error 1017 received logging on to the standby
    16. ------------------------------------------------------------
    17. Check that the primary and standby are using a password file
    18. and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
    19. and that the SYS password is same in the password files.
    20. returning error ORA-16191
    21. ------------------------------------------------------------
    22. Mon Nov 21 15:07:05 2022
    23. Thread 2 advanced to log sequence 1421 (LGWR switch)
    24. Current log# 3 seq# 1421 mem# 0: +DATA/prod/redo03.log
    25. Mon Nov 21 15:07:05 2022
    26. LNS: Standby redo logfile selected for thread 2 sequence 1421 for destination LOG_ARCHIVE_DEST_2
    27. Mon Nov 21 15:07:05 2022
    28. Archived Log entry 4774 added for thread 2 sequence 1420 ID 0x1da8a7e5 dest 1:
    29. Mon Nov 21 15:07:05 2022
    30. ARC3: Standby redo logfile selected for thread 2 sequence 1420 for destination LOG_ARCHIVE_DEST_2

    总结:
    前2个问题比较奇怪,通过业务了解,确实是底层的存储出了一些问题。
    最后一个密码问题,也比较奇怪,密码不一致,运行了很久才暴露。

    END 

    --20221122 add

    关于LNS进程消失问题,今天看了一下日志,应该是在网络和存储故障的时候,log_archive_dest_2对应的进程被kill掉,然后相关的LNS进程也消失(或不再传输归档日志到备库)

    日志信息如下:可以看到timeout,以及Hung IO 在log_archive_dest_2 

    1. TNS-00505: Operation timed out
    2. nt secondary err code: 0
    3. nt OS err code: 0
    4. Client address: <unknown>
    5. Sun Nov 20 15:09:42 2022
    6. WARN: ARC2: Terminating pid 339978 hung on an I/O operation
    7. Killing 1 processes with pids 339978 (Process by index) in order to remove hung processes. Requested by OS process 180454 on instance 1
    8. ARC2: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_2
    9. ARC2: Detected ARCH process failure
    10. ARC2: STARTING ARCH PROCESSES
    11. Sun Nov 20 15:09:44 2022
    12. ARC1 started with pid=36, OS id=361767
    13. ARC1: Archival started
    14. ARC2: STARTING ARCH PROCESSES COMPLETE
    15. Reclaiming FAL entry from dead process [pid 339978]
    16. Sun Nov 20 15:09:44 2022
    17. ARC0: Becoming the heartbeat ARCH
    18. Sun Nov 20 15:10:07 2022
    1. Error 12170 received logging on to the standby
    2. Error 12170 for archive log file 2 to 'ERPDG'
    3. Errors in file /u01/PROD/db/tech_st/11.2.0/admin/PROD_erpdb/diag/rdbms/prod/PROD1/trace/PROD1_nsa2_300418.trc:
    4. ORA-12170: TNS:Connect timeout occurred
    5. Sun Nov 20 15:16:44 2022
    6. WARN: ARC2: Terminating pid 354694 hung on an I/O operation
    7. Killing 1 processes with pids 354694 (Process by index) in order to remove hung processes. Requested by OS process 180454 on instance 1
    8. ARC2: Detected ARCH process failure
    9. ARC2: STARTING ARCH PROCESSES
    10. Sun Nov 20 15:16:47 2022
    11. ARC0 started with pid=61, OS id=368828
    12. ARC0: Archival started
    13. ARC2: STARTING ARCH PROCESSES COMPLETE
    14. Reclaiming FAL entry from dead process [pid 354694]
    15. Sun Nov 20 15:16:47 2022
    16. ARC3: Becoming the heartbeat ARCH
    17. Sun Nov 20 15:18:48 2022

    更加详细的trc文件,显示有网络方面的timeout 信息 。

    1. NSA2: initializing for LGWR communication
    2. NSA2: connecting to KSR channel
    3. Success
    4. NSA2: subscribing to KSR channel
    5. Success
    6. *** 2022-11-20 14:14:25.845 1335 krsw.c
    7. NSA2: initialized successfully ASYNC=1
    8. Destination is specified with ASYNC=20480
    9. *** 2022-11-20 15:10:07.549
    10. OCIServerAttach failed -1
    11. .. Detailed OCI error val is 12170 and errmsg is 'ORA-12170: TNS:Connect timeout occurred
    12. '
    13. *** 2022-11-20 15:12:07.666
    14. OCIServerAttach failed -1
    15. .. Detailed OCI error val is 12170 and errmsg is 'ORA-12170: TNS:Connect timeout occurred
    16. '
    17. *** 2022-11-20 15:14:07.780
    18. OCIServerAttach failed -1
    19. .. Detailed OCI error val is 12170 and errmsg is 'ORA-12170: TNS:Connect timeout occurred
    20. '
    21. *** 2022-11-20 15:16:07.881
    22. OCIServerAttach failed -1
    23. .. Detailed OCI error val is 12170 and errmsg is 'ORA-12170: TNS:Connect timeout occurred
    24. '
    25. *** 2022-11-20 15:16:07.881 4346 krsh.c
    26. Error 12170 received logging on to the standby
    27. *** 2022-11-20 15:16:07.881 947 krsu.c
    28. Error 12170 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ERPDG'
    29. Error 12170 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'ERPDG'
    30. *** 2022-11-20 15:16:07.881 4346 krsh.c
    31. Error 12170 for archive log file 2 to 'ERPDG'
    32. *** 2022-11-20 15:16:07.881 2969 krsi.c
    33. krsi_dst_fail: dest:2 err:12170 force:0 blast:1
    34. ORA-12170: TNS:Connect timeout occurred
    35. Destination is specified with ASYNC=20480
    36. *** 2022-11-20 15:36:49.680
    37. OCIServerAttach failed -1
    38. .. Detailed OCI error val is 12170 and errmsg is 'ORA-12170: TNS:Connect timeout occurred
    39. *** 2022-11-20 15:42:50.017 4346 krsh.c
    40. Error 12170 received logging on to the standby
    41. *** 2022-11-20 15:42:50.017 947 krsu.c
    42. Error 12170 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'ERPDG'
    43. Error 12170 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'ERPDG'
    44. *** 2022-11-20 15:42:50.017 4346 krsh.c
    45. Error 12170 for archive log file 5 to 'ERPDG'
    46. *** 2022-11-20 15:42:50.017 2969 krsi.c
    47. krsi_dst_fail: dest:2 err:12170 force:0 blast:1
    48. ORA-12170: TNS:Connect timeout occurred
    49. Destination is specified with ASYNC=20480

    END 

  • 相关阅读:
    月薪9K和年薪30W的职位,有什么区别?
    PostgreSQL 内核可观测性体系
    Kafka概述
    C# 使用Parallel去执行并行下载
    数字图像处理(九)双边滤波
    分析股票怎么进行量化交易?
    区块链的概念和特征
    python rb读取文件 base64加密 byte.decode解密,base64解密
    2309C++nlohmann数格示例2
    跨语言调用C#代码的新方式-DllExport
  • 原文地址:https://blog.csdn.net/xxzhaobb/article/details/127967833