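I started this blog to push myself to write down my everyday ideas and the results of my thinking. Everything on this blog is original; if you repost it, please credit the author and the source.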

Diagnosing an NBU Backup Error

2008-04-21 00:53:15 / Category: ORACLE

During a routine system check, I found that the daily backup had failed.

 

 

The error message was:

RMAN> backup incremental level 0 database;

Starting backup at 10-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=120 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle - Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting incremental level 0 datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
input datafile fno=00001 name=/dev/vx/rdsk/maindbdg/lv_main00
input datafile fno=00008 name=/opt/oracle/oradata/oradata/bjdb01/users01.dbf
input datafile fno=00039 name=/opt/oracle/oradata/oradata/bjdb01/xdb02.dbf
input datafile fno=00009 name=/opt/oracle/oradata/oradata/bjdb01/xdb01.dbf
input datafile fno=00003 name=/opt/oracle/oradata/oradata/bjdb01/cwmlite01.dbf
input datafile fno=00004 name=/opt/oracle/oradata/oradata/bjdb01/drsys01.dbf
input datafile fno=00006 name=/opt/oracle/oradata/oradata/bjdb01/odm01.dbf
input datafile fno=00007 name=/opt/oracle/oradata/oradata/bjdb01/tools01.dbf
channel ORA_SBT_TAPE_1: starting piece 1 at 10-MAR-08
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 03/10/2008 11:31:12
ORA-19506: failed to create sequential file, name="tpjatl1b_1_1", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  unable to allocate new media for backup, storage unit has none available

Judging from this message, the failure looks like a shortage of space. However, the error reported by later backups changed to:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file "bk_26552_1_648968690", blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSASendData: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.

Judging from this error, the problem is not just a matter of space.
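To compare the two stacks side by side, it helps to pull out just the numbered errors and the text reported by the media manager. The following is a minimal Python sketch, not part of the original diagnosis. The helper name `summarize_rman_stack` and its regular expressions are my own assumptions, based only on the layout of the two stacks quoted above.

```python
import re

# Banner lines that only frame the stack and carry no diagnosis.
BANNERS = {"RMAN-00571", "RMAN-00569"}


def summarize_rman_stack(text):
    """Return (error codes, failing VxBSA call, server status text).

    Hypothetical helper: it assumes the layout of the stacks quoted in
    this post and is not an Oracle or NetBackup API.
    """
    codes = [c for c in re.findall(r"\b((?:RMAN|ORA)-\d{5}):", text)
             if c not in BANNERS]
    call = re.search(r"^\s*(VxBSA\w+): Failed with error", text, re.M)
    status = re.search(r"Server Status:\s*(.+)", text)
    return (codes,
            call.group(1) if call else None,
            status.group(1).strip() if status else None)


first = '''\
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 03/10/2008 11:31:12
ORA-19506: failed to create sequential file, name="tpjatl1b_1_1", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  unable to allocate new media for backup, storage unit has none available
'''

second = '''\
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file "bk_26552_1_648968690", blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSASendData: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
'''

for stack in (first, second):
    print(summarize_rman_stack(stack))
```

Seen this way, the first failure happens while creating the backup piece (`VxBSACreateObject`) and the second while sending data (`VxBSASendData`). Both arrive through ORA-19511, so both originate in the media manager layer rather than in the database.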

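In the graphical interface, jnbSA, many administration options responded very slowly when clicked, and most never returned a result. So I switched to querying from the command line with bpadm, and REPORTPROBLEM returned the following: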

03/11/2008 01:45:04 backupcenter240 bpexpdate  Could not build host list: client hostname could not be found
03/11/2008 02:13:34 backupcenter240 bjdb01  cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:13:48 backupcenter240 bjdb01  backup by oracle on client bjdb01 using policy oracle:  media write error
03/11/2008 02:14:04 backupcenter240 bjdb01  backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:22:58 backupcenter240 bjdb01  cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:23:12 backupcenter240 bjdb01  backup by oracle on client bjdb01 using policy oracle:  media write error
03/11/2008 02:23:19 backupcenter240 bjdb01  suspending further backup attempts for client bjdb01, policy oracle, schedule Cumulative-Inc because it has exceeded the configured number of tries
03/11/2008 02:23:19 backupcenter240 bjdb01  backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:23:20 backupcenter240 -  scheduler exiting - the backup failed to back up the requested files (6)
03/11/2008 09:32:42 backupcenter240 data03  cannot write image to media id 000016, drive index 0, I/O error
03/11/2008 09:32:53 backupcenter240 data03  DOWN'ing drive index 0, it has had at least 3 errors in last 12 hour(s)
03/11/2008 09:32:55 backupcenter240 data03  backup by oracle on client data03 using policy bjdb03-ora:  media write error
03/11/2008 09:33:02 backupcenter240 data03  backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 10:48:34 backupcenter240 data03  media manager terminated during mount of media id 000016, possible media mount timeout
03/11/2008 10:48:36 backupcenter240 data03  media manager terminated by parent process
03/11/2008 10:48:37 backupcenter240 data03  backup by oracle on client data03 using policy bjdb03-ora:  the backup failed to back up the requested files
03/11/2008 10:48:38 backupcenter240 data03  suspending further backup attempts for client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries
03/11/2008 10:48:38 backupcenter240 data03  backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 13:55:03 backupcenter240 bpexpdate  Could not build host list: client hostname could not be found
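A report like this is easier to read once the write errors are tallied per client, media ID and drive. Below is a small Python sketch under stated assumptions: the line layout (date, time, server, client, message) is inferred from the report above, the sample lines are adapted from it, and `count_write_errors` is a hypothetical helper rather than a NetBackup tool.

```python
import re
from collections import Counter

# Layout inferred from the report: date, time, server, client, free-text message.
LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<server>\S+) (?P<client>\S+)\s+(?P<msg>.*)$"
)
MEDIA_ERR = re.compile(r"cannot write image to media id (\w+), drive index (\d+)")


def count_write_errors(report):
    """Count 'cannot write image' messages per (client, media id, drive index)."""
    tally = Counter()
    for line in report.splitlines():
        parsed = LINE.match(line.strip())
        if not parsed:
            continue  # skip wrapped or unrelated lines
        hit = MEDIA_ERR.search(parsed.group("msg"))
        if hit:
            key = (parsed.group("client"), hit.group(1), int(hit.group(2)))
            tally[key] += 1
    return tally


sample = """\
03/11/2008 02:13:34 backupcenter240 bjdb01  cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:22:58 backupcenter240 bjdb01  cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 09:32:42 backupcenter240 data03  cannot write image to media id 000016, drive index 0, I/O error
03/11/2008 09:32:53 backupcenter240 data03  DOWN'ing drive index 0, it has had at least 3 errors in last 12 hour(s)
"""

for key, n in sorted(count_write_errors(sample).items()):
    print(key, n)
```

In the sample, two different tapes (000013 and 000016) fail on the same drive index 0 for two different clients. That pattern points more toward the drive than toward any single tape.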

Looking further into the detailed log, I found a large number of errors:

03/11/2008 18:23:59 backupcenter240 -  cleaning job DB
03/11/2008 18:23:59 backupcenter240 -  all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:23:59 backupcenter240 -  no drives up on storage unit <backupcenter240-hcart-robot-tld-0>
03/11/2008 18:24:00 bjdb01 -  all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:00 backupcenter240 -  no drives up on storage unit <bjdb01-hcart-robot-tld-0>
03/11/2008 18:24:31 backupcenter240 -  all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:31 backupcenter240 -  no drives up on storage unit <unit_99>
03/11/2008 18:24:32 backupcenter240 -  all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:32 backupcenter240 -  no drives up on storage unit <unit_data>
03/11/2008 18:24:32 backupcenter240 data03  skipping backup of client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries

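Judging from this, the robot seems to have failed. If the robot really is at fault, that would also explain why the two backup errors differ. When one tape filled up, the robot tried to load a new tape and failed at that point; the backup running at the time could no longer write, and it reported that no space was available. Later backups had no tape to write to because of the robot failure, so they failed with the message that communication with the NetBackup server had not been initiated.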
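The evidence behind this reading is the repeated pair of messages "all drives are down" and "no drives up on storage unit". As a rough sketch, assuming the message wording shown in the detailed log, the affected robots and storage units can be collected like this. `drive_outage` is a hypothetical helper name, not a NetBackup command.

```python
import re

# Message wording taken from the detailed log quoted in this post.
NO_DRIVES = re.compile(r"no drives up on storage unit <([^>]+)>")
ALL_DOWN = re.compile(
    r"all drives are down for the specified robot number = (\d+), robot type = (\w+)"
)


def drive_outage(log):
    """Return (sorted storage units with no drives up, sorted (robot number, type))."""
    units = sorted(set(NO_DRIVES.findall(log)))
    robots = sorted({(int(num), kind) for num, kind in ALL_DOWN.findall(log)})
    return units, robots


sample = """\
03/11/2008 18:23:59 backupcenter240 -  all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:23:59 backupcenter240 -  no drives up on storage unit <backupcenter240-hcart-robot-tld-0>
03/11/2008 18:24:00 backupcenter240 -  no drives up on storage unit <bjdb01-hcart-robot-tld-0>
03/11/2008 18:24:31 backupcenter240 -  no drives up on storage unit <unit_99>
03/11/2008 18:24:32 backupcenter240 -  no drives up on storage unit <unit_data>
"""

units, robots = drive_outage(sample)
print("robots with all drives down:", robots)
print("storage units without a usable drive:", units)
```

Every storage unit in the sample depends on robot 0 of type TLD, so a single drive or robot fault is enough to stop all of them.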

Continuing with the media report, the summary shows:

Number of ACTIVE media that, as of now:
    There are no ACTIVE media present in the media database

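This further supports the earlier judgment: because of the robot failure, usable tapes cannot be loaded into the drive, so the system has no available media.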

Check the status of the robot and its drive with tpconfig:

Index DriveName              DrivePath                Type    Shared   Status
***** *********              **********               ****    ******   ******
  0   IBMULTRIUM-TD10        /dev/rmt/1cbn            hcart    Yes      DOWN
        TLD(0) Definition       DRIVE=1

Currently defined robotics are:
  TLD(0)     robotic path = /dev/sg/c2t4l1,
             volume database host = backupcenter240

tpconfig reports the drive under robot TLD(0) as DOWN, so the problem looks basically pinned down.
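When the listing has to be checked repeatedly, the Status column can be read programmatically. This is a minimal Python sketch that parses text shaped like the listing above. It assumes drive names contain no spaces and that device paths start with `/`; `drive_statuses` is a hypothetical helper, and how the listing text is captured is left out on purpose.

```python
import re

# One drive row: index, name, device path, type, shared flag, status.
DRIVE_ROW = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<name>\S+)\s+(?P<path>/\S+)\s+"
    r"(?P<type>\S+)\s+(?P<shared>\S+)\s+(?P<status>\S+)\s*$"
)


def drive_statuses(listing):
    """Map drive name to status for every drive row found in the listing."""
    statuses = {}
    for line in listing.splitlines():
        row = DRIVE_ROW.match(line)
        if row:
            statuses[row.group("name")] = row.group("status")
    return statuses


listing = """\
Index DriveName              DrivePath                Type    Shared   Status
***** *********              **********               ****    ******   ******
  0   IBMULTRIUM-TD10        /dev/rmt/1cbn            hcart    Yes      DOWN
        TLD(0) Definition       DRIVE=1
"""

statuses = drive_statuses(listing)
not_up = [name for name, status in statuses.items() if status != "UP"]
print(statuses)
print("drives needing attention:", not_up)
```

The header, the row of asterisks and the `TLD(0) Definition` line do not match the row pattern, so only real drive rows end up in the result.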

Try checking the robot with robtest:

bash-2.03# robtest
Configured robots with local control supporting test utilities:
  TLD(0)     robotic path = /dev/sg/c2t4l1

Robot Selection
---------------
  1)  TLD 0
  2)  none/quit
Enter choice: 1

Robot selected: TLD(0)   robotic path = /dev/sg/c2t4l1

Invoking robotic test utility:
/usr/openv/volmgr/bin/tldtest -r /dev/sg/c2t4l1 -d1 /dev/rmt/1cbn

Opening /dev/sg/c2t4l1
MODE_SENSE complete
Enter tld commands (? returns help information)
?

To exit the utility, type q or Q.

init                      - Initialize element status
initrange <d#|s#|p#|t> [#]- Init element status range
allow                     - Allow media removal
prevent                   - Prevent media removal
extend                    - Extend media access port
retract                   - Retract media access port
mode                      - Mode sense
m <from> <to>             - Move medium
pos <to>                  - Position to drive or slot
s [d|p|t|s [n]] [raw]      - Read element status
inquiry                   - Display vendor and product ID
rezero                    - Rezero unit
inport                    - Ready inport (media access port)
debug                     - Toggle debug mode for this utility
test_ready                - Send a TEST UNIT READY to the device

   <from> <to> specifies drive (d#), slot (s#), media access port (p#),
           or transport (t#)
   <d#|s#|p#|t#> is drive #, slot #, media access port #, or transport #
           [#] is number of elements for d, s, p, or t
    NOTE - drive # is 1 - Number of drives
           slot # is 1 - Number of slots
           media access port # is 1 - Number of media access port elements
           transport # is 1 - Number of transports
   <type> = (d)rive, (s)lot, media access (p)ort, or (t)ransport

unload <drive>           - Issue SCSI unload
   <drive> = d1 or 1, d2 or 2, d3 or 3 ... d648 or 648

inquiry
Inquiry_data: STK     L40             0213
test_ready
Unit is ready
q

Robot Selection
---------------
  1)  TLD 0
  2)  none/quit
Enter choice:

After I issued the test_ready command and waited a while, tpconfig showed the status back to normal:

Index DriveName              DrivePath                Type    Shared   Status
***** *********              **********               ****    ******   ******
  0   IBMULTRIUM-TD10        /dev/rmt/1cbn            hcart    Yes      UP
        TLD(0) Definition       DRIVE=1

Currently defined robotics are:
  TLD(0)     robotic path = /dev/sg/c2t4l1,
             volume database host = backupcenter240

Now try a backup:

$ rman target /

Recovery Manager: Release 9.2.0.4.0 - 64bit Production

Copyright (c) 1995, 2002, Oracle Corporation.  All rights reserved.

connected to target database: BJDB01 (DBID=3255963758)

RMAN> backup current controlfile;

Starting backup at 11-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=19 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle - Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting full datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
including current controlfile in backupset
channel ORA_SBT_TAPE_1: starting piece 1 at 11-MAR-08
channel ORA_SBT_TAPE_1: finished piece 1 at 11-MAR-08
piece handle=ttjb17ur_1_1 comment=API Version 2.0,MMS Version 5.0.0.0
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:04:56
Finished backup at 11-MAR-08

Starting Control File Autobackup at 11-MAR-08
piece handle=c-3255963758-20080311-00 comment=API Version 2.0,MMS Version 5.0.0.0
Finished Control File Autobackup at 11-MAR-08

This time the backup finally succeeded.

Unfortunately, only small backups seem to work. As soon as the backup file is fairly large, the same error as before appears:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file "bk_26552_1_648968690", blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSASendData: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
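The block number in ORA-19502 gives a rough idea of how far the write got before it failed. The short calculation below assumes that `blockno` counts `blocksize`-byte blocks from the start of the backup piece. That is my reading of the message, not something the log states, and the function name is hypothetical.

```python
def bytes_before_failure(blockno, blocksize):
    """Approximate bytes written before the failing block (assumed interpretation)."""
    return blockno * blocksize


# Values taken from the ORA-19502 line: blockno 664577 (blocksize=512).
written = bytes_before_failure(664577, 512)
print(written, "bytes, about", round(written / 2**20, 1), "MiB")
```

Under that assumption the piece failed a few hundred megabytes in, which fits the observation that a small control file backup completes while a large datafile backup does not.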

The backend log also filled up with I/O errors:

03/12/2008 09:42:51 backupcenter240 bjdb01  cannot write image to media id 000016, drive index 0, I/O error
03/12/2008 09:42:51 backupcenter240 bjdb01  FREEZING media id 000016, it has had at least 3 errors in the last 12 hour(s)
03/12/2008 09:43:08 backupcenter240 bjdb01  CLIENT bjdb01  POLICY oracle  SCHED Default-Application-Backup  EXIT STATUS 84 (media write error)
03/12/2008 09:43:08 backupcenter240 bjdb01  backup by oracle on client bjdb01:  media write error
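Both the FREEZING message here and the DOWN'ing message in the 03/11 report are worded as "at least 3 errors in the last 12 hours". The sketch below only illustrates a threshold-in-a-time-window rule as the log words it. It is not NetBackup's implementation, and the function name, the default values and the choice of a sliding window are all assumptions. The timestamps are the three write errors on drive index 0 from the 03/11 report.

```python
from collections import deque
from datetime import datetime, timedelta


def first_threshold_breach(timestamps, limit=3, window=timedelta(hours=12)):
    """Return the time of the error that makes `limit` errors fall inside `window`."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Forget errors that are older than the window relative to this one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= limit:
            return ts
    return None


FMT = "%m/%d/%Y %H:%M:%S"
errors = [
    datetime.strptime(text, FMT)
    for text in (
        "03/11/2008 02:13:34",
        "03/11/2008 02:22:58",
        "03/11/2008 09:32:42",
    )
]

print(first_threshold_breach(errors))
```

With these inputs the third error, at 09:32:42, is the one that crosses the limit, and the report shows the drive being marked DOWN eleven seconds later, at 09:32:53.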

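So this is clearly not just a software problem. The vendor finally confirmed that the read/write head in the tape library had failed, and replacing the part resolved the issue.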

 



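yangtingkun   /   2008-04-22 09:52:38
Yes, equipment with this many mechanical parts tends to break fairly easily.
赵宇   /   2008-04-21 20:04:42
Tape really does not inspire confidence. The last time I checked a tape backup, it could not be read, and an attempt to write to it also failed. An HP engineer confirmed on site that the tape was damaged, even though it had simply been sitting in the tape drive; it went bad all by itself.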
 
