Last week a former client experienced a severe hardware failure.
The original incident occurred when a disk subsystem failed. This client uses a very old Solaris system running Solaris 2.6 and much of the hardware is long past its original expected life span.
The database is Oracle 126.96.36.199, and as expected it crashed with the hardware error.
Deciding they could not recover the system themselves, we agreed to to assist them. We were presented with a database with a missing data file – file 204 which we knew to be rlmd01.dbf. The file existed on the disk in the correct location, but Oracle refused to recognize it.
The alert log showed the following errors:
Mon Oct 5 14:04:14 2009 Errors in file /orabin/PROD/comn/admin/bdump/prod_lgwr_16110.trc: ORA-01157: cannot identify/lock data file 204 - see DBWR trace file ORA-01110: data file 204: '/oradata/PROD/data04/rlmd01.dbf'
Checking the UNIX disk we see this:
$ls –al rlmd01.dbf -rw-r--r-- 1 oracle dba 10497461 Oct 5 16:40 rlmd01.dbf
Examining the trace file that accompanied the error, we could that Oracle was really complaining that the data file in question was not a multiple of the logical block size – which is Oracle error 27046.
When we examined the data file on disk we could see that the file size did not match any other data files. This was unexpected as the most of the data files for this system are sized to 10Mb except the heavily used ones. Whereas most of the files had a size of 10493952, this file had a size of 10497461.
Running a tail on the data file suggested that garbage from the mail program had been appended to the file, possibly as a result of the storage hardware failure. The file was now the wrong size and Oracle was refusing to read it.
Our first idea was to restore the individual file from last night’s backup. A full restore operation would have been very time consuming and there was no information on when this system’s backup had last been full tested with a recovery.
The current file was renamed so it would not be over written, and the file from last night was restored from the backup. This attempt met with a swift and unsuccessful conclusion when Oracle reported that the data file was from before the last RESETLOGS operation.
We then discovered that when the database was first restarted, an inexperienced DBA attempted a recovery including the use of the RESETLOGS command. We now had no opportunity to use any of the files from the last backup unless we wanted to chance everything by wiping the whole database and restoring to the unproven backup.
Fortunately UNIX allows the DBA to essentially “saw-off” the end of a data file using the “dd” command.
We guessed we needed the file to be exactly 10Mb like most of the other files of the database. So selecting a default block size of 512, we divided the target of 10493952 bytes by 512 and get a block count of 20496. We then issue the dd command as follows:
$ dd if=rlmd01.dbf.tusc of=rlmd01.dbf bs=512 count=20496
Then the database was started:
SQL> startup mount ORACLE instance started. Total System Global Area 1493008544 bytes Fixed Size 73888 bytes Variable Size 999301120 bytes Database Buffers 491520000 bytes Redo Buffers 2113536 bytes Database mounted. SQL> select * from v$recover_file; no rows selected SQL> alter database datafile 204 online; Database altered. SQL> recover datafile 204; ORA-00283: recovery session canceled due to errors ORA-00264: no recovery required SQL> alter database open; Database altered.
Finally we ran the DBV to make sure there was no corruption:
$ dbv file=rlmd01.dbf blocksize=8192 DBVERIFY: Release 188.8.131.52.0 - Production on Mon Oct 5 18:44:32 2009 (c) Copyright 2000 Oracle Corporation. All rights reserved. DBVERIFY - Verification starting : FILE = rlmd01.dbf DBVERIFY - Verification complete Total Pages Examined : 1280 Total Pages Processed (Data) : 3 Total Pages Failing (Data) : 0 Total Pages Processed (Index): 0 Total Pages Failing (Index): 0 Total Pages Processed (Other): 78 Total Pages Empty : 1199 Total Pages Marked Corrupt : 0 Total Pages Influx : 0
The database is now operation again and the client was saved the extended outage period and considerable risk of attempting a full restore.
I would like to thank Doug Freyburger and Pat McGovern for their assistance in saving this database.