For DBAs running production databases on professional-grade storage systems, corruption is rare. Indeed, EMC arrays include additional technology to actively prevent silent data corruption and bit rot.
However, corruption does still occur for a number of reasons, including hardware failure, software error and, sometimes, user error. When it does, knowing how to identify the affected objects, and in some cases how to rescue usable data, can be an important time saver for the production Oracle DBA.
In the following post we take a step-by-step approach to isolating corruption, verifying it, and then rescuing what data we can. The following was tested on Oracle 10.2.0.2 on a Windows platform.
One of the occupational hazards of being a production DBA is the proverbial 2am call when the production Oracle database crashes.
Oracle has steadily gotten better at self-correcting the issues that cause crashes, and in a well maintained environment with properly sized servers, UPS power, solid backups and Change Control, such outages are mercifully rare.
When they do occur, however, they quickly expose hitherto unforeseen gaps in the production support plan and test the professionalism of all concerned.
The following is a summary of a recent production outage: the symptoms, the causes and the resolutions. Although Oracle's Metalink system and Oracle Support Services assisted in identifying the solution, a great deal of work was still required from the production DBA staff, so some of these notes may prove useful to others, either as study or, worst case, as verification for anyone facing the same scenario.
Last week a former client experienced a severe hardware failure.
The original incident occurred when a disk subsystem failed. This client uses a very old Solaris system running Solaris 2.6 and much of the hardware is long past its original expected life span.
The database crashed, as expected, with the hardware error.
When they decided they could not recover the system themselves, we agreed to assist. We were presented with a database with a missing data file – file 204, which we knew to be rlmd01.dbf. The file existed on disk in the correct location, but Oracle refused to recognize it.
The alert log showed the following errors:
Mon Oct 5 14:04:14 2009
Errors in file /orabin/PROD/comn/admin/bdump/prod_lgwr_16110.trc:
ORA-01157: cannot identify/lock data file 204 - see DBWR trace file
ORA-01110: data file 204: '/oradata/PROD/data04/rlmd01.dbf'
Checking the UNIX disk, we see this:
$ ls -al rlmd01.dbf
-rw-r--r-- 1 oracle dba 10497461 Oct 5 16:40 rlmd01.dbf
Examining the trace file that accompanied the error, we could see that Oracle was really complaining that the data file in question was not a multiple of the logical block size – Oracle error ORA-27046.
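The arithmetic behind ORA-27046 can be checked by hand. A minimal sketch, assuming a db_block_size of 8192 (a common default; verify your own with "show parameter db_block_size" in SQL*Plus):

```shell
# Sanity check, not the exact procedure from the incident: an intact
# datafile's size should divide evenly by db_block_size (Oracle adds one
# extra block-sized OS header, so the total remains an exact multiple).
BLOCK_SIZE=8192           # assumed db_block_size
FILE_SIZE=10497461        # the size ls -al reported for rlmd01.dbf
REMAINDER=$(( FILE_SIZE % BLOCK_SIZE ))
if [ "$REMAINDER" -ne 0 ]; then
    echo "rlmd01.dbf: $REMAINDER stray bytes - file is truncated or corrupt"
fi
```

Here 10497461 mod 8192 leaves 3509 stray bytes, which is consistent with a file truncated mid-block by the disk failure rather than one Oracle merely could not lock.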
I put this article together back in 1999, after trying to debug problems with a Beckman Coulter LIMS system running against an Oracle database on VMS.
The Log Miner feature makes this work obsolete today, but the article might still be of interest to DBAs who want to look under the covers.
Disassembling the Redo Log