Backup and Recovery
If hardware fails, it can be replaced.
If software is accidentally deleted, it can be re-installed.
If business data is irrecoverably lost, it is gone. Period.
No discussion of database administration topics would be complete without articles regarding backup and recovery. In fact, backup and recovery is at the top of the list of responsibilities that a DBA must thoroughly understand and for which responsibility for successful implementation must be accepted. "Backup strategy" is a bit of a misnomer. "Recovery strategy" is much more accurate nomenclature. Learn the available backup and recovery options and techniques, determine your organization's recovery requirements, and a suitable recovery strategy can be constructed using that information.
Backup and recovery information is included in the availability section because availability requirements will dictate whether cold backups are acceptable or hot backups are required. Moreso, being unable to perform a successful recovery because of improper backup procedures severely impacts availability. In addition, unsuccessful production recovery attempts reduce the confidence that the organization has in the resilience of the core database technology and platform, and damages the user community's perception of the competence of the administration staff.
Performing regular, periodic recovery testing is absolutely critical to insuring recoverability and correction of any recovery difficulties. An organization should not wait until it needs to perform production recovery to understand how recovery works, and to determine if database backups are performed correctly and consistently. The question of "will the hardware fail?" is not the correct question; "when will the hardware fail" is the correct question. If the business data cannot be recovered when a failure occurs, then it is of no true value to the organization.
The author will not re-iterate information readily available in ORACLE documentation and a variety of third-party database administration guides, except where absolutely necessary. Rather, a focus on information not readily available from other sources will be discussed.
Backup Methods:
1) Cold backup. Shutdown the instance, perform an operating system level backup, then restart the instance. A cold backup allows recovery up to that backup if the database is running in noarchivelog mode, and recovery up to the point of failure if the database is running in archivelog mode. If the availability requirements do not span into weekend or night-time hours, then cold backups may be acceptable. If availability needs normally encompass 24x7 activity, then a hot backup solution should be implemented. Perform periodic recovery tests to insure that the backups are consistent and useable.
2) Hot backup. For each tablespace, issue an ALTER TABLESPACE...BEGIN BACKUP, copy the datafiles belonging to that tablespace from the operating system level, then issue an ALTER TABLESPACE...END BACKUP. Then, issue an ALTER SYSTEM SWITCH LOGFILE and an ALTER DATABASE BACKUP CONTROLFILE TO '<path>' REUSE. In addition, an ALTER DATABASE BACKUP CONTROLFILE TO TRACE NORESETLOGS can be issued to write the SQL statements defining the control files to the instance trace information. This is the most reliable hot backup method. Note that the database must be running in archivelog mode to perform this type of hot backup. This method allows recovery up to the point of failure. Perform periodic recovery tests to insure that the backups are consistent and useable.
3) ORACLE's backup utilities. ORACLE supplies backup and recovery utilities paired with different versions of the database software. These utilities can make hot or cold backups easier to perform, but their reliability should be thoroughly tested before production use. Perform periodic recovery tests to insure that the backups are consistent and useable.
4) Third-party backup utilities. A variety of third party organizations provide backup utilities for ORACLE databases. Thoroughly test the reliability of these utilities before production use. Perform periodic recovery tests to insure that the backups are consistent and useable.
5) Full exports. Although full exports can provide a referentially-consistent snapshot of data and objects in an ORACLE database, it should used as a supplement to regular backups rather than a substitute. Full exports are an additional safeguard against logical errors, such as accidental dropping or truncating a table. Note that a full import cannot be used with archive logs to roll forward to the point of failure.
Considerations for REDO and ARCHIVE loggingAn ORACLE database cannot function without redo logs available. It is essential to have at least one mirrored copy of the redo log group configured, even if the redo log files are already mirrored at the hardware level. There are two situations that mandate this requirement:
1) If there is only one copy of the redo log group, then a failure of that logical drive (whether the mechanical failure is a single physical disk drive, an entire RAID array, or a controller or controller channel) will result in downtime. If there are at least two redo log groups on different logical drives (in this case meaning, different physical disks or a different array and array controller), then the database instance continues to operate using the surviving redo log group, and will merely write error messages to the ORACLE alert log.
2) It is possible for the redo logging process to encounter a write error for a particular redo log member in a group. This redo log will be marked as "STALE" in the V$LOG table. If there is only one copy of the redo log group (and the database is running in archivelog mode), then the database will halt until an administrator manually clears the stale redo log. Note that if a hardware-mirrored redo log is marked as "STALE", then both copies of the file on the physically mirrored drives are stale as well, so the halt still occurs. Without this archive log, full recovery from an immediately following media failure involving other database files would not be possible, so it is recommended to perform an immediate hot backup. If there are at least two redo log groups, then the archiver process will choose the copy of the particular redo log that is not marked as "STALE", and archive this redo log. The probability that a write error will occur on more than one ORACLE-mirrored copy of any particular redo log is typically very slim. By ORACLE-mirroring redo logs (in addition to any hardware or O/S level fault-tolerance), recoverability is not impacted and no immediate hot backup is required.
If the ORACLE database in question contains data that the business users would be displeased to have to re-key, then the database should be set to run in archivelog mode. Running in this mode causes redo logs to be copied out to an archive log file before being cleared and re-filled with new transaction information. Archive logging enables recovery up to the point in time of a media failure with no lost data. However, it is necessary to note that archive logs are written by the database but never deleted. When the archive log destination device becomes full, the database will halt and wait for space to be cleared. Even though this is not an "instance crash", it does produce downtime. No further transactions or even logins can occur until the archiver process becomes able to continue archiving redo logs. It is highly recommended that an automatic process be constructed to remove archive logs that have been backed up to reduce this source of potential down-time. Note that the space requirements for archive logging will vary over time. Archive log creation rate is a function of transaction rate, while deletion of archive logs is typically a periodic process (for instance, done nightly after the archive logs have been backed-up). For instance, there may be 10 archive logs generated one day and 1000 archive logs generated the next day, dependent upon the transaction rate. This variability should be taken in consideration when planning space requirements for archive logs to avoid unnecessary production interruptions.
Common incorrect methods for performing database backups1) "Cold" backup with no shutdown. Copying ORACLE datafiles, redo log files, control files, etc., to tape without first shutting down the ORACLE instance will not yield a backup that is useable for recovery.
2) "Hot" backup with no ALTER TABLESPACE...[BEGIN|END] BACKUP statements. Before datafiles can be copied from the operating system level with the instance started and the database open, an ALTER TABLESPACE...BEGIN BACKUP must be issued against the tablespace referencing those datafiles. After issuing this statement, the referenced datafiles are updated by a checkpoint and additional information needed to re-establish consistency in the event that the backed-up datafiles are used in a recovery operation is recorded to the redo logs, thus the datafile can be copied to tape, another location on disk, etc. After copy, an ALTER TABLESPACE...END BACKUP must be issued to bring the referenced datafiles out of backup mode and end recording of the addition redo information. An O/S copy done without issuing these statements will not be useable for recovery. Note that NTBackup and other backup utilities included with the NT operating system do not connect to an ORACLE instance and issue these statements.
3) Backup of open control files by O/S copy. Issue an ALTER DATABASE BACKUP CONTROLFILE TO '<path>' REUSE to create a backup copy of an open control file. An operating system copy of the control files will only be useable for recovery if the database is shutdown. An operating system copy of an open control file will not be useable for recovery.
4) ORACLE backup tools. Do not assume that ORACLE-supplied backup and recovery utilities perform flawlessly. Thoroughly test these tools before using in a production environment. These ORACLE tools do not typically perform ALTER TABLESPACE...[BEGIN|END] BACKUP, ALTER DATABASE BACKUP CONTROLFILE, and ALTER SYSTEM SWITCH LOGFILE statements. Rather, they use specific datafile reading mechanisms specific to the paired version of the ORACLE software. The best test is to perform a recovery test using the backup generated by these tools.
5) Third party backup tools. Do not assume that third party tools work flawlessly. Thoroughly test these tools before using in a production environment. Again, the best test is to perform a recovery test using the backup generated by these tools. Third-party tools that do not perform ALTER TABLESPACE...[BEGIN|END] BACKUP, ALTER DATABASE BACKUP CONTROLFILE, and (preferably) ALTER SYSTEM SWITCH LOGFILE statements should be examined with extreme scrutiny, if considered at all. Verify that the tool issues these statements by checking the V$SQL table after performing a backup using the tool.
6) Restore from a full export, and attempt to roll forward using archive logs. Although a referentially intact restore can be completed from a full export, archive logs cannot be used to roll forward to changes performed after the consistent full export was started.
7) Copy of open redo logs. Copies of open redo logs, regardless of the tool or method used to make the copy, are entirely unusable for recovery. Cold copies of redo logs are useable for recovery, but only with the matching backup of the datafiles and control files. Redo logs should always be mirrored at the ORACLE level, even if they are stored on drive arrays that use hardware mirroring. If there are one or more mirrored copies of the redo log group, then a hot backup of redo logs would not be needed even if it could be used. It is good practice to insure that redo logs are not copied while open, as these files could be restored from tape accidentally, thus overwriting redo logs that can be used for recovery.
Special note: archive logging on NT (bug #381394)This particular NT-specific bug merits special discussion, as most organizations using ORACLE on NT are unaware of it, and it can cripple recovery. This bug is applicable to all versions of ORACLE for NT, releases 7.0.x through 8.0.x. All versions of the "Getting Started" guides for ORACLE on NT make the following suggestion for archive log naming convention specified in the init.ora file: log_archive_format = ARC%S.%T. Using this particular string generates archive logs named "ARC<sequence_number>.<thread_number>", in DOS 8.3 notation. If you are not running ORACLE parallel server, the "%T" will always generate "001" as the thread number. The "%S" will generate the first 5 digits of the sequence number found in the V$LOG table. Please examine the following chart:
Sequence number Archive log name Explanation 101 ARC00101.001 OK. ... ... ... 99999 ARC99999.001 OK 100000 ARC10000.001 OK 100001 ARC10000.001 Overwritten once! 100002 ARC10000.001 Overwritten twice! ... ... ... 100009 ARC10000.001 Overwritten nine times! Once the sequence number exceeds 100000, archive logs are overwritten. Obviously while the sequence number is in the 100000 range, each archive log will be overwritten nine times. Recovery in the event of a media failure will not be possible.
The workaround for this particular bug is to set the archive log naming convention in the following manner: log_archive_format = ARC%s.%T. Using the lower case "s" instructs the archive logger process to generate the sequence number portion of the archive log names as NT long filenames rather than DOS 8.3 filenames. Please examine the following chart:
Sequence number Archive log name Explanation 101 ARC101.001 OK. ... ... ... 99999 ARC99999.001 OK 100000 ARC100000.001 OK 100001 ARC100001.001 OK! 100002 ARC100002.001 OK! ... ... ... 100009 ARC100009.001 OK! Complete recovery would now be possible. (Please note that this bug was discovered while performing a recovery test!) An alternative solution is to watch for the sequence number to approach 100000, shutdown the database, perform a cold backup, then restart the instance and open the database with the RESETLOGS option. However, the first workaround does not generate an unnecessary outage.
Recovery TestingIdeally, recovery testing should be performed on a regular, periodic basis (such as once quarterly), and performed on a different box if at all possible. (Please see the section on spare box testing.) The adage that "practice makes perfect" is unquestionably true in this regard. Practicing recovery verifies that the backup procedures work correctly and that there are no unexpected errors, as well as giving the recovery personnel the necessary knowledge and experience to perform the recovery without having to find and read the manuals first! If the DBA's and system administrators know exactly how to proceed, then they are much more likely to quickly and efficiently complete the recovery in a calm and rational manner without unnecessary panic and mistakes.
One of the best methods to test recovery of a production instance without interruption in service to that instance is to perform what is called an "Incomplete Cancel-based Recovery". This recovery technique uses backups of datafiles, backup control files, and archive logs (which is, incidentally, the components that should be on tape after hot backups), and should ideally be performed using a different non-production server. Recovery is canceled after the last available archive log and before application of redo logs, yielding a database that is consistent and useable but lacks only the changes contained in the current redo logs. If recovery proceeds without errors through application of all available archive logs, then it is a reasonably safe assumption that recovery including application of redo logs will yield successful results as well. The link below provides the exact procedure for performing an incomplete cancel-based recovery. Note that for production failures, complete recovery should be the objective. Incomplete recovery for production failures should only be used if all the components for a successful complete recovery are not available.