RAID
There is significant confusion in many organizations using ORACLE on NT over when and what RAID technologies should be used. This article will attempt to make this information very clear and understandable.
Types of RAID available with NT-compatible hardwareVendors typically offer the following choices for RAID configurations:
1) RAID0: Disk stripping. RAID0 requires at least two physical disks. Data is read and written across multiple drives, so disk I/O is relatively evenly spread. Writes can occur in a block or streaming manner (similar to non-RAIDed disks) as requested by the operating system. Disk failure results in lost data. Compared to a single disk drive, RAID0 has the following attributes:
- Better read performance
- Better write performance
- Inexpensive in cost
- Not fault-tolerant
- Storage equivalent to sum of physical drive storage in the array
- Readily available from most vendors
2) RAID1: Disk mirroring. RAID1 requires two physical disks. Logical writes are done by physically writing the data to both disks simultaneously, and can typically be done in a block manner or streaming manner, as requested by the operating system. Reads can be done using either disk. In the event of a disk failure, data can still be retrieved and written to the surviving disk. Compared to a single disk drive, RAID1 has the following attributes:- Better read performance
- Similar write performance
- Expensive
- Fault-tolerant
- Storage equivalent to 1/2 the sum of the physical drive storage in the mirrored set.
- Readily available from most vendors
3) RAID5: Disk stripping with parity. RAID5 requires at least three physical disks. On a logical write, a block of data is physically written to disk, parity information is calculated using the block just written plus blocks already existing on disk, then the parity information is written to disk. In RAID5, the parity information is rotated among the physical disks to prevent bottlenecks caused by a dedicated parity disk. Note that writes occur in a block manner regardless of whether the O/S is sending a stream of data to be written or requests to write whole blocks. On a logical read, data is read from multiple disks in a manner very similar to RAID0. In the event of a disk failure, data can be reconstructed on the fly using the parity information. Compared to a single disk drive, RAID5 has the following attributes:- Data is stripped across multiple physical disks and parity data is stripped across storage equivalent to one disk.
- Better read performance
- Poorer write performance
- Inexpensive
- Fault-tolerant
- Storage equivalent to N - 1 times the number of physical drives in the array.
- Readily available from most vendors
4) RAID10 (or RAID0+1): Mirrored stripe sets. RAID10 requires at least 4 physical drives, and combines the performance gains of RAID0 with the fault-tolerance and expense of RAID1. Data is written simultaneously to two mirrored sets of striped disks in blocks or streams. Reads can be performed against either striped set. In the event of a failure of a disk drive in one striped set, data can be written to and read from the surviving striped set. Compared to a single disk drive, RAID10 has the following attributes:- Better read performance
- Better write performance
- Expensive
- Fault-tolerant
- Storage is 1/2 of the sum of the physical drives' storage
- Currently available from only a few vendors (at the time of this writing)
RAID Adapter ConsiderationsRAID adapters, or array controllers, for Intel-based hardware are typically SCSI devices that plug into PCI slots in the server hardware. These devices, like the hardware they plug into, vary quite a bit in capability from vendor to vendor. Most SCSI RAID adapters will support non-RAID, RAID0, RAID1, and RAID5 configurations per channel, and will vary in cache size from no cache up to 16 or even 32 Mb of cache. The cache will typically be used to balance improved read performance (by read-ahead of data) and improved write performance by caching physical writes. Most RAID adapters will have a setting that specifies whether the write cache is set to "write-through" (meaning physically write the information on disk before acknowledging the write to the operating system) or "write-cache" (meaning provide acknowledgement to the operating system after the data has been cached for write but before it has been physically written to disk). ORACLE Coporation recommends setting this parameter to "write-through". If a power outage occurred after the write acknowledgement has been received by the O/S (and the database instance) but before the data was written to disk, then a datafile error would be encountered on the next startup of the instance. There are a few RAID adapters that theoretically overcome this problem by battery backup of the cache contents. In this case, even if the power failed, the contents of the cache that had not been written to disk would be maintained by the battery until the power was restored and the server hardware restarted. If this feature is used to provide better write performance, then it's resilience should be thoroughly tested prior to placing a database server configured with "write-cache" settings into production use. A simple way to verify the reliability of the battery cache backup (as well as normal instance recovery processes) is by repeatedly powering-off the server while the database instance is started and writes are occurring. If this write-cache setting is to be used, verify that it is infallible before placing the database in production use. If the "write-cache" setting does not absolutely guarantee protection from data loss, then do not use this setting. Check with your specific hardware vendor for detailed information regarding the functionality of the selected RAID adapters, and their specific settings.
Workload ConsiderationsAs discussed in the article on O/S tuning, single disk drives and disk arrays have a sustainable workload threshold (ie. transfers/second) that, if exceeded, will result in a non-linear performance degradation (ie. drastic increase in seconds/transfer). Equations are presented below that will approximate the workload of a particular disk array, given the selected RAID configuration, the number of drives, and the workload threshold of the physical disks composing the array. The workload threshold of the physical disks should be obtained from your particular hardware vendor, as this number will vary with model and manufacturer. The typical workload threshold for hot-swappable SCSI disks at the time of this writing is 60-80 transfers/second. Whether the transfers are reads or writes typically has little impact: the physical disk rotates at a constant rate, so the time spent moving data is essentially constant regardless of whether it's being read or written, so this number will typically be provided as transfers/second rather than reads/second and writes/second. Note that this behavior is different for disk arrays, depending upon the RAID configuration selected, as will be seen in the following equations.
1) RAID0: All disks are available for read and write activity. The workload threshold of the array is essentially the sum of the workload thresholds of the individual disks. For consistency, the equation will be presented with read and write activity seperated.- RAID0 array max transfers/sec = (%reads) x (max transfers/sec/disk) x (# disks) + (%writes) x (max transfers/sec/disk) x (# disks)
2) RAID1: Both disks are written simultaneously, but data can be read from either disk. The read workload threshold is essentially twice the write workload threshold. The number of disks will always be 2, of course, but the equation includes that parameter for consistency's sake.
- RAID1 array max transfers/sec = (%reads) x (max transfers/sec/disk) x (#disks) + (%writes) x (max transfers/sec/disk) x (#disks/2)
3) RAID5: All disks are available for logical reads (minus one disk's worth of storage used for parity information), but a logical write requires 1 physical write for the data, two physical reads to retrieve existing data to use in the parity calculation, then another physical write to store the parity block. In other words, a logical write requires four physical I/O's. Recall that these writes also only occur at the block level. (Note that specific vendors may have variations on this algorithm, so check with the vendor of your hardware to verify their exact implementation of RAID5, then modify the given equation to reflect the specific algorithm used if it is different. The equation given below assumes the default algorithm, thus providing the most conservative estimate of workload threshold.)
- RAID5 array max transfers/sec = (%reads) x (max transfers/sec/disk) x (#disks-1) + (%writes) x (max transfers/sec/disk) x (#disks/4)
It should be noted that these calculations are approximations, and that workload information provided by the vendor will typically be the result of laboratory testing. As such, the true workload threshold of a disk array will vary with each installation, the specific types of accesses, etc. However, these approximations will provide reasonably dependable information necessary to identify and alleviate disk-related bottlenecks. In the charts below, a series of sample calculations have been performed to help provide insight into the characteristics of each type of RAID configuration. Assumptions for the first chart include 4 physical disks of size 4Gb each, having a vendor-reported workload threshold of 60 transfers/second each. The second chart examines the same scenario using 6 physical disks.Possible configurations using 4 physical disks:
Configuration Number of disks Available space Max Reads/Sec Max Writes/Sec Single disk 1 4 Gb 60 60 RAID0 4 16 Gb 240 240 RAID1 4 8 Gb 240 (2 arrays) 120 (2 arrays) RAID5 4 12 Gb 180 60 Possible configurations using 6 physical disks:
Configuration Number of disks Available space Max Reads/Sec Max Writes/Sec Single disk 1 4 Gb 60 60 RAID0 6 24 Gb 360 360 RAID1 6 12 Gb 360 (3 arrays) 180 (3 arrays) RAID5 6 20 Gb 300 90 As can be seen from the charts, RAID0 offers good read and write performance, but no fault tolerance. RAID1 offers good read performance, and half as much write performance, but provides fault-tolerance. RAID5 reclaims most of the space lost to RAID1, provides fault-tolerance, offers reasonably good read performance, but poor write performance. (In fact, RAID5 requires 4 disks to regain the same write performance as a single disk). Also, note that streaming logical writes, as well as block-level logical writes, to RAID5 arrays are handled as block-level physical writes. Finally, read or write workload capacity can be increased in any RAID configuration by adding physical disks.
ORACLE database files on RAIDGiven the information regarding the advantages and disadvantages of various RAID configurations, how does this information apply to an ORACLE instance? The discussion below will provide information about how database files are used by an ORACLE instance under OLTP and DSS classifications of workload. In general, it is the author's opinion that fault-tolerant disk configurations should be used to improve availability unless performance requirements and budgetary constraints entirely prohibit it's use. Note that the perspectives presented below are very sensitive to the number of users: if your organization has a 10-20 user OLTP system (and thus, a low throughput requirement), then you may get very acceptable performance with all database files stored on RAID5 arrays. On the other hand, if your organization has a 100 user OLTP system (resulting in a higher throughput requirement), then a different RAID configuration may be absolutely necessary. An initial configuration can be outlined by estimating the number of transactions (based on the number of users), performing adjustments to encompass additional activity (such as hot backups, nightly batch jobs, etc.), then performing the necessary mathematical calculations. It is strongly suggested to compare this information to published benchmarks for validation. (Please see the article in this section on sizing.)
Note that any RAID (or non-RAID) configurations CAN be used; however, these decisions DO impact performance, availability, and expense. Each organization will need to determine it's own availability, performance, and budgetary constraints. Many NT shops favor RAID5 as a default configuration, since it is fault-tolerant and inexpensive compared to other fault-tolerant configurations. From a performance perspective, RAID5 may or may not work well for certain types of database files. The information below should help in making those decisions.
OLTP (On-line transaction processing) workloadsCharacterized by multi-user concurrent INSERTS, UPDATES, and DELETES during normal working hours, plus possibly some mixture of batch jobs nightly. Large SELECTS may generate reports, but the reports will typically be "canned" reports rather than ad-hok queries. The focus, though, is on enabling update activity that completes within an acceptable response time. Ideally, each type of database file would be spread out over it's own private disk subsystem, although grouping certain types of files together (when the number of disks, arrays, and controllers is less than ideal) may yield adequate performance. (Please see the article on Instance tuning for information regarding groupings of database files in an OLTP system.)
1) Redo logs. During update activity, redo logs are written to in a continuous and sequential manner, and are not read under normal circumstances. RAID5 would be the worst choice for performance. Oracle Corporation recommends placing redo logs on single non-RAIDed disk drives, under the assumption that this configuration provides the best overall performance for simple sequential writes. Redo logs should always be multiplexed at the ORACLE software level, so RAID1 provides few additional benefits. Since non-RAID and RAID0 configurations can vary with hardware from different vendors, the organization should contact their hardware vendor to determine whether non-RAIDed disks or RAID0 arrays will yield the best performance for continuous sequential writes. Note that even if redo logs are placed on RAID1 arrays that the redo logs should still be mirrored at the ORACLE level. When the log writer process determines that it does not know whether the contents of a particular redo log are valid, it will mark that redo log as "STALE" in the V$LOG table. If this redo log is the only copy, then it cannot be archived, and will cause a database halt (assuming the database is running in archivelog mode). If redo logs are multiplexed as recommended by Oracle Corporation, then the archiver process will choose a copy of the redo log that is not marked as "STALE", thus generating no interruptions. If the redo logs are mirrored only at the hardware level, then both copies of the redo log are "STALE".
2) Archive logs. As redo logs are filled, archive logs are written to disk one whole file at a time (assuming, of course, that the database is running in archivelog mode), and are not read under normal circumstances. Any RAID or non-RAID configuration could be used, depending upon the performance requirements and size of the redo logs. For instance, if the redo logs are large, then they will become full and be archived less often. If an archive log is likely to be written no more than once per minute, then RAID5 may provide acceptable performance. If RAID5 proves too slow, then a different RAID configuration can be chosen, or the redo logs can simply be made larger. Numerous early ORACLE installations wrote redo logs directly to tape rather than disk, so reasonable sizing of redo logs can unquestionably minimize the write requirements enough to make RAID5 performance acceptable in small to medium volume installations. Note that a fault-tolerant configuration is advisable: if the archive log destination becomes unavailable, the database will halt.
3) Rollback Segments. As modifications are made to the database tables, undo information is written to the buffer cache in memory. These rollback segments are used to to maintain commitment control and read consistency. Rollback segment data is periodically flushed to disk by checkpoints. Consequently, the changes to the rollback segments are also recorded in the redo logs. However, a smaller amount of information is typically written to the rollback segments than to the redo logs, so the write rate is less stringent. A fault-tolerant configuration is advisable, since the database cannot operate without rollback segments, and recovery of common rollback segments will typically require an instance shutdown. If the transaction rate is reasonably small, RAID5 may provide adequate performance. If it does not, then RAID1 (or RAID10) should be considered.
4) User tables and indexes. As updates are performed, these changes are stored in memory. Periodically, a checkpoint will flush the changes to disk. Checkpoints occur under two normal circumstances: a redo log switch occurred, or the time interval for a checkpoint expired. (There are a variety of other situations that trigger a checkpoint. Please check the ORACLE documentation for more detail.) Like redo log switches and generation of archive logs, checkpoints can normally be configured so that they occur approximately once per minute. Recovery can be performed up to the most recent checkpoint, so the interval should not be too large for an OLTP system. If the volume of updated data written to disk at each checkpoint is reasonably small (ie. the transaction rate is not extremely large), then RAID5 may provide acceptable performance. Additionally, analysis should be performed to determine the ratio of reads to writes. Recalling that RAID5 offers reasonably good read performance, if the percentage of reads is much larger than the percentage of writes (for instance, 80% to 20%), then RAID5 may offer acceptable performance for small, medium, and even some large installations. A fault-tolerant configuration is preferable to maximize availability (assuming availability is an objective of the organization), although only failures damaging datafiles for the SYSTEM tablespace (and active rollback segments) require the instance to be shutdown. Disk failures damaging datafiles for non-SYSTEM tablespaces can be recovered with the instance on-line, meaning that only the applications using data in tablespaces impacted by the failure will be unavailable. With this in mind, RAID0 could be considered if RAID5 does not provide the necessary performance. If high availability and high performance on a medium to large system are explicit requirements, then RAID1 or RAID10 should be considered.
5) Temp segments. Sorts too large to be performed in memory are performed on disk. Sort data is written to disk in a block-oriented. Sorts do not normally occur with INSERT/UPDATE/DELETE activity. Rather, SELECTS with ORDER BY or GROUP BY clauses and aggregate functions (ie. operational reports) , index rebuilds, etc., will use TEMP segments only if the sort is too large to perform in memory. Temp segments are good candidates for non-RAIDed drives or RAID0 arrays. Fault-tolerance is not critical: if a drive failure occurs and datafiles for a temp segment are lost, then the temp segment can either be recovered in the normal means (restore from tape and perform a tablespace recovery), or the temp segment can simply be dropped and re-created since there is no permanent data stored in the temp segment. Note that while a temp segment is unavailable, certain reports or index creations may not execute without errors, but update activity will typically not be impacted. With this in mind, RAID1 arrays are a bit unnecessary for temp segments, and should be used for more critical database files. RAID5 will provide adequate performance if the sort area hit ratios are such that very few sorts are performed on disk rather than in memory.
6) Control files. Control files are critical to the instance operation, as they contain the structural information for the database. Control files are updated periodically (at a checkpoint and at structural changes), but the data written to the control files is a very small quantity compared to other database files. Control files, like redo logs, should be multiplexed at the ORACLE software level onto different drives or arrays. Non-RAIDed drives or or any RAID configuration would be acceptable for control files, although most organizations will typically distribute the multiple copies of the control files with the other database files, given that the read and write requirements are so minimal. For control files, maintaining multiple copies in different locations should be favored over any other concern.
7) Software and static files. The ORACLE software, configuration files, etc. are very good candidates for RAID5 arrays. This information is not constantly updated, so the RAID5 write penalty is of little concern. Fault-tolerance is advisable: if the database software (or O/S software) becomes unavailable due to a disk failure, then the database instance will abort. Also, recovery will include restore or re-installation of ORACLE software (and possible operating system software) as well as restore and recovery of the database files. RAID5 provides the necessary fault-tolerance to prevent this all-inclusive recovery, and good read performance for dynamic loading and unloading of executable components at the operating system level.
DSS (Decision Support System) workloadsIn comparison to OLTP systems, DSS or data warehousing systems are characterized by primarily SELECT activity during normal working hours, and batch INSERT, UPDATE, and DELETE activity run on a periodic basis (nightly, weekly, or monthly). There will typically be a large amount of variability in the number of rows accessed by any particular SELECT, and the queries will tend to be of a more ad-hock nature. The number of users will typically be smaller than their ajoining OLTP systems (where the data originates). The focus is on enabling SELECT activity that completes within an acceptable response time, while insuring that the batch update activity still has capacity to complete in it's allowable time window. Note now that there are two areas of performance over which to be concerned: periodic refreshes and ad-hock read activity. The general level directive in this case should be to configure the database such that read-only performed by end users is as good as it can get without rendering refreshes incapable of completion. As with OLTP systems, each type of database file would ideally have it's own private disk subsystem (disks, arrays, and controller channel), but with less than ideal resources certain grouping tend to work well for DSS systems. (Please see the article on Instance tuning for information on these groupings.)
1) Redo logs. Redo logs are only written to while update activity is occurring. In a DSS-oriented system, a significant portion of data entered interactively during the day may loaded into the DSS database during only a few hours. Given this characteristic, redo logging may tend to be more of a bottleneck on periodic refresh processes of a DSS database than on it's ajoining OLTP systems. If nightly loads are taking longer than their allowance, then redo logging should be the first place to look. The same RAID/non-RAID suggestions that apply to redo logging in OLTP also apply with DSS systems. As with OLTP systems, redo logs should always be mirrored at the ORACLE software level, even if they are stored on fault-tolerant disk arrays.
2) Archive logs. Like redo logging, archive logs are only written out during update activity. If the archive log destination appears to be over-loaded with I/O requests, then consider changing the RAID configuration, or simply increase the size of the redo logs. Since there is a large volume of data being entered in a short period of time, it may be very reasonable to make the redo logs for the DSS or data warehouse much larger (10 or more times) than the redo logs used by the OLTP system. A reasonable rule of thumb is to target about one log switch per hour. With this objective met, then the disk configuration and fault-tolerance can be chosen based on the same rules used for OLTP systems.
3) Rollback Segments. Again like redo logging, rollback segments will be highly utilized during the periodic refreshes, and virtually unused during the normal work hours. Use the same logic for determining RAID or non-RAID configurations on the DSS database that would be used for the OLTP systems.
4) User tables and indexes. Writes are done to tablespaces containing data and indexes during periodic refreshes, but during the normal work hours read activity on the table and indexes will typically far exceed the update work performed on a refresh. A fault-tolerant RAID configuration is suggested to sustain availability. However, in most cases the business can still operate if the DSS system is unavailable for several hours due to a disk failure. Information for strategic decisions may not be available, but orders can still be entered. If the DSS has high availability requirements, select a fault-tolerant disk configuration. If RAID5 arrays can sustain the periodic refresh updates, then it is typically a reasonably good choice due to it's good read performance. As seen above, the read and write workload capacities can be adjusted by adding physical drives to the array.
5) Temp segments. In a decision support system or data warehouse, expect temp segment usage to be much greater than what would be found in a transaction system. Recalling that temp segments do not store any permanent data and are not absolutely necessary for recovery, RAID0 may be a good choice. Keep in mind, though, that the loss of a large temp segment due to drive failure may render the DSS unusable (unable to perform sorts to answer large queries) until the failed drives are replaced. If availability requirements are high, then a fault-tolerant solution should be selected, or at least considered. If the percentage of sorts on disk is low, then RAID5 may offer acceptable performance; if this percentage is high, RAID1 or RAID10 may be required.
6) Control files. As with OLTP systems, control files should always be mirrored at the ORACLE software level regardless of any fault-tolerant disk configurations. Since reads and writes to these files are minimal, any disk configuration should be acceptable. Most organizations will typically disperse control files onto different disk arrays and controller cards, along with other database files.
7) Software and static files. Like OLTP systems, these files should be placed on fault-tolerant disk configurations. Since very little write activity is present, these are again good candidates for RAID5.
Taking the above information into consideration, can an organization run an entire ORACLE database instance on a single RAID5 array? The answer is "yes". Will the organization get a good level of fault-tolerance? Again, the answer is "yes". Will the organization get acceptable performance? The answer is "it depends". This dependency includes the type of workload, the number of users, the throughput requirements, and a whole host of other variables. If the organization has an extremely limited budget, then it can always start with a single RAID5 array, perform the necessary analysis to see where improvement is needed, and proceed to correct the deficiencies.