Introduction to Disaster Recovery
=================================

Keeping a business operational increasingly means keeping critical data and information systems available around the clock. To compete successfully in the global marketplace, companies are striving to protect critical information systems to help minimise costly business impacts, such as lost sales, decreased customer satisfaction and reduced employee productivity.

Even if you configure your systems to incorporate highly available hardware and software, some events, such as natural disasters or power outages, may remain outside of your control. Part of ensuring a high-availability system, then, is planning ahead for a swift recovery from a disaster you cannot predict or prevent that leaves a site or building unusable. One way to protect critical data as part of a location disaster recovery plan is to duplicate the most up-to-date data reliably and quickly at a remote location.

High Availability Geographic Cluster (HAGEO)
============================================

The High Availability Geographic Cluster (HAGEO) helps keep mission-critical systems and applications operational in the event of disasters such as power outages, fires, floods and other natural disasters. It accomplishes this by eliminating both the system and the site as points of failure. Designed for the high-end e-business, this software enables remote hot backup of both your data and your equipment, removing any single point of failure from your operations.

HAGEO allows you to mirror data and applications in a different geographic location from your primary operation, keeping the mirrored data updated in real time. In the event of an unpredicted power outage, flood, or other disaster, HAGEO works in conjunction with HACMP or HACMP/ES to transfer data and workloads to the remote location automatically and seamlessly. The key technology by which this is achieved is the geographic mirror.
This provides either synchronous or asynchronous real-time data mirroring across unlimited geographic distances.

Geographic Remote Mirror (GeoRM)
================================

For companies with mission-critical e-business systems, but with less demanding availability requirements, IBM offers an efficient and effective disaster recovery solution. The Geographic Remote Mirror (GeoRM) software utilises the same remote mirroring technology as HAGEO but does not provide automatic failover in the event of a disaster. It is suitable for all environments, from small and medium-sized companies to large enterprises. Running on AIX, GeoRM enables you to mirror your important business or transaction data to a remote geographic location, thereby diminishing the risk associated with downtime.

Configurations using HAGEO or GeoRM consist of two geographically dispersed sites, each containing at least one RS/6000 or pSeries server. Communication links for HAGEO and GeoRM can be point-to-point or packet-switched networks supporting TCP/IP. Two separate physical paths are advised to prevent the link from becoming a single point of failure. Performance of a geomirror depends on the bandwidth and latency of these communication links. One means of determining the bandwidth requirements is to use the gmdsizing tool.

Using the gmdsizing tool
========================

gmdsizing is a command used to estimate network bandwidth requirements for the GeoPrimary networks used to support geomirror traffic. It monitors disk utilisation over a given period of time and prints a report. This report can then be used to help determine the bandwidth needs.

Command syntax
==============

    gmdsizing -i interval -t time {[-p pv [-p pv]...] | [-v vg [-v vg]...]}
              [-f filename] [-T] [-A] [-w] [-D char] [-V] [-h]

where:

    -i interval   Interval at which disk activity is checked.
    -t time       Time period the command should measure. This defaults to
                  seconds. The minimum number of seconds is 10.
                  The value can be appended with the following letters to
                  change the unit of time:
                      d   number of days
                      h   number of hours
                      m   number of minutes
                      s   number of seconds
                  For example, to check over 5 days, you could use 5d,
                  120h, or 7200m.
    -p pv         Names of physical disks to monitor.
    -v vg         Names of volume groups to monitor.
    -f filename   File in which to write the report; the default is stdout.
    -T            Add a time scale to the output.
    -A            Aggregate the output.
    -w            Collect data for write activities only.
    -D char       Use 'char' as the delimiter in the output.
    -V            Verbose mode. Adds a summary at the end of the report.
    -h            Print the help message.

Description
===========

Use the gmdsizing utility to evaluate current disk usage on your system. The utility supplies information that will help you choose networks with the necessary bandwidth for use as HAGEO networks. This utility works like UNIX performance monitoring programs (for example iostat, vmstat, or netstat). When you run the command, you provide it either a set of physical disks or a set of volume groups to monitor, the amount of time you want to measure for, and the measurement interval. When the command completes or is interrupted, it produces a report on disk usage over the time the command was running. If you specify the -v option, the gmdsizing utility converts the volume group argument into physical volume names.

Example
=======

To measure the disk utilisation on the datavg volume group for 24 hours at 5-minute intervals, producing a report with a summary in the file /tmp/datavg.out, use the command:

    gmdsizing -i 5m -t 24h -f /tmp/datavg.out -v datavg -V

Sample Report
=============

The gmdsizing command summary report includes the total number of disk blocks read and written, as well as the maximum and minimum values per interval.
For an example summary report showing the traffic on the app1vg volume group over 12 seconds at intervals of 3 seconds, run the command:

    gmdsizing -i 3 -t 12 -v app1vg -V

The resulting report looks similar to the following:

    Disk            Reads           Writes
    hdisk4          2944            0
    ===================================================================
    Disk            Reads           Writes
    hdisk4          2816            0
    ===================================================================
    Disk            Reads           Writes
    hdisk4          3072            0
    ===================================================================
    Disk            Reads           Writes
    hdisk4          3072            0
    ===================================================================
            block       total           minimum         maximum
    Disk    size     read    write    read    write   read    write
    hdisk4  512      11904   0        2816    0       3072    0

When and what to measure
========================

It is important to measure disk utilisation at a representative time. If the system is running a nine-to-five operation, measuring disk utilisation in the middle of the night will not give much information. Similarly, measuring over a very short period of time is likely to give unrepresentative data.

You need to understand how a workload varies over time. This typically includes:

- Busy periods during a day, week or month
- Year-end / quarter-end processing

Do not forget to take into account any overnight batch processing activities. Batch processing is often much more write intensive than online work.

Having decided when you are going to run gmdsizing to collect representative data, it is important to choose the right parameters to make the data easier to analyse. It is better to run gmdsizing over a longer period of time than a shorter one, as this is more likely to observe peaks and troughs that monitoring for a shorter period might miss. When specifying the observation interval, however, it is better to keep this larger rather than smaller.
One line of data will be written per disk per interval, so a very large amount of data will be collected if you have a small interval and/or a large number of disks. Remember that the more data you collect, the more data you will have to process. A typical use of gmdsizing might be:

    gmdsizing -V -i 5m -t 24h -v datavg -f /tmp/datavg.gmdsizing.out

Always remember to specify an output file with the -f flag, and always use the -V flag. This will provide a summary at the end of the output file, which is very useful for a quick analysis of the data. Finally, remember to keep a record of the command that you used to collect the data.

If the system you are measuring already has local LVM mirroring enabled, remember to take this into account when selecting what to measure. gmdsizing measures write activity at the physical disk level. If you have two-copy mirroring enabled for the logical volumes, you will be doing twice as many writes at the physical disk level: one logical write from the application generates two physical writes to the disk devices. Rather than selecting an entire volume group to be monitored, select just those disks that contain one copy of the mirrors. If your volume group is laid out such that you cannot easily do this, then remember that you are potentially recording twice as much write activity as you are actually generating from the application perspective, and take this into account when analysing the data.

Understanding the output
========================

Look at the summary report first.
The summary will look something like this:

             block      total           minimum         maximum
    Disk     size    read     write   read   write   read    write
    hdisk3   512     172000   13760   0      0       65      198
    hdisk4   512     19060    6760    11     65      100     122
    hdisk5   512     210050   18820   0      0       1900    650
    hdisk6   512     13290    8930    20     15      1761    984
    hdisk7   512     172090   19760   0      0       65      198
    hdisk8   512     19060    17220   11     65      100     122
    hdisk9   512     210050   18218   0      0       1900    650
    hdisk10  512     13290    18006   20     15      1761    984
    hdisk11  512     16753    8526    18     89      988     480

For the purposes of sizing the networks, we are only interested in write traffic, but the read columns are useful for determining the read:write ratio of the workload. Knowing how long the disk activity was measured for (the parameter passed with the -t flag) allows an average write rate to be determined.

All data reported by gmdsizing is given in disk blocks. To convert to bytes, multiply the sum of all the 'total write' values by the block size for the device. This value may then be divided by the total time over which we were measuring. From the data above:

      130000 blocks (sum of the total write values for all disks)
    x    512 bytes/block (from the block size values)
    --------
    66560000 bytes (total volume of data written)

The measurement period used to collect the above data was 30 minutes. So:

    66560000 bytes / 1800 seconds = 36977 bytes/second

Remember that this is an average, and it very much assumes that data is written at a constant rate. This is unlikely. To get a figure for how accurate this average is, compare it with the theoretical 'worst case' scenario. This is the maximum volume of data written in one measurement interval (defined by the -i flag). Dividing this figure by the measurement interval gives a worst-case rate. In this example, with a measurement interval of 60 seconds:

        4388 blocks (sum of the maximum write values for all disks)
    x    512 bytes/block
    --------
     2246656 bytes

    2246656 bytes / 60 seconds = 37444 bytes/second

Compare these two values. If they are relatively close (as in this example), the disk activity, and hence the network bandwidth requirements, are fairly uniform.
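The average and worst-case arithmetic above can be sketched as a short script. The figures are taken directly from the sample summary report, along with the 30-minute measurement period and 60-second interval stated in the text:

```python
BLOCK_SIZE = 512          # bytes per disk block, from the summary report
PERIOD = 30 * 60          # total measurement period in seconds (-t)
INTERVAL = 60             # measurement interval in seconds (-i)

# 'total write' and 'maximum write' columns from the sample summary report
total_writes = [13760, 6760, 18820, 8930, 19760, 17220, 18218, 18006, 8526]
max_writes = [198, 122, 650, 984, 198, 122, 650, 984, 480]

# Average rate: all blocks written, converted to bytes, over the whole period
avg_rate = sum(total_writes) * BLOCK_SIZE // PERIOD

# Worst case: the busiest interval's blocks, converted, over one interval
worst_rate = sum(max_writes) * BLOCK_SIZE // INTERVAL

print(avg_rate, worst_rate)   # 36977 37444
```

The two rates land within a few percent of each other, which is what the text means by "fairly uniform" disk activity.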
These values can probably be used to obtain a good estimate of the network requirements. If the values differ widely, as is more likely (a 'worst case' value between 7 and 10 times the average is not unusual), a more detailed analysis of the gmdsizing data will be needed to determine the bandwidth requirements.

Plotting the gmdsizing data against time on a graph will allow you to approximate a value by inspection. If a more formal analysis of the data is needed, this will require statistical analysis. Linear regression is a mathematical function that unambiguously determines the best straight line through a field of data points, using the least squares method to fit the line. Most calculators, spreadsheets and graphing software packages include a linear regression function. More complex analysis of the data may be performed using other mathematical techniques as time or inclination allows.

The network bandwidth determined by these methods does not take networking overheads into account. Allowing a 20% overhead for the networking protocols should be sufficient for the majority of networks.
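As a sketch of the statistical approach, the least-squares fit mentioned above can be computed directly. The per-interval data points here are hypothetical, not taken from the sample report; the 20% protocol overhead from the text is applied to the fitted rate:

```python
def linreg(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical gmdsizing-derived data: elapsed seconds vs. cumulative
# bytes written (per-interval block counts x block size, summed)
times = [60, 120, 180, 240, 300]
cumulative_bytes = [2.1e6, 4.4e6, 6.2e6, 8.9e6, 10.8e6]

rate, _ = linreg(times, cumulative_bytes)   # best-fit bytes/second
rate_with_overhead = rate * 1.2             # allow 20% protocol overhead
```

The slope of the fitted line is the sustained write rate in bytes per second; scaling it by 1.2 gives a bandwidth figure that accounts for the networking overheads discussed above.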