Introduction to Disaster Recovery
=================================

Keeping a business operational increasingly means keeping critical data and
information systems available around the clock.  To compete successfully in the
global marketplace, companies are striving to protect critical information 
systems to help minimise costly business impacts, such as lost sales, decreased
customer satisfaction and reduced employee productivity. 

Even if you configure your systems to incorporate highly available hardware and
software, some events such as natural disasters or power outages may remain 
outside of your control.  Part of ensuring a high-availability system, then, is
planning ahead for a swift recovery from a disaster that you cannot predict or
prevent and that renders a site or building unusable.

One way to protect critical data as part of a location disaster recovery plan 
is to duplicate the most up-to-date data reliably and quickly at a remote 
location.


High Availability Geographic Cluster (HAGEO)
============================================

The High Availability Geographic Cluster (HAGEO) helps keep mission-critical 
systems and applications operational in the event of disasters such as power 
outages, fires, floods and other natural disasters.  This is accomplished by 
eliminating the system and the site as points of failure. 

Designed for high-end e-business, this software enables remote hot backup of
both your data and equipment, eliminating any single point of failure within
your operations.  HAGEO allows you to mirror data and applications in a
different geographic location from your primary operation, keeping the mirrored
data updated in real time.  In the event of an unpredicted power outage, flood,
or other disaster, HAGEO works in conjunction with HACMP or HACMP/ES to
transfer data and workloads to the remote location automatically and
seamlessly.

The key technology by which this is achieved is a geographic mirror.  This
provides either synchronous or asynchronous real-time data mirroring across
unlimited geographic distances.


Geographic Remote Mirror (GeoRM)
================================

For companies with mission-critical e-business systems, but with less demanding
availability requirements, IBM offers an efficient and effective disaster 
recovery solution.  The Geographic Remote Mirror (GeoRM) software utilises the 
same remote mirroring technology as HAGEO but does not provide automatic 
failover in the event of a disaster.  It is suitable for all environments, from
small and medium-sized companies to large enterprises.  Running on AIX, GeoRM
enables you to mirror your important business or transaction data to a remote 
geographic location, thereby diminishing the risk associated with downtime.

Configurations using HAGEO or GeoRM consist of two geographically dispersed
sites, each containing at least one RS/6000 or pSeries server.
Communication links for HAGEO and GeoRM can be point-to-point or
packet-switched networks supporting TCP/IP.  Two separate physical paths are
advised to prevent the link from being a single point of failure.

Performance of a geomirror is based on the bandwidth and latency of these 
communication links.  One means of determining the bandwidth requirements 
would be to use the gmdsizing tool.

Using the gmdsizing tool
========================

gmdsizing is a command used to estimate network bandwidth requirements for
the GeoPrimary networks used to support geomirror traffic.  It monitors disk 
utilisation over a given period of time and prints a report.  This report can
then be used to help determine the bandwidth needs.

Command syntax
==============

gmdsizing -i interval -t time {[-p pv [-p pv]...] | [-v vg [-v vg]...]}
          [-f filename ] [-T] [-A] [-w] [-D char] [-V] [-h]
where:

-i interval	Interval at which disk activity is checked.

-t time 	Time period the command should measure. This defaults to 
		seconds. The minimum number of seconds is 10. The value can 
		be appended with the following letters to change the unit of 
		time:
			d number of days
			h number of hours
			m number of minutes
			s number of seconds
	
		For example, to check over 5 days, you could use 
		5d, 120h, or 7200m.

-p pv		Names of physical disks to monitor.

-v vg 		Names of volume groups to monitor.

-f filename 	File in which to write the report; the default is stdout.

-T		Add time scale to the output.

-A		Aggregate the output.

-w		Collect data for write activities only.

-D char		Use 'char' as delimiter in the output.

-V 		Verbose mode. Adds summary at end of the report.

-h 		Print the Help message.


Description
===========

Use the gmdsizing utility to evaluate current disk usage on your system. The 
utility supplies information that will help you choose networks with the 
necessary bandwidth for use as HAGEO networks. This utility works like UNIX 
performance monitoring programs (for example, iostat, vmstat, or netstat).
 
When you run the command, you provide it either a set of physical disks or a 
set of volume groups to monitor, the amount of time you want to measure for, 
and the measurement interval.  When the command completes or is interrupted, it
produces a report on disk usage over the time the command was running.  If you 
specify the -v option, the gmdsizing utility converts the volume group argument
into physical volume names.  


Example
=======

To measure the disk utilisation on the datavg volume group for 24 hours at 5
minute intervals, producing a report with a summary in the file /tmp/datavg.out
use the command:

	gmdsizing -i 5m -t 24h -f /tmp/datavg.out -v datavg -V 


Sample Report
=============

The gmdsizing command summary report includes the total number of disk blocks
read and written, together with the minimum and maximum values observed in any
one interval.

For an example summary report, to show the traffic on the app1vg volume group 
over 12 seconds at intervals of 3 seconds, run the command:

	gmdsizing -i 3 -t 12 -v app1vg -V

The resulting report looks similar to the following:

	Disk        Reads      Writes

	hdisk4      2944       0
	===================================================================
	Disk        Reads      Writes

	hdisk4      2816       0
	===================================================================
	Disk        Reads      Writes

	hdisk4      3072       0
	===================================================================
	Disk        Reads      Writes

	hdisk4      3072       0
	===================================================================
	            block      total           minimum         maximum
	Disk        size       read    write   read    write   read    write
	hdisk4      512        11904   0       2816    0       3072    0
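The -V summary simply aggregates the per-interval figures.  As a quick sanity
check, the totals and extremes can be recomputed from the four interval
reports above (a minimal sketch; the numbers are taken from the sample
output):

```python
# Recompute the -V summary statistics from the per-interval read counts
# shown in the sample report (blocks read by hdisk4 in each 3 s interval).
intervals = [2944, 2816, 3072, 3072]

total = sum(intervals)      # 'total read' column
minimum = min(intervals)    # 'minimum read' column
maximum = max(intervals)    # 'maximum read' column

print(total, minimum, maximum)   # 11904 2816 3072
```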


When and what to measure
========================

It is important to measure disk utilisation at a representative time.  If the
system is running a nine to five operation, measuring disk utilisation in the
middle of the night will not give much information.  Similarly, measuring over
a very short period of time is likely to give unrepresentative data.  You need
to understand how a workload varies over time.  This typically includes:

- Busy periods during a day, week, or month
- Year-end and quarter-end processing

Do not forget to take into account any overnight batch processing activities.
Batch processing is often much more write intensive than online work.

Having decided when you are going to run gmdsizing to collect representative
data, it is important to choose the right parameters to make the data easier to
analyse.

It is better to run gmdsizing over a longer period of time than a shorter one.
This is more likely to observe peaks and troughs that monitoring for a shorter
period might miss.  When specifying the observation interval, however, it is
better to keep it larger rather than smaller.  One line of data will be
written per disk per interval so a very large amount of data will be collected 
if you have a small interval and/or a large number of disks.  Remember that the
more data you collect, the more data you will have to process.
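To get a feel for the volume of output before running the command, the number
of report lines can be estimated from the disk count, the interval, and the
duration (a rough sketch; the nine disks and the 5-minute/24-hour parameters
are taken from the examples in this chapter):

```python
# One report line is written per disk per interval, so output volume grows
# quickly with a small interval or a large number of disks.
disks = 9                      # e.g. hdisk3 through hdisk11
interval_seconds = 5 * 60      # -i 5m
duration_seconds = 24 * 3600   # -t 24h

report_lines = disks * (duration_seconds // interval_seconds)
print(report_lines)            # 2592 data lines for this run
```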

A typical use of gmdsizing might be:

	gmdsizing -V -i 5m -t 24h -v datavg -f /tmp/datavg.gmdsizing.out

Always remember to specify an output file with the -f flag.  Always use the -V 
flag.  This will provide a summary at the end of the output file.  This summary
is very useful for a quick analysis of the data.  Finally, remember to keep a
record of the command that you used to collect the data.  

If the system you are measuring already has local LVM mirroring enabled, 
remember to take this into account when selecting what to measure.  gmdsizing 
is measuring write activity.  If you have two-copy mirroring enabled for the 
logical volumes, you will be doing twice as many writes at the physical disk 
level.  One logical write from the application generates two physical writes to
the disk devices.  Rather than selecting an entire volume group to be monitored,
select just those disks that contain one copy of the mirrors.  If your volume 
group is laid out such that you cannot easily do this, then remember that you 
are potentially recording twice as much write activity as you are actually 
generating from the application perspective and take this into account when
analysing the data.  


Understanding the output
========================

Look at the summary report first.  The summary will look something like this:

	            block      total            minimum         maximum
	Disk        size       read     write   read    write   read    write

	hdisk3      512        172000   13760   0       0       65      198
	hdisk4      512        19060    6760    11      65      100     122
	hdisk5      512        210050   18820   0       0       1900    650
	hdisk6      512        13290    8930    20      15      1761    984
	hdisk7      512        172090   19760   0       0       65      198
	hdisk8      512        19060    17220   11      65      100     122
	hdisk9      512        210050   18218   0       0       1900    650
	hdisk10     512        13290    18006   20      15      1761    984
	hdisk11     512        16753    8526    18      89      988     480

For the purposes of sizing the networks, we are only interested in write
traffic, but the read columns are useful to help determine the read:write
ratio for the workload.  Knowing how long the disk activity was measured for
(the parameter passed with the -t flag) allows determination of an average
write rate.  All data reported by gmdsizing is given in disk blocks.  To
convert to bytes, multiply the sum of all the 'total write' values by the
block size for the device.  This value may then be divided by the total time
over which we were measuring.  From the data above,

  	130000 blocks		(sum of the total write values for all disks)
  	x  512 bytes/block 	(from the block size values)
	------
      66560000 bytes		(total volume of data written)

The measurement period used to collect the above data was 30 minutes.  So,

      66560000 / 1800 seconds = 36977 bytes/second

Remember that this is an average, and it very much assumes that data is written
at a constant rate.  This is unlikely.  To get a figure for how accurate this
average is, compare this with the theoretical 'worst case' scenario.  This is
the maximum volume of data written in one measurement interval (defined by the
-i flag).  Dividing this figure by the measurement interval gives us a worst 
case rate.  In this example:

	4388 blocks
       x 512 bytes / block
       -----
     2246656 bytes

     2246656 / 60 seconds = 37444 bytes / second
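The same arithmetic can be expressed as a short script (a sketch, assuming
the 512-byte block size, 30-minute measurement period, and 60-second interval
used in the example above):

```python
BLOCK_SIZE = 512                 # bytes per block, from the summary report

# Average rate: total blocks written across all disks over the whole run.
total_write_blocks = 130000      # sum of the 'total write' column
measurement_seconds = 30 * 60
average_rate = total_write_blocks * BLOCK_SIZE // measurement_seconds
print(average_rate)              # 36977 bytes/second

# Worst case: the busiest single measurement interval.
worst_interval_blocks = 4388     # sum of the 'maximum write' column
interval_seconds = 60
worst_rate = worst_interval_blocks * BLOCK_SIZE // interval_seconds
print(worst_rate)                # 37444 bytes/second
```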

Compare these two values.  If they are relatively close (as in this example),
the disk activity, and hence the network bandwidth requirement, is fairly
uniform.  These values can probably be used to get a good estimate of the
network requirements.  If the values differ widely, as is more likely (a
'worst case' value between 7 and 10 times the average is not unusual), a more
detailed analysis of the gmdsizing data will be needed to determine the
bandwidth requirements.

Plotting the gmdsizing data against time on a graph will allow you to
approximate a value by inspection.  If a more formal analysis of the data is
needed, use statistical techniques.  Linear regression is a mathematical
method that unambiguously determines the best straight line through a field
of data points, using the least squares method.  Most calculators,
spreadsheets, and graphing software packages include a linear regression
function.
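As an illustration, a least-squares fit can also be computed directly from
(interval, blocks written) pairs without a spreadsheet (a minimal sketch; the
sample points are invented for illustration, not taken from a real report):

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Interval number against blocks written in that interval (invented data).
samples = [(1, 2900), (2, 2810), (3, 3110), (4, 3060)]
slope, intercept = least_squares_line(samples)
print(slope, intercept)   # 78.0 2775.0
```

A positive slope indicates that write activity is trending upwards over the
measurement period, which is worth allowing for when sizing the links.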

More complex analysis of the data may be performed using other mathematical 
techniques as time or inclination allows.

The network bandwidth determined by these methods does not take networking 
overheads into account.  Allowing a 20% overhead for the networking protocols
should be sufficient for the majority of networks.
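For example, applying that 20% allowance to the worst-case rate computed
earlier (a sketch; the 37444 bytes/second figure comes from the example
above):

```python
# Add the suggested 20% protocol overhead to the measured write rate.
measured_rate = 37444                 # bytes/second, worst case from the example
required_bandwidth = measured_rate * 1.20
print(round(required_bandwidth))      # 44933 bytes/second, roughly 359 kbit/s
```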
