Preface
In our previous newsletters, we presented several parts detailing the processes and methods required to define, characterize, and develop ITSM Service Continuity best practices. Those parts dealt directly with Disaster Recovery Planning, including the preparation of a Business Impact Analysis.
Introduction
This newsletter for November/December 2003 is the fourth and final part of the presentation on developing an effective ITSM Service Continuity Disaster Recovery Plan.
What is ITSM Service Continuity?
In review, ITSM Service Continuity typically includes the following major elements:
Business Impact Analysis (First Newsletter)
Disaster Recovery Planning (Second and Third Newsletter)
Information Technology Infrastructure for Service Continuity (This Newsletter)
Business impact analysis is the initial and critical step of any disaster recovery effort, since its findings provide the input to the subsequent disaster recovery planning effort. Disaster recovery, in turn, is a main and critical component of ITSM Service Continuity.
Disaster Recovery Resource Requirements
In review, DRP resource requirements include the following major elements, based on the business impact analysis findings and defined at the application system level of granularity:
1) Enter the estimated maximum amount of technology required by this application system for normal operation
Amount of Processor, Memory, Disk Space (GB), Bandwidth, etc.
Number of Devices: Servers, Workstations, Disk, CD, Tapes, Printers, etc.
Number of Communications Devices: Routers, Hubs, etc.
Physical and Logical Configuration and Communication of Network.
Any other Special Considerations
2) List all databases/files by ID that are required to restart this application system after a total disaster. Include filename/ID, whether backups are stored onsite or offsite, and the frequency of the backup
3) List all actions or steps to restart this application system after a total disaster.
Disaster Recovery Business Requirements
In review, DRP business requirements include the following major elements, based on the business impact analysis findings and defined at the application system level of granularity:
1) Enter the Customer, Department, Date, Application System Name and Number (if applicable)
2) Enter the Disaster Recovery Priority assigned for this application and whether it is agreed by both Customer and IT. List the reasons if there is disagreement.
3) List the Business Functions supported by this application.
4) Operating impact if this application is lost.
5) Legal and regulatory impact (if applicable) if this application is lost.
6) Financial impact - Cost to the agency or organization, or loss of profits, if this application is lost for 1 hour, 1 day, 2 days, and 1 week. Describe how this was calculated.
7) Intangible losses if this application is lost.
8) What are the important time cycles for this application (if any)?
9) Could you use alternative methods to replace this application for a temporary period of time? If so, for how long?
10) Would the loss of this application affect Customers? If so, then how and to what extent?
11) If a disaster caused a loss of data entered over the past day, two days, etc., how far back could that data be obtained and reentered? If it could be recovered this way, would this be done, and how long would it take?
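Item 6 above asks for the financial impact at four outage durations and for a description of how it was calculated. The following is a minimal sketch of one possible calculation, assuming a simple linear model (a one-time cost plus a flat loss per hour). The function name, parameters, and figures are hypothetical and only illustrate the tabulation; a real estimate would reflect the time cycles and intangible losses captured in the other items.

```python
# Hypothetical linear model: impact = one-time cost + hourly loss * outage hours.
# The four durations are the ones named in item 6 of the questionnaire.
OUTAGE_HOURS = {"1 hour": 1, "1 day": 24, "2 days": 48, "1 week": 168}


def financial_impact(hourly_loss, one_time_cost=0.0):
    """Return the estimated loss for each outage duration."""
    return {
        label: one_time_cost + hourly_loss * hours
        for label, hours in OUTAGE_HOURS.items()
    }


if __name__ == "__main__":
    # Illustrative figures only: 2,000 lost per hour plus a 5,000 fixed cost.
    for label, loss in financial_impact(2000.0, 5000.0).items():
        print(f"{label:>7}: {loss:,.0f}")
```

Recording the model alongside the figures makes it easier for Customer and IT to review the numbers and agree on the Disaster Recovery Priority.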
ITSM Service Continuity Methods and Best Practices
In review, the disaster recovery planning methodology consists of a number of disaster recovery planning processes. The planned disaster recovery effort, which is performed after the business impact analysis, follows these processes, summarized below as the 14 key ITSM Service Continuity Information Areas:
1) Overview
2) Application Inventory and Criticality Ratings
3) Disaster Recovery Teams
4) Off-Site Backups
5) Critical Application System Resource Requirements
6) Alternate Processing Strategies and Inventories Required
7) Operating Procedures
8) Critical Application Systems Recovery
9) Vendor Contact Information
10) Support Considerations
11) Facility, Office and Reconstruction
12) Training
13) Testing
14) Maintenance and Review
RL Consulting IT Service Management Tools and Workshops
RL Information Consulting Announces an ITSM Self Assessment Solution Package
» Documented Proprietary Methodology
» Proprietary Excel Spreadsheet
» GAP Analysis
» Criticality Matrix
» Evaluation Checklist
» Orientation Presentation
Utilizing SolutionMethod proprietary questions and evaluation processes, RL Consulting employs 3 automated tools: the ITSM Assessment Worksheet, the ITSM ROI Worksheet, and the ITSM Implementation Readiness Worksheet.
Education and Training - 1, 2, and 3 Day ITSM Workshops that offer a broad range of training beginning with Management level, through Strategy and Planning, to Implementation.
IT Infrastructure and Service Continuity
There are many conflicting disaster recovery design objectives when considering an IT infrastructure for Service Continuity. Consequently, an individual solution is always a compromise, and there is no single solution or recommendation that fits all.
The decision criteria are the cost of the disaster recovery solution, disaster coverage and residual risk, speed of recovery, and completeness of recovery and data integrity. Between any two of these criteria, there is a certain conflict and therefore some trade-off must be found.
As part of the Service Continuity effort, the existing IT infrastructure is reviewed and evaluated, together with currently available and proven technology.
Recommendations will evolve from this analysis to arrive at the desired solution, which will implement and support the selected disaster recovery strategy in a cost-effective manner.
This section provides some background related to the disaster recovery design and IT infrastructure.
1) Data Backup and Recovery
Information is critical to the survival of any business. Of all IT resources, data is the most important. Other resources, such as processing power, software, disk storage, and building facilities are all ultimately replaceable, but much data is not. Data is also the most volatile and complex of all IT resources.
This complexity and volatility of data makes it the most difficult resource to manage during recovery. Whereas the relatively static nature of hardware and building facilities enables sites to be ready before recovery is necessary, the volatile nature of data means it must be managed as an ongoing process. Either data is kept current at the recovery site or it must be made current as part of the recovery process.
Data can be divided into data managed by a database management system (DBMS) and other data that is not. DBMS data is more complex in its structures and requires special consideration for backup. Consequently, most database management systems provide utilities for this purpose. Non-DBMS managed data can be even more challenging for disaster recovery.
Any disaster recovery plan should assume some loss of data. This data loss can be planned or unplanned. Planned data loss is inherent in the chosen disaster recovery strategy. There is also the possibility that data will be lost inadvertently through error, either in the design of the recovery process or its execution.
Generally, data is recovered by restoring a copy of the data taken at some previous time and then applying any necessary updates to it. This approach assumes that the updates made to the prime copy of the data are repeatable such that the equivalent updates can be made to the remote copy at some later time. In a database management system this is achieved by logging all updates to the log data set. This method protects the data up to the point of the latest safe log data set.
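The restore-then-reapply approach described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dictionary stands in for the data and each log record is a hypothetical (sequence number, key, value) tuple, with a value of None marking a delete. It is not how any particular DBMS implements its recovery utilities.

```python
def recover(backup, log, backup_seq):
    """Restore a backup copy, then reapply the logged updates taken after it.

    backup     -- dict holding the data as of the backup
    log        -- list of (sequence_number, key, value) update records
    backup_seq -- sequence number of the last update already in the backup
    """
    data = dict(backup)  # restore the copy taken at some previous time
    for seq, key, value in sorted(log, key=lambda record: record[0]):
        if seq <= backup_seq:
            continue  # this update is already reflected in the backup
        if value is None:
            data.pop(key, None)  # a logged delete
        else:
            data[key] = value  # a logged insert or update
    return data
```

Recovery reaches only as far as the last log record that safely arrived: if the log passed to recover() stops at record 3, updates 4 and later are lost, which is the planned data loss inherent in the chosen strategy.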
When a database management system is not used, other methods of repeating these updates must be employed. In some cases it may be possible to rerun the original transactions, or to capture information on the updates in another way, such as at the workstation or on data entry forms. It may be possible to rebuild the actual data from other sources.
If none of the above techniques is available, periodic backups alone will have to be used. This may be acceptable if the data is very static, if the updates occur at regular and predictable intervals, or if some level of data loss is acceptable. In all other cases, the result will be an unacceptable data loss.
When reviewing the backup and logging needs of data, it is important to consider whether the data can be recreated, whether it is volatile, whether the updates are predictable, whether the data is important, what the acceptable window is for recovery, and how much data loss is acceptable before deciding upon the method of backup, logging, and recovery.
2) Disaster Recovery Concepts
It is desirable that the time between a disaster and the beginning of production at the recovery site is short. This is possible if the second site is ready at all times, with all required hardware installed, and with all data at a very current level. The degree to which this goal can be achieved depends on the recovery concept implemented at the recovery site:
Cold backup - This is a recovery site that is equipped with an IT infrastructure, such as a raised floor, air conditioning, and network connections, but without IT equipment installed.
Warm backup - This is a recovery site that is operational and available for takeover after some delay. The delay may be caused by the time it takes to restore data.
Hot backup - This is a recovery site that is operational. Some or all application data is on-line, such that the time required to prepare the recovery site for takeover is much reduced.
To be useful, the above definitions must be applied to the disaster recovery strategy adopted by the organization. On their own, they do not accurately describe a given disaster recovery strategy. A precise description of the readiness of a recovery site has to consider the application systems. The main aspects are the backup method and the techniques used to manage backup data.
The readiness level of an application system may be:
No provision is made for disaster recovery.
Periodic backup - The installation will at certain times take a consistent copy, which allows recovery to that point, and send it to an off-site location.
Semi-Roll-Forward - In addition to periodic backups, update logs are also sent to the off-site location. Transport may be physical or electronic. Recovery will be to the last log data set received.
Roll-Forward - A shadow copy of the data is maintained at the recovery site. Update logs are periodically applied to the shadow copy through recovery utilities. Transmission may be physical or electronic.
Real-Time-Roll-Forward - Similar to roll-forward, except that updates are transmitted and applied at the same time they are being logged in the production site. This near real time transmission and application of log data would not impact transaction response time at the production site.
Real-Time-Remote-Update - This is the capability to update both the primary and shadow copy of data, prior to sending transaction response or completing a transaction.
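The readiness levels above differ mainly in how much recent data can be lost. The sketch below is a rough comparison under stated assumptions: the worst-case loss is taken to be the interval between shipments of backups or logs, a small transmission lag for Real-Time-Roll-Forward, and zero for Real-Time-Remote-Update. The names and the model are hypothetical simplifications, and physical transport time is ignored.

```python
from enum import Enum


class Readiness(Enum):
    NONE = "No provision"
    PERIODIC_BACKUP = "Periodic backup"
    SEMI_ROLL_FORWARD = "Semi-Roll-Forward"
    ROLL_FORWARD = "Roll-Forward"
    REAL_TIME_ROLL_FORWARD = "Real-Time-Roll-Forward"
    REAL_TIME_REMOTE_UPDATE = "Real-Time-Remote-Update"


def worst_case_loss_hours(level, backup_interval_h=None, log_interval_h=None, lag_h=0.0):
    """Rough upper bound on hours of updates lost; None means unbounded."""
    if level is Readiness.NONE:
        return None  # nothing is kept off-site
    if level is Readiness.PERIODIC_BACKUP:
        return backup_interval_h  # everything since the last backup sent off-site
    if level in (Readiness.SEMI_ROLL_FORWARD, Readiness.ROLL_FORWARD):
        return log_interval_h  # everything since the last log data set received
    if level is Readiness.REAL_TIME_ROLL_FORWARD:
        return lag_h  # updates logged but not yet transmitted
    return 0.0  # both copies are updated before the transaction completes
```

In this model Semi-Roll-Forward and Roll-Forward share the same exposure. What separates them is recovery time, because the shadow copy in Roll-Forward already has most logs applied.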
3) Interconnection Technology
The primary and recovery site need not necessarily be interconnected for disaster recovery. Data on tape, for instance, may be regularly carried to the recovery site. Still, in order to provide the best possible data protection as well as the quickest possible takeover in the case of a disaster, data must be transmitted off-site as fast and frequently as possible. This is best achieved by using high-bandwidth interconnection technology. Apart from cost, distance is a limiting factor here.
Interconnection technology available today has limited distance capability. Emerging technology such as NAS and SAN may be employed and may extend this capability in the future, but the highest bandwidth interconnection technology will always imply a certain distance limitation.
With the recovery site not interconnected, and to facilitate disaster recovery, backup data must be physically transported to the recovery site. This method of data transportation is well suited to very large amounts of data. It is also suitable for any information stored on paper, as there will always be a need to regularly ship documents to the recovery site.
Disadvantages are that there is a great potential for data loss in case of a disaster; the data is "unsafe" as long as the media is queued for transport. Also, third parties may be involved and this introduces potential management and security issues. At some point tape volumes must be tracked and recycled.
With a network connection to the recovery site (NAS and SAN), the network allows some data transmission and user access to the recovery site. In most cases this will not eliminate the need for physical data transport. The network connection provides a means for moving data to the recovery site as soon as it is made transferable and avoids the disadvantages of physical transportation.
Remote tape has the advantage of placing backup data in a safe state without an intermediate period of being transferable. This can be done with a transmission bandwidth in excess of a network solution. It facilitates the integration of in-house backup and disaster recovery backup.
Remote disk may be used to maintain backup data in a safe and usable state. It allows remote copies of critical data to be maintained at a very current level, providing warm or even hot backup.
4) Distance from Primary Site
If for example a fire is the only kind of disaster to worry about, the recovery site can be as close as two machine rooms in the same building, isolated from each other by a fireproof wall. However, disasters may affect more than a single machine room or building.
5) Distance and Disaster Scope
The disaster scope may be such that a whole geographical area is affected. Therefore, a greater distance between IT sites results in greater security against any widespread disaster. Greater distance, however, has its price in terms of interconnection, relocation effort, etc. With currently available technology, it is neither easy nor cheap to interconnect two sites over hundreds of kilometers at a bandwidth high enough to keep large amounts of data at both sites fully synchronized.
A moderate distance may give less protection against wide-spread disasters, but has advantages such as:
High interconnection bandwidth - In this context, high bandwidth provides links in the megabit or gigabit range for remote attachment of tape drives and disk.
Low interconnection cost - The cost of interconnections is a function of the distance covered. The cost of connections may be prohibitive over a long distance.
Relocation in case of a disaster - From a network connection viewpoint the recovery site may be relocated anywhere. If there is considerable physical interaction with the primary IT center, such as central printing and diskette or magnetic tape exchange, operations might not function at a remote recovery site.
Real-time-remote-update using interconnection technology - Hot backup strategies are not easy to implement over long distances. They require that a similar system is active at the recovery site. The primary site transmits database changes to the recovery system, which applies them in real time to a shadow database. Most of this technology must be implemented within application systems and requires that both systems are available at any one time. Functions like error recovery, synchronization, and takeover are usually extremely difficult to implement. It is much easier to attach the shadow database to the primary system using interconnection technology, where both the primary and the mirror database are controlled from the primary system. This depends on the ability of the application and/or system software to keep the shadow copy updated.
In-house disaster backup - Two separate sets of backup copies normally have to be maintained. One is required for recovery at the primary site to rebuild a database in the case of an error. The second is required for disaster recovery and should be taken off-site as soon as it is created. In many cases, the database system must be taken down during the backup process. The time available to do this can be very limited and maintaining two redundant backup processes may be a major problem. A solution to this problem may be to use remote tape for backup. Tape drives could be physically located at the recovery site, and connected to the primary site using interconnection technology links. These backups are immediately available for in-house recovery as well as for remote recovery in the event of a disaster. The distance between the primary and recovery sites being a given, technology that is currently available and proven must be reviewed and adopted to implement and support the disaster recovery strategy.
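The real-time-remote-update item above describes a shadow database controlled from the primary system. A minimal sketch of that idea follows, assuming a toy in-memory store and a hypothetical shadow_writer callable that stands in for the write over the interconnection link. The class and its behavior on failure are illustrative assumptions, not a description of any specific product.

```python
_MISSING = object()  # marks "key did not exist before this update"


class MirroredStore:
    """Toy primary store that keeps a shadow copy in step with every update."""

    def __init__(self, shadow_writer):
        self.primary = {}
        self._shadow_writer = shadow_writer  # callable(key, value); may raise

    def update(self, key, value):
        previous = self.primary.get(key, _MISSING)
        self.primary[key] = value
        try:
            self._shadow_writer(key, value)  # remote write to the shadow copy
        except Exception:
            # Shadow not updated: undo the primary change so both copies agree.
            if previous is _MISSING:
                del self.primary[key]
            else:
                self.primary[key] = previous
            raise
        return "committed"  # response is sent only after both writes succeed
```

Because the response waits for the remote write, every transaction pays for the link to the recovery site, which is one reason this approach favors a moderate distance.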
6) Network Infrastructure
A network capable of functioning after a disaster at the primary site is a major requirement in disaster recovery design.
In addition to handling the workload generated by user access to the primary site, a network designed for disaster recovery must provide the following:
Data shipment between sites - There will be a requirement for a constant data exchange between both sites. This data may include remote operation, system and application changes, systems management, database logs, and database backup copies. If complete files or database backup copies are to be sent across the network, extremely high capacity transmission links will be required. Currently available interconnection technology does not provide bandwidth in that range.
Capacity required in the event of a disaster - In the event of disaster, the resulting workload takeover can cause major shifts in network traffic. Users may change from local to remote attachment, and remote users may require significantly different transmission routes through the network.
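To judge whether a link can carry complete backup copies, a back-of-the-envelope transfer time is often enough. The sketch below assumes decimal units (1 GB = 8,000 megabits) and a hypothetical efficiency factor for the share of nominal bandwidth that is usable for payload. The function name and default value are illustrative assumptions.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Hours needed to ship data_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    available for payload after protocol overhead and competing traffic.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8000.0  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600.0
```

For example, under these assumptions a 450 GB backup copy over a fully dedicated 100 Mbit/s link (efficiency of 1.0) takes 10 hours, which may already exceed the available backup window.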
7) Network Topology
A network topology that provides a disaster recovery capability must have the following properties:
Network access for both sites - This is required both for workload takeover, as well as for periodic testing.
Separate gateways, one for each site - In the context of disaster recovery, a network gateway machine may be a single point of failure. Disaster recovery requires that a second gateway exists, and that these gateways are not located close to each other. The goal is that the recovery site does not lose its network access when a disaster occurs at the location of the primary site's gateway.
Both sites must be able to control the network - Large networks are typically controlled from one location. In the event of a disaster at the controlling location, a provision must exist to move the network control function to another site.
Isolated external paths - Despite the fact that two sites exist that each have their own network gateway, a single point of failure may exist in the external connection between the gateways and a public telephone interchange facility. Ideally, both gateways should be connected to separate public telephone interchange facilities.
Alternate paths for all locations - The network topology should provide alternate paths between host locations and all remote user locations.
Automatic path switching - Wherever path switching is required in the event of disaster recovery, it should take place automatically, or under central control. Remote users are typically not trained or sufficiently experienced to perform this task reliably.
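Several of the properties above come down to the absence of a single point of failure between user locations and the host sites. One way to check a proposed topology is sketched below, assuming the network is described as a hypothetical adjacency dictionary in which every component appears as a key. The check removes each component in turn and asks whether every user location can still reach a surviving host site.

```python
from collections import deque


def reachable(graph, start, removed):
    """Return the set of nodes reachable from start when `removed` is down."""
    if start == removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(graph, users, hosts):
    """List components whose loss cuts some user off from every host site."""
    spofs = []
    for candidate in graph:
        if candidate in users:
            continue  # losing a user location is not a network failure
        surviving_hosts = [h for h in hosts if h != candidate]
        for user in users:
            seen = reachable(graph, user, candidate)
            if not any(h in seen for h in surviving_hosts):
                spofs.append(candidate)
                break
    return spofs
```

A topology in which both gateways hang off one shared telephone interchange facility would report that facility as a single point of failure, while giving each gateway its own facility, both reachable from the user location, would return an empty list.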
Is Your Company Triple "A" Rated?
Aware - Of internal and external factors that could impact your business
Adaptable - To change the focus of your company's resources and expertise depending on changes in internal and external factors
Agile - To focus your company's resources and expertise effectively and efficiently in the quickest manner possible
What if you could do this proactively?
RL Consulting Utilizes:
The Proactive Business Model, Realization of Benefits, ITSM ROI, SAGA Business Strategy, and Enterprise Infrastructure Architecture
Leveraging Knowledge Management to Optimize Decision Support Systems Proactively
One step beyond IT Service Management,
One Generation beyond Information Systems
Downloads
White Papers:
» Leveraging ITSM and Six Sigma
» Project Mgt for Application Development
» ITSM Managed Services Methodology
» ITSM Outsourcing Questionnaire
» SLA Description and Templates
» Consolidation Questionnaires
» Service Continuity Methods
» Project Management Practices
» Developing a Communication Plan
» Data Management Process
» Proactive Business Model
» Realization of Benefits
» SAGA Business Strategy
» Enterprise Infrastructure Architecture
Service Briefs:
» ITSM Maturity Assessment
» Incident and Problem Management
» Service Continuity
» Configuration and Change Management
» Service Level Management
» Capacity Management
» Availability Management
» Release Management
Additional Information
Visit www.itsm.info to learn about ITSM, IT Infrastructure Library (ITIL), and SolutionMethod (a Policy Based ITSM Approach). In addition, you can download free and informative white papers, questionnaires, and service briefs. This includes more in-depth information on the topics presented in this newsletter.
To learn how RL Consulting can assist in achieving IT Service Management goals and our full range of solutions:
Contact us at RL_Consulting@ITSM.info or phone us at 602-996-6830
Volume 8 - November/December, 2003
This newsletter and the information contained herein is maintained by Rick Leopoldi and property of RL Information Consulting LLC.