Equilibrium IT Solutions - Chicago, IL
Technology Consulting

Trusted Technology Consulting

Disaster Recovery Plan
Disaster Recovery Architecture

According to a recently published study, of the companies that suffer a major loss of business data, 43% never reopen, 51% close within two years, and only 6% survive long-term.

General Overview

Disaster recovery strategy and planning involves identifying and defining the processes, policies and procedures for recovering mission-critical technology infrastructure after a disaster event. Often this effort is part of a larger company-wide effort known as business continuity planning.

A formal Disaster Recovery Strategy should define the organization's RPO and RTO for specific business processes (e.g. payroll, order processing, etc.).

  • Recovery Point Objective (RPO) – What is an acceptable amount of data loss, measured in time? For example, recovering a failed file server from the previous day’s tape means up to one day of data loss.
  • Recovery Time Objective (RTO) – What is an acceptable amount of time to recover a failed system or business process? For example, restoring the file server from tape can take 8 hours.

The metrics specified for the business processes must then be mapped to the underlying IT systems and infrastructure that support those processes. Once the RTO and RPO metrics have been mapped to IT infrastructure, we can determine the best recovery strategy for each system.
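
The mapping step can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any particular tool: the process names, system names and hour values are hypothetical, and the sketch assumes that a system shared by several business processes inherits the strictest (smallest) RPO and RTO among them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rpo_hours: float  # acceptable data loss, measured in time
    rto_hours: float  # acceptable time to recover


# Hypothetical business processes: (objective, supporting systems).
PROCESSES = {
    "payroll": (Objective(rpo_hours=24, rto_hours=8), ["fileserver", "hr-db"]),
    "order processing": (Objective(rpo_hours=1, rto_hours=4), ["order-db", "fileserver"]),
}


def map_to_systems(processes):
    """Give each system the strictest RPO and RTO of the processes it supports."""
    systems = {}
    for objective, supporting in processes.values():
        for name in supporting:
            current = systems.get(name)
            if current is None:
                systems[name] = objective
            else:
                systems[name] = Objective(
                    min(current.rpo_hours, objective.rpo_hours),
                    min(current.rto_hours, objective.rto_hours),
                )
    return systems


def meets(objective, strategy_rpo_hours, strategy_rto_hours):
    """True if a strategy's worst-case data loss and recovery time fit the objective."""
    return (strategy_rpo_hours <= objective.rpo_hours
            and strategy_rto_hours <= objective.rto_hours)


SYSTEM_OBJECTIVES = map_to_systems(PROCESSES)
```

With these assumed numbers, a nightly tape backup (up to 24 hours of data loss, 8 hours to restore) would satisfy `hr-db` but not `fileserver`, because the file server also supports order processing and so inherits its tighter objectives.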

The following is a list of the most common strategies for data protection.

  • Disk to Tape Backups
  • Disk to Disk to Tape Backups
  • Disk to Disk Backups Replicated to Off-Site Disk
  • Replication of Data to a Hot-Site location, which overcomes the need to restore the data (only the systems then need to be restored or synced). This generally makes use of Storage Area Network (SAN) technology. 
  • High availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, for example Microsoft Exchange Cluster Continuous Replication (CCR).

In addition to preparing for the need to recover systems, organizations must also implement precautionary measures with an objective of preventing a disaster situation in the first place. These may include some of the following:

  • Disk protection technology such as RAID 
  • Surge Protectors to minimize the effect of power surges
  • Uninterruptible Power Supply (UPS) and/or Backup Generator 
  • Fire Prevention measures, more accessible fire extinguishers 
  • Anti-virus software

Disaster Recovery Planning - Gather the facts about the current state of the IT environment.

Have a Detailed Network Diagram

A network diagram should be created as part of the Disaster Recovery Strategy to understand the layout of the entire infrastructure and areas of concern. This diagram should depict all production hardware that affects daily activity, including routers, firewalls, switches, cabling, patch cables, servers, UPSs, environmental systems, printers, wireless access points, and WAN circuits.

Have an Accurate Hardware and Software Inventory

A database or spreadsheet should exist listing all hardware and software. This list should include specific information about each item, e.g. serial numbers, key codes, service tags, location, type, and age of the hardware. The list should be updated regularly to ensure it is current.
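
As a rough illustration of what one inventory record might hold and how "updated regularly" could be checked, here is a minimal Python sketch. The field names, the sample entries and the 90-day review interval are all assumptions made for the example, not requirements from this page.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class InventoryItem:
    name: str
    kind: str            # "hardware" or "software"
    serial_or_key: str   # serial number, service tag, or key code
    location: str
    acquired: date       # used to estimate the age of the item
    last_verified: date  # when the record was last confirmed accurate


def stale_items(items, today, max_age_days=90):
    """Return items whose record has not been verified within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [item for item in items if item.last_verified < cutoff]


# Hypothetical entries; the serial numbers and key codes are placeholders.
INVENTORY = [
    InventoryItem("File server", "hardware", "SN-0001", "Server room",
                  date(2006, 3, 15), date(2010, 5, 1)),
    InventoryItem("Accounting package", "software", "KEY-0002", "File server",
                  date(2008, 1, 10), date(2009, 12, 1)),
]
```

Calling `stale_items(INVENTORY, today=date(2010, 6, 1))` would flag only the accounting package, whose record was last verified more than 90 days before that date.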

Failure Mode Effects Analysis (FMEA) Planning

The FMEA is a risk assessment tool that helps systematically define where potential points of failure are located within a network, determine how critical each one is, and logically lay out plans to resolve them before they become a problem.

The purpose of the FMEA is to:

  • Attempt to resolve potential failures in order of critical nature. 
  • Systematically define the potential failures and lay out the plan to resolve them.
  • Assist in the transformation of network management from reactive to proactive.

There are three indicators that collectively generate the priority ranking of a failure. Each is rated on a scale from 1 to 10.

  • Severity: This relates to the relative impact of a failure on the part of the infrastructure with which the device, system or software is associated. Drawing on the Business Impact Analysis, the severity ranking is based on the extent of the outage. If the productivity of the entire network is affected, the severity is higher than if a single user is affected. Each organization will have its own definition of what is a short-term and a long-term outage.
  • Detection: This relates to the ability to effectively detect that a problem has occurred or will occur. Controls and monitoring systems are key to increasing the capability of detecting a problem. If there is limited or no detection capability, the detection ranking will be higher than if a complete monitoring system with paging or email capability is operational.
  • Occurrence: This relates to the probability that the failure will occur. Several factors can increase this probability. The older a device, system or software is, the higher the occurrence ranking will be compared with newly implemented equivalents. Insufficient redundancy or a single point of failure also raises the probability of a failure. MTBF (Mean Time Between Failures) is known for most hardware on the market today and can help determine the probability of a failure.

Each failure mode has its own priority ranking. This allows the failures to be prioritized. The next step is to determine which index is causing the high priority ranking and attempt to reduce it.

  • Severity of a failure mode cannot change unless the design of the infrastructure changes. 
  • Detection can be reduced by implementing monitoring tools that email or page engineers when a problem has occurred or is about to occur.
  • Occurrence can be reduced by upgrading devices, systems or software and/or eliminating single points of failure.
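
The ranking and follow-up steps above can be sketched in Python. The page does not state how the three ratings are combined, so this example assumes the conventional FMEA formula, in which the Risk Priority Number (RPN) is the product Severity × Occurrence × Detection. The failure modes and their ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1-10, impact of the failure
    occurrence: int  # 1-10, probability that the failure happens
    detection: int   # 1-10, higher means harder to detect

    def __post_init__(self):
        for field in ("severity", "occurrence", "detection"):
            value = getattr(self, field)
            if not 1 <= value <= 10:
                raise ValueError(f"{field} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        # Assumed conventional formula: product of the three ratings.
        return self.severity * self.occurrence * self.detection

    def reducible_driver(self):
        """Return whichever of occurrence/detection is rated higher.

        Severity is skipped because it changes only if the infrastructure
        is redesigned. Ties go to occurrence.
        """
        return "occurrence" if self.occurrence >= self.detection else "detection"


def prioritize(modes):
    """Sort failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical failure modes and ratings.
MODES = [
    FailureMode("core switch fails", severity=9, occurrence=3, detection=2),
    FailureMode("file server disk fails", severity=7, occurrence=6, detection=8),
    FailureMode("single desktop fails", severity=2, occurrence=5, detection=4),
]
RANKED = prioritize(MODES)
```

Under these assumed ratings the file server disk failure ranks first with an RPN of 336 (7 × 6 × 8), and its highest reducible index is detection, which points toward adding monitoring before replacing hardware.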

Equilibrium’s Service Offering for Disaster Recovery Strategy (DRS)/ Failure Mode & Effects Analysis (FMEA)

When Equilibrium develops a comprehensive Disaster Recovery Strategy report for a client, we perform a Business Impact Analysis (BIA), create detailed network illustrations, bring hardware and software inventories up to date, document installation and configuration procedures for all critical production hardware and software, develop a Failure Mode & Effects Analysis (FMEA) report, and document failure flow charts and hardware and software failure procedures.

Scope of Services:

  • Interview Management to explore and understand key business and technology requirements for Disaster Recovery.
  • Create an inventory of all production hardware and software pertinent to day-to-day business, including the age of each device and whether it has a current maintenance agreement.
  • Create a Project Plan that lays out the entire process of creating the Disaster Recovery Strategy.
  • Define the participants for all meetings and generate a meeting schedule to minimize meeting conflicts. 
  • Generate a comprehensive Disaster Recovery Strategy manual that defines what to do in case of a disaster. This will include:
    • Current Network Diagram
    • Complete list of Hardware and Software pertinent to day-to-day business needs.
    • FMEA that defines all potential failure modes, their associated RPN (Risk Priority Number), and the Action Plan to reduce the significant offenders.
    • Failure flow charts to define the actions needed to find failed devices or software.
    • Failure Procedures to define the steps to resolve the failed devices or software.

Deliverables:

  • Disaster Recovery Strategy Manual (a playbook for recovering from system failures)
  • Action Plan which breaks down cost and time to complete.

To request a consultation with one of our Senior Consultants, call us at 773-205-0200.

© 2010 Equilibrium, Inc. All Rights Reserved.
5080 N Elston Ave. Chicago, IL 773.205.0200