- Disk to Disk Backups Replicated to Off-Site Disk
- Replication of data to a hot-site location, which removes the need to restore the data (only the systems then need to be restored or synced). This generally makes use of Storage Area Network (SAN) technology.
- High availability systems that keep both the data and the systems replicated off-site, enabling continuous access to systems and data, for example Microsoft Exchange CCR (Cluster Continuous Replication).
- Disk to Tape Backups
- Disk to Disk to Tape Backups
In addition to preparing to recover systems, organizations must also implement precautionary measures aimed at preventing a disaster in the first place. These may include some of the following:
- Disk protection technology such as RAID
- Surge Protectors to minimize the effect of power surges
- Uninterruptible Power Supply (UPS) and/or Backup Generator
- Fire prevention measures, such as more accessible fire extinguishers
- Anti-virus software
Disaster Recovery Planning - Gather the facts about the current state of the IT environment.
Have a Detailed Network Diagram
A network diagram should be created as part of the Disaster Recovery Strategy to understand the layout of the entire infrastructure and its areas of concern. This diagram should depict all production hardware that affects daily activity, including routers, firewalls, switches, cabling, patch cables, servers, UPSs, environmental systems, printers, wireless access points, and WAN circuits.
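One possible way to make the diagram actionable is to keep a machine-readable copy of it as an adjacency map and check which devices would split the network if they failed. The sketch below is only an illustration: the device names and links are hypothetical, and the brute-force check (take one device down, then test reachability) is an assumption about how areas of concern might be flagged, not a method prescribed here.

```python
from collections import deque

# Hypothetical topology: each key is a device, each value lists its direct links.
NETWORK = {
    "wan-circuit": ["router"],
    "router": ["wan-circuit", "firewall"],
    "firewall": ["router", "core-switch"],
    "core-switch": ["firewall", "file-server", "mail-server", "wireless-ap"],
    "file-server": ["core-switch"],
    "mail-server": ["core-switch"],
    "wireless-ap": ["core-switch"],
}


def reachable(graph, start, removed):
    """Return the devices reachable from `start` while `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph):
    """List devices whose loss leaves the surviving devices disconnected."""
    critical = []
    for device in graph:
        survivors = [d for d in graph if d != device]
        if len(survivors) < 2:
            continue
        # If a search from any survivor cannot reach all others, the network split.
        if len(reachable(graph, survivors[0], device)) < len(survivors):
            critical.append(device)
    return sorted(critical)


spof = single_points_of_failure(NETWORK)
print(spof)
```

In this made-up topology the router, firewall and core switch are flagged, because every path between the WAN circuit and the servers runs through them.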
Have an Accurate Hardware and Software Inventory
A database or spreadsheet should exist listing all hardware and software. This list should include specific information about each item, e.g. serial numbers, key codes, service tags, location, type, and age. The list should be updated regularly to ensure it is current.
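As a minimal sketch of what one inventory record could look like, the example below models an item with the fields listed above and flags entries that have not been re-verified recently. The field names, the sample values, and the 90-day review interval are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class InventoryItem:
    # Field names are illustrative; adapt them to your own records.
    name: str
    kind: str           # "hardware" or "software"
    item_type: str      # e.g. "switch", "server", "operating system"
    serial_or_key: str  # serial number, key code, or service tag
    location: str
    acquired: date
    last_verified: date


def age_in_years(item, today):
    """Approximate age of an item in years."""
    return (today - item.acquired).days / 365.25


def stale_items(items, today, max_days=90):
    """Return items not verified within max_days (an assumed review interval)."""
    return [i for i in items if (today - i.last_verified).days > max_days]


def to_csv(items):
    """Export the inventory as CSV text, e.g. for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(InventoryItem)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))
    return buffer.getvalue()


# Hypothetical sample data.
items = [
    InventoryItem("core-switch", "hardware", "switch", "SN-EXAMPLE-001",
                  "server room", date(2019, 3, 1), date(2024, 1, 15)),
    InventoryItem("mail-server-os", "software", "operating system", "KEY-EXAMPLE-002",
                  "mail-server", date(2022, 6, 1), date(2024, 5, 20)),
]
today = date(2024, 6, 1)
stale = stale_items(items, today)
print([item.name for item in stale])
print(to_csv(items))
```

Recording a last-verified date alongside each entry is one simple way to tell whether the list is still current.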
Failure Mode Effects Analysis (FMEA) Planning
The FMEA is a risk assessment tool that helps systematically define where potential points of failure are located within a network, determine how critical each problem is, and logically lay out plans to resolve them before they cause an outage.
The purpose of the FMEA is to:
- Attempt to resolve potential failures in order of critical nature.
- Systematically define the potential failures and lay out the plan to resolve them.
- Assist in the transformation of network management from reactive to proactive.
There are three indicators that collectively generate the priority ranking of a failure. Each is rated on a scale from 1 to 10.
- Severity: This relates to the relative impact of a failure on the part of the infrastructure with which the device, system or software is associated. In Business Impact Analysis terms, the severity ranking is based on the extent of the outage. If the productivity of the entire network is affected, the severity is higher than if a single user is affected. Each organization will have its own definition of what constitutes a short-term and a long-term outage.
- Detection: This relates to the ability to effectively detect that a problem has occurred or will occur. Controls and monitoring systems are key to increasing the capability of detecting a problem. If there is limited or no detection capability, the detection ranking will be higher than if a complete monitoring system with paging or email capability were operational.
- Occurrence: This relates to the probability that the failure will occur. Several factors can increase that probability. The age of a device, system or software can increase it, as can insufficient redundancy or a single point of failure. MTBF (Mean Time Between Failures) is known for most hardware on the market today and can help determine the probability of a failure. If the device, system or software is older, the occurrence factor will be higher than if new devices, systems or software have been implemented.
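The text does not spell out how the three ratings combine into a single ranking. In conventional FMEA practice the Risk Priority Number (RPN) is the product Severity x Occurrence x Detection, and the sketch below assumes that convention. The failure modes and their ratings are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    severity: int    # 1 (single user affected) .. 10 (entire network down)
    occurrence: int  # 1 (unlikely) .. 10 (very likely)
    detection: int   # 1 (fully monitored) .. 10 (no detection capability)

    def __post_init__(self):
        # Enforce the 1 to 10 scale used for every indicator.
        for label in ("severity", "occurrence", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        # Assumed convention: RPN is the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and ratings.
modes = [
    FailureMode("Core switch fails (no spare)", severity=9, occurrence=4, detection=3),
    FailureMode("Mail server disk fails (no RAID)", severity=7, occurrence=6, detection=8),
    FailureMode("Single workstation fails", severity=2, occurrence=5, detection=2),
]
ranked = prioritize(modes)
for mode in ranked:
    print(mode.rpn, mode.description)
```

With these made-up ratings the unmonitored mail server disk ranks first (RPN 336), ahead of the core switch (108) and the workstation (20), even though the switch has the higher severity.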
Each failure mode has its own priority ranking. This allows the failures to be prioritized. The next step is to determine which index is causing the high priority ranking and attempt to reduce it.
- Severity of a failure mode cannot change unless the design of the infrastructure changes.
- The detection ranking can be reduced by implementing monitoring tools that email or page engineers when a problem has occurred or is about to occur.
- Occurrence can be reduced by upgrading devices, systems or software and/or eliminating single points of failure.
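The step of finding which index drives a high ranking can be sketched as a small lookup. This is an illustration under stated assumptions: the ratings are hypothetical, the mitigation hints paraphrase the three points above, and the tie-breaking order (detection, then occurrence, then severity) is an assumed preference for the indexes that can be lowered without redesigning the infrastructure.

```python
# Mitigation hints paraphrased from the three points above.
ACTIONS = {
    "severity": "Redesign the infrastructure; severity cannot change otherwise.",
    "detection": "Add monitoring that emails or pages engineers.",
    "occurrence": "Upgrade aging components or remove single points of failure.",
}

# Assumed tie-breaking order: prefer indexes that are cheaper to reduce.
TIE_BREAK_ORDER = ["detection", "occurrence", "severity"]


def dominant_index(ratings):
    """Return the name of the highest-rated index for one failure mode."""
    # max() keeps the first maximum it meets, so ties follow TIE_BREAK_ORDER.
    return max(TIE_BREAK_ORDER, key=lambda name: ratings[name])


def recommend(ratings):
    """Return the dominant index and the matching mitigation hint."""
    name = dominant_index(ratings)
    return name, ACTIONS[name]


# Hypothetical ratings for two failure modes.
mail_disk = {"severity": 7, "occurrence": 6, "detection": 8}
core_switch = {"severity": 9, "occurrence": 9, "detection": 3}

print(recommend(mail_disk))
print(recommend(core_switch))
```

For the first example the detection rating dominates, so monitoring is the suggested first step. In the second, severity and occurrence tie, and the sketch points at occurrence because it can be lowered without a redesign.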
Equilibrium’s Service Offering for Disaster Recovery Strategy (DRS)/ Failure Mode & Effects Analysis (FMEA)
When Equilibrium develops a comprehensive Disaster Recovery Strategy report for a client, we perform a Business Impact Analysis (BIA), create detailed network illustrations, bring hardware and software inventories up to date, document installation and configuration procedures for all critical production hardware and software, develop a Failure Mode & Effects Analysis (FMEA) report, and document failure flow charts and hardware and software failure procedures.
Scope of Services:
- Interview Management to explore and understand key business and technology requirements for Disaster Recovery.
- Create an inventory of all production hardware and software pertinent to day-to-day business, including the age of each device and whether it has a current maintenance agreement.
- Create a Project Plan that lays out the entire process of creating the Disaster Recovery Strategy.
- Define the participants for all meetings and generate a meeting schedule to minimize meeting conflicts.
- Generate a comprehensive Disaster Recovery Strategy manual that defines what to do in case of a disaster. This will include:
  - Current Network Diagram
  - Complete list of Hardware and Software pertinent to day to day business needs.
  - FMEA that defines all potential failure modes, their associated RPN (Risk Priority Number), and the Action Plan to reduce the significant offenders.
  - Failure flow charts to define the actions needed to find failed devices or software.
  - Failure Procedures to define the steps to resolve the failed devices or software.
- Disaster Recovery Strategy Manual (a playbook for recovering from system failures)
- Action Plan which breaks down cost and time to complete.
To request a consultation with one of our Senior Consultants, call us at 773-205-0200.