By Hank Cranmore
I get this question all the time from other consultants: “What's the best backup out there?” It's a very important question. Unfortunately, many people answer it in haste with a prepackaged solution that is not a complete strategy.
This summary is the result of many hours of research to identify a best practice for data backup. Many folks think certain solutions like RAID are a real backup option, but they are not. And now, with the cloud promising offsite data storage, there is even more room for misunderstanding. You either back up your data or risk losing it.
You and you alone are responsible for your own data and for ensuring it is backed up and ready to recover as fast as you need it. Despite the promises of cloud and virtual computing technology, you will still need a local copy of your data and, in a worst-case scenario, even the original documents. How you manage and tolerate risk will show in your backup strategy, either as a balanced plan or as the lack of one.
Do you want to rely on a custom server that uses motherboard-based RAID? If so, chances are that a replacement board will not be around when you need it. Perhaps a RAID card and a drive cage system is a better idea. At least in a worst-case scenario you can get a board with an identical chipset, pull the RAID system from the dead server and drop it into the new system. Better yet, get a Dell or HP server with a three- to five-year, 4-hour or next-business-day parts replacement plan, and your hardware replacement concerns are largely taken care of. You just will not be able to keep the server for 10 years like some do with custom builds. You will need to plan regular hardware refreshes. Again, it's all about your budget and risk tolerance.
I have grouped this in stages, from planning through to the ultimate and worst data retrieval option: manual re-creation after total failure of all redundancy and backup methods.
SOLUTION Level/Type 0 – Plan, Design and Document SOP
You have to plan and document your backup and disaster recovery program based on your business needs, speed-of-recovery requirements, risk tolerance and budget.
SOLUTION Level/Type 1 – Build on Fault Tolerance – “The cost of failure”
Fault tolerance is the solid foundation that you want to build your network on. It keeps common problems from causing uncommon disasters.
POWER QUALITY issues can be a major source of problems if power is not conditioned and supplied properly and continuously. The building needs proper wiring and enough available amperage for actual needs to keep breakers from tripping. Surges, sags/brownouts and spikes can damage hardware and cause problems. As a rule of thumb, the higher the wattage on a power supply, the better the server or desktop can handle a surge or sag. A quick, high-voltage spike can cause immediate damage or cumulative hardware damage over time. Cheap everyday surge protectors use metal-oxide varistors and only protect a few times before becoming just a power strip. High-quality conditioners contain other technologies as well as heavy iron transformers. Use quality power conditioners and surge protectors that alert you when they are damaged ($50 to $1000) for printers, and a UPS with a quality inverter for PCs.
AIR CONDITIONING keeps your servers and hardware running at proper temperatures despite the heat generated by many systems running in an enclosed room. If an AC unit goes down in the server room, the network is down until it is fixed unless there is a backup unit. Heat can build up slowly and, if allowed to spike regularly, can accelerate equipment failure. Dirty filters can reduce efficiency and allow heat to build up.
HUMIDITY control protects against mildew, condensation, electrostatic discharge (ESD) and dry rot. Consider installing humidity monitoring and control devices.
WATER LEAKS require planning to move servers away from sources of potential water damage such as overhead pipes, water heaters, and condensation buildup. Consider installing leak and moisture detectors and even automatic water pumps.
VIBRATIONS and jarring shocks can cause downtime. Hard drive power connectors may vibrate loose. Slamming doors and outside vehicle activity can shake entire walls and any equipment near them. Consider installing shock sensors.
SOLUTION Level/Type 2 – Real Time/Hardware Redundancy RAID and MIRROR
RAID uses redundancy to store data across a mirror or array of multiple hard disk drives in real time. Redundant RAID levels protect against the loss of at least a single drive, but the protection is onsite only. Fire, or the loss of more drives than the array can tolerate, renders the data irretrievable. RAID can also overwrite good data with bad data, because a mistaken or corrupt write is copied to every drive immediately.
MIRROR Desktops running Windows XP can “mirror” files to a central server.
Please note that RAID or Mirror is not a backup solution.
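To make the point concrete, here is a toy Python model of a two-disk mirror. It is only a sketch: the MirroredVolume class, the file name and the file contents are made up for illustration and do not correspond to any real RAID controller or product. It shows that a mirror survives a lost drive, yet faithfully preserves a bad save, while a separate copy taken earlier still holds the good data.

```python
class MirroredVolume:
    """Toy two-disk mirror: every write lands on all disks at once."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, name, content):
        # The mirror has no notion of "good" or "bad" data;
        # whatever is written is copied to every disk immediately.
        for disk in self.disks:
            disk[name] = content

    def lose_disk(self, index):
        # Simulate a drive failure; the remaining disk keeps serving data.
        del self.disks[index]

    def read(self, name):
        # Raises IndexError once every disk has been lost.
        return self.disks[0][name]


volume = MirroredVolume()
volume.write("ledger.xls", "correct figures")

# A separate copy taken before the mistake (hypothetical earlier backup).
earlier_copy = dict(volume.disks[0])

volume.write("ledger.xls", "corrupted figures")  # bad save, mirrored instantly
volume.lose_disk(0)                              # the mirror survives this...

print(volume.read("ledger.xls"))   # corrupted figures
print(earlier_copy["ledger.xls"])  # correct figures
```

The mirror did its job when the drive was lost, but only the earlier copy can bring back the correct file.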
SOLUTION Level/Type 3 - Point in Time Backups
From my work with database programming I bring the concept of “Point in Time” to the data backup strategy. People who change tapes daily and rotate across a schedule of 7 or more tapes are already doing this manually. The point-in-time copy is made before the data is stored to the removable backup device, which allows the fastest possible data restoration; when used, it will be the primary backup you restore data from.
SNAPSHOTS. You can take a “picture in time” of your data at predefined times and preserve it for a specified period. Snapshots are good for when RAID and mirroring result in good data being overwritten by bad data, and they are best used to retrieve individual files. Multiple snapshots can exist for the same data location. Three snapshot schedules are common: one every few hours, one for each Friday evening and one for each past month. This should be one of the first locations you look to restore data. You still cannot take snapshots offsite. They can fail themselves, and they can take up a large amount of space, requiring 3 or more times the space of the actual existing data.
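The manual tape rotation can be sketched in a few lines of Python. This is an illustration only: the seven weekday tape labels are an assumption, and a real rotation may use more tapes or a different naming scheme.

```python
from datetime import date, timedelta

# Hypothetical labels for a seven-tape rotation, one tape per weekday.
TAPES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def tape_for(day: date) -> str:
    """Return the tape to load on the given day."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return TAPES[day.weekday()]


def restore_points(today: date) -> dict:
    """Map each tape to the date of the backup it holds after today's run."""
    points = {}
    for days_back in range(len(TAPES)):
        day = today - timedelta(days=days_back)
        points[tape_for(day)] = day
    return points
```

With seven tapes, each tape is overwritten a week later, so restore_points always covers today and the six days before it. Those seven dates are the “points in time” you can go back to.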
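The three-schedule snapshot scheme described above can be expressed as a small retention rule. The Python below is a sketch under stated assumptions: the retention windows, the 18:00 “evening” snapshot hour and the function names are all hypothetical, and real snapshot tools have their own settings for this.

```python
from datetime import datetime, timedelta

# Assumed values; tune these to your own space and recovery needs.
EVENING_HOUR = 18                     # hour of the snapshot treated as "evening"
RECENT_WINDOW = timedelta(days=2)     # keep every snapshot this recent
WEEKLY_WINDOW = timedelta(weeks=5)    # keep Friday-evening snapshots this long
MONTHLY_WINDOW = timedelta(days=365)  # keep month-end snapshots this long


def is_friday_evening(ts: datetime) -> bool:
    return ts.weekday() == 4 and ts.hour == EVENING_HOUR


def is_month_end(ts: datetime) -> bool:
    # Last day of a month: the following day falls in a different month.
    next_day = ts + timedelta(days=1)
    return next_day.month != ts.month and ts.hour == EVENING_HOUR


def should_keep(ts: datetime, now: datetime) -> bool:
    """Decide whether the snapshot taken at `ts` is still worth keeping."""
    age = now - ts
    if age <= RECENT_WINDOW:
        return True
    if is_friday_evening(ts) and age <= WEEKLY_WINDOW:
        return True
    if is_month_end(ts) and age <= MONTHLY_WINDOW:
        return True
    return False


def prune(snapshots, now):
    """Return only the snapshot timestamps that the rule keeps."""
    return [ts for ts in snapshots if should_keep(ts, now)]
```

Every snapshot that the rule keeps costs disk space, which is why the windows above are the first thing to adjust when snapshots start crowding out live data.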
SOLUTION Level/Type 4 - Desktop and Server Image profiles
An OS drive can be restored from an image in 15 minutes or less, versus 8+ hours for a manual rebuild. Include images with the data to be backed up, or treat them as a separate backup process subject to the same scope of solutions. Imaging can be done across the network from an onsite image server. It can also be used for Point in Time snapshots of machines.
SOLUTION Level/Type 5 - Onsite Restore Backup
Backups stored onsite allow for easier and faster local recovery and thus are critical to include in your solution, yet they are rendered useless by a fire, theft or other physical disaster. Many onsite backup solutions cannot be taken offsite. Fire safes can be used, but that is still onsite backup storage, and other disasters can still render the backups useless or unobtainable.
SOLUTION Level/Type 6 - Offsite Archive and Restore Backup Storage
Offsite backup is more demanding to maintain and archive. It may be slower to obtain and recover data this way. However, the data is offsite and not subject to fire or other disaster at the workplace. You should always include an offsite backup in any solution, as it is generally the last option before costly recovery or total data loss.
SOLUTION Level/Type 7 – Data Recovery
Used when Backups have failed or were never done. Can be attempted on any media. Not guaranteed to work. Data retrieved may be unusable. Risky, do not depend on it. Yet you should have a contingency plan in place for this option. Accidents and bad luck happen.
SOLUTION Level/Type 8 – Hard Copy
All planning and procedures to protect, retrieve, recover and restore data have failed. You must now go back to the original paperwork and re-enter all data manually. Same thing as not having a backup. This is the most costly of all possibilities in time, money and duplicated productivity.
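Taken together, Levels 3 through 8 form a fallback order for getting data back. A minimal Python sketch of that order follows; the source labels and the function name are made up for illustration, and the ordering simply follows the level numbers used in this article.

```python
# Restore sources in the order of the levels above, fastest and cheapest first.
RESTORE_ORDER = [
    "snapshot",        # Level 3
    "system image",    # Level 4
    "onsite backup",   # Level 5
    "offsite backup",  # Level 6
    "data recovery",   # Level 7
    "hard copy",       # Level 8
]


def first_usable_source(available):
    """Return the first source in RESTORE_ORDER that is in `available`, or None."""
    for source in RESTORE_ORDER:
        if source in available:
            return source
    return None


# After a fire that took the server room and the onsite backups:
print(first_usable_source({"offsite backup", "hard copy"}))  # offsite backup
```

The further down the list you are forced to go, the slower and more expensive the recovery, which is the whole argument for keeping several levels in place at once.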