In Part 1 we discussed the need for a discrete location and the networking aspects of DR. Now we will dive into the Replication, Automation and Management components.
3) Replication software
This is typically the most visible of the pieces, but that still doesn't mean it is being done adequately. The decision of which replication software to use is not going to be the same for everyone. It's highly dependent on how the customer's business operates and what type of applications they want to protect.
First and foremost are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These need to be driven by the business itself. Too often I see the RPO and RTO set to simply what the product can do rather than what the business needs.
What I want to see is IT talking to the business stakeholders, determining what RPO and RTO each type of application requires. Just because an application is Tier 1 doesn't automatically mean it needs the lowest RPO.
Most important VM ≠ Lowest RPO
When determining the RPO for each application, it should be driven first by the stakeholders: what they say is an acceptable loss of data in the event of a disaster. Then, by considering how chatty or static the VM is, we can start to rank it against the other applications being protected. In the event of link congestion, the most critical changes are then replicated to the DR site first. This ensures that the applications that NEED the lowest RPO receive that sacred bandwidth.
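That ranking idea can be sketched in a few lines. This is a minimal illustration, not any vendor's actual scheduler: the VM names, RPO values, and change rates below are made-up assumptions, and `replication_priority` is a hypothetical helper showing how stakeholder-set RPO and VM chattiness could combine into a replication order.

```python
from dataclasses import dataclass

@dataclass
class ProtectedVM:
    name: str
    rpo_minutes: int            # acceptable data loss, agreed with stakeholders
    change_rate_mb_min: float   # how "chatty" the VM is

def replication_priority(vms):
    # Tightest RPO first; among equal RPOs, chattier VMs first,
    # since they accumulate un-replicated changes faster.
    return sorted(vms, key=lambda v: (v.rpo_minutes, -v.change_rate_mb_min))

# Illustrative workloads only.
vms = [
    ProtectedVM("file-server",  rpo_minutes=240, change_rate_mb_min=5.0),
    ProtectedVM("erp-db",       rpo_minutes=15,  change_rate_mb_min=40.0),
    ProtectedVM("web-frontend", rpo_minutes=60,  change_rate_mb_min=2.0),
]

for vm in replication_priority(vms):
    print(vm.name, vm.rpo_minutes)
```

Under congestion, a scheduler working down this list would keep the ERP database closest to its RPO while letting the static file server fall behind first.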
There is no silver bullet! The best solution for one company may be different to that of another because each business has different requirements and drivers. With that in mind it’s little surprise then to find out that Zettagrid have more than one DR solution on offer. I strongly urge you to go through the pros and cons of them to determine what’s important to you. Then go with the product that best meets your requirements.
4) Automation/middleware to stitch DR all together
Sure, you might have a big red button to fail over or to test in your current solution. Typically that’s for the VMs and local networks and to be honest most solutions can accomplish that to some degree.
How was the setup, though? Did it involve sending admins on courses to learn the software before it could even be installed? Often the installation and setup end up incredibly complex and convoluted. I've lost count of the times I've heard that DR solution X got a bad rap, often because the IT admins or consultants just couldn't get the setup perfect, so they dropped it and moved on.
So, in this context, when I refer to automation or some sort of middleware I'm not talking about the day-to-day running of the solution after a 6-month installation project… I'm talking about automating the implementation of the solution itself. You only have one chance to make a first impression, and the same goes for new DR products, so you don't want long lead times muddying the waters.
The installation and setup should be almost 100% automated. The product then has the best chance of buy-in across the business, and therefore a great shot at longevity as something the business can rely on for many years.
Choose a solution that not only gives you automated test/live failover and easy protection of VMs, but also automates the deployment and configuration of the following:
- Compute and storage resources in the second data centre.
- Firewall device in the second data centre.
- Network connectivity to the production data centre for replication (IPsec-VPN, SSL-VPN or MPLS etc).
- An independent portal in the secondary data centre from which to operate and manage the newly failed-over applications.
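The four components above can be treated as a declarative checklist that the deployment automation validates before it runs. This is a hedged sketch only: the spec keys, values, and the `validate` helper are assumptions for illustration, not any specific vendor's API.

```python
# Hypothetical spec of what automated DR deployment should stand up
# in the secondary data centre. All keys and values are illustrative.
dr_site_spec = {
    "compute":           {"vcpu": 64, "ram_gb": 256, "storage_tb": 10},
    "firewall":          {"model": "virtual-edge", "rules_source": "production-export"},
    "replication_link":  {"type": "ipsec-vpn", "peer": "prod-dc"},
    "management_portal": {"independent": True},
}

# The four items the section says must be automated.
REQUIRED = {"compute", "firewall", "replication_link", "management_portal"}

def validate(spec):
    """Fail fast if the automated deployment would miss a component."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"DR spec incomplete, missing: {sorted(missing)}")
    return True

print(validate(dr_site_spec))
```

The point of the design is that a missing firewall or replication link is caught at deployment time, not discovered during a failover.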
5) Management in DR of people and processes
The final piece of the puzzle is the creation or inclusion of everything into the Business Continuity Plan (BCP) or DR Plan.
Many people I speak with believe that the run-book from their DR product is good enough to be their DR plan, and this is simply not true. There is still a need to formally think about people and processes. Then, if/when the proverbial hits the fan, there is no guesswork: you simply follow the steps you've written to a safe recovery and return to operation.
I will leave you with a few open-ended questions. These should start some conversations in the business or with your customer to ensure that they're on the right path to success:
- Who in the business is authorised to declare an outage a DR event?
- Where do people work from in DR?
- How do people access servers and applications in DR?
- Will you have terminal servers to connect into, VDI, or something separate?
- Does everyone required know the location of the BCP or DR plans?
- How often will you schedule DR tests to build confidence internally?
- Do you have any physical devices like desk phones or printers that you need in DR?
- What are you doing for external DNS?
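The external DNS question deserves a moment of arithmetic. If your DR cutover relies on updating public DNS records, the worst case for clients to follow you to the DR site is roughly the record's TTL plus however long it takes you to push the change. A minimal sketch, with the function name and example figures being assumptions for illustration:

```python
def worst_case_cutover_minutes(ttl_seconds, update_delay_minutes):
    # Clients may cache the old record for up to the full TTL
    # after you have finished updating DNS.
    return ttl_seconds / 60 + update_delay_minutes

# e.g. a 300-second TTL and 10 minutes to push the record change:
print(worst_case_cutover_minutes(300, 10))  # 15.0
```

If that number exceeds your RTO, lowering TTLs ahead of time (or using a health-checked DNS failover service) belongs in the DR plan.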
I hope this encourages some conversations within IT about altering our way of thinking: seeing DR as more than just a replication product, thinking about it holistically, and making sure it's not a mere tick box so that the business truly lowers its risk.