Business continuity and disaster recovery (BCDR) programs are more successful with top-down enforcement that starts with the board and the executive team and holds key stakeholders accountable for their respective pieces of the program. This piece details BCDR best practices for migrating to the cloud while ensuring recoverability and continuity.
Planning for Cloud Platform Failure
Cloud platforms have tremendous resiliency, but nothing is completely fault tolerant. When migrating to cloud services, BCDR plans should consider:
- Architectural resilience: While assessing where failures can occur, look at historical performance and evaluate track records. Also, consider using multiple cloud service provider (CSP) regions to help protect your workloads when one region within a provider is unavailable.
- Standardization: To ensure the plan is manageable and affordable, avoid customization and work to stay within the cloud platform’s abilities to meet SLAs. This means setting expectations with business units that may expect little or no loss of service. When business units request exceptions, this often requires customization, which can quickly get expensive and complex. In addition, it’s important to ensure technical teams have the training to manage the environment and the program.
- Criticality: If everything is critical, nothing is critical. Perform a risk assessment to determine the likelihood of failure and use that to determine the level of service required for each application and system. However, it’s important to realize that deciding which applications and systems take precedence is ultimately a business decision.
- Shared responsibility: CSPs should clearly outline their responsibilities. However, companies using one or more CSPs are ultimately responsible for their data and how it is operationalized within those CSPs.
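As a rough illustration of the architectural resilience point, the arithmetic behind adding a second region can be sketched as follows. This is a simplified model, not a provider formula: it assumes regions fail independently and that failover always works, which real outages do not always respect. The 99.9% figure and the function names are hypothetical.

```python
def combined_availability(region_availability: float, regions: int) -> float:
    """Availability of a workload that stays up if at least one region is up.

    Assumption (hypothetical): regions fail independently and failover
    between them is instantaneous and always succeeds.
    """
    if not 0.0 <= region_availability <= 1.0:
        raise ValueError("region_availability must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # The workload is down only when every region is down at the same time.
    return 1.0 - (1.0 - region_availability) ** regions


def expected_downtime_hours_per_year(availability: float) -> float:
    """Convert an availability fraction into expected hours of downtime per year."""
    hours_per_year = 365 * 24
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    single_region = 0.999  # hypothetical figure, not a published SLA
    for count in (1, 2):
        availability = combined_availability(single_region, count)
        downtime = expected_downtime_hours_per_year(availability)
        print(f"{count} region(s): {availability:.6f} available, "
              f"{downtime:.2f} h/year expected downtime")
```

Because correlated failures (a shared control plane, a bad deployment, a common dependency) break the independence assumption, treat the result as an optimistic upper bound when weighing the cost of a second region.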
How to Prepare for Cloud Migration
On a regular basis, stakeholders should come to an agreement on what is important to the business. Settle this early so proper controls can be put in place. Many of the same principles apply in the cloud, so the following may already be included in your current BCDR plan, but organizations migrating to the cloud should prepare by:
- Identifying critical services and assets: Typically, a business impact assessment (BIA) is available to illustrate what’s most important in the event of an incident. If there is a good plan in place, much of the work should already be done. However, this is also a good time to reassess priorities with technology and business stakeholders to ensure alignment.
- Determining recovery time objectives (RTOs) and recovery point objectives (RPOs): Based on the BIA, consider the maximum amount of data loss the business is willing to accept and use that to build your RPO. For critical applications, this will be one of the more important decisions made. Similarly, the maximum downtime you can afford is your RTO.
- Taking your cloud service model into account: SaaS offers a complete solution but is heavily reliant on the provider. With IaaS, the business has the most control and responsibility but will need to manage its own backup and recovery. PaaS is set up and readily available but offers less customization and flexibility.
- Designing an architecture and migration strategy: Consider using different regions (generally in the same country) to help protect against a widespread outage. Critical data, applications and regulatory requirements should all be considered to ensure fault-tolerant designs.
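One way to make the RTO and RPO discussion above concrete is to compare a proposed recovery design against the objectives the business has agreed to. The sketch below is a minimal illustration under stated assumptions: the `RecoveryDesign` fields, the way worst-case loss and downtime are estimated, and the sample numbers are all hypothetical and would need to be adapted to how a given application is actually protected.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryDesign:
    """Hypothetical description of how one application is protected."""
    name: str
    backup_interval: timedelta       # time between recovery points
    replication_lag: timedelta       # delay before a recovery point reaches the DR region
    detect_and_declare: timedelta    # time to notice the failure and declare a disaster
    restore_and_validate: timedelta  # time to restore service and confirm it works


def worst_case_data_loss(design: RecoveryDesign) -> timedelta:
    # Assumed model: a failure just before the next recovery point loses one
    # full interval, plus anything not yet replicated to the DR region.
    return design.backup_interval + design.replication_lag


def worst_case_downtime(design: RecoveryDesign) -> timedelta:
    # Assumed model: downtime is the time to declare plus the time to restore.
    return design.detect_and_declare + design.restore_and_validate


def meets_objectives(design: RecoveryDesign, rpo: timedelta, rto: timedelta) -> dict:
    """Report whether the design's worst case fits inside the agreed RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(design) <= rpo,
        "rto_met": worst_case_downtime(design) <= rto,
    }


if __name__ == "__main__":
    billing = RecoveryDesign(
        name="billing",  # hypothetical application
        backup_interval=timedelta(minutes=15),
        replication_lag=timedelta(minutes=5),
        detect_and_declare=timedelta(minutes=30),
        restore_and_validate=timedelta(hours=2),
    )
    print(meets_objectives(billing, rpo=timedelta(minutes=30), rto=timedelta(hours=2)))
```

In this hypothetical, the design fits a 30-minute RPO (20 minutes of worst-case loss) but misses a 2-hour RTO (2.5 hours of worst-case downtime), which is the kind of gap that should go back to the business as either a budget request or an accepted trade-off.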
BCDR Pitfalls to Avoid
When planning BCDR in the cloud, some pitfalls to be aware of include:
- Lack of executive support: Your BCDR program strategy should have top-down support and governance. This is important to ensure you have the buy-in and budget needed, particularly to reduce financial impact if and when an incident does occur. Depending on the industry, the ability to continue business and recover may be a matter of life and death. BCDR planning varies depending on the complexity and criticality of applications and tiers. Because this is a business issue, it should not fall entirely on the agenda of the CISO. While the CISO and team are critical for success, so, too, are the CEO, CIO, chief operating officer, chief financial officer, etc.
- Failure to test and validate: It sounds simple, but too often companies go through all the work of planning but then fail to test those plans often enough. It’s especially important to continually check your application RTOs and RPOs to ensure they are acceptable. The more complex the infrastructure and recovery needs, the more this pitfall rises to the top of the list.
- Issues with data backup and replication: Some organizations fail to determine exactly which data must be replicated and miss potential points of failure in their design. Some applications require multiple clouds, and critical applications often need a different provider, or a different region within the same provider, available to fail over to when a point of failure is hit. To meet RPOs, replication must be timely, with lag kept inside the RPO window.
- Cost: Data transfer and storage costs can get expensive, but limiting the DR budget may mean RTOs and RPOs become unacceptable. Proper budget must be calculated and allocated from the beginning to ensure the amount invested in BCDR is enough to support critical RTOs and RPOs.
- Remote connectivity: Engineers, BCDR personnel and other critical employees for top-tier applications should be able to access data and applications from anywhere, even during an incident.
- Failure to assess cloud provider controls: It’s essential to assess your cloud provider’s security controls, both initially and on an ongoing basis. However, ongoing assessments demand an experienced and dedicated team for third-party risk management.
- Personnel and communication issues: Having a list of go-to personnel, including names and contact information, is essential. Ensure you know who is qualified to declare a disaster and who their backup is if they are unavailable. Also, know who is in the best position to communicate with internal stakeholders and technology vendors providing services. Be sure to include technical teams and business unit liaisons who know the ins and outs of recovery.
- Compliance and legal issues: You may need to keep data in a specific CSP region due to regulatory or privacy requirements. Using a backup region within the same country can alleviate some regulatory constraints.
- Business misalignment: Which applications and data must be available to serve the most critical services for the company? Where does the company make money and where is the greatest liability if applications and services are not available? A good BIA helps determine the applications and their dependencies that need to be available to meet business and recovery objectives. Strong asset management is a necessary component of a successful BIA process.
- Failure to consider nontechnical requirements: Successful BCDR planning and strategy involves both technology and physical requirements, and good people are key. Be sure to have cross-training, well-documented plans and as much personnel depth as possible (and reasonable) to insulate against the departure of key employees and general turnover.
- Third-party dependencies: Which parts of your supply chain could impact the business and interfere with continuity if they fail?
- Failure to regularly review the plan: Some organizations build reviews of the BCDR plan into their change management process. Whenever a new application or service is implemented, it must go through BCDR requirements. Likewise, applications and services that are decommissioned should be removed from the plan.
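The plan review described in the last item can be partly automated by comparing the current application inventory against the applications the BCDR plan covers. The following is a minimal sketch, assuming both lists can be exported as sets of application names; the function name and the sample application names are hypothetical.

```python
def reconcile_plan(inventory: set[str], plan: set[str]) -> dict[str, list[str]]:
    """Compare the live application inventory with the BCDR plan's coverage.

    Returns the applications that exist but are not in the plan, and the
    plan entries that no longer correspond to a live application.
    """
    return {
        "missing_from_plan": sorted(inventory - plan),  # new apps never onboarded
        "stale_in_plan": sorted(plan - inventory),      # decommissioned apps still listed
    }


if __name__ == "__main__":
    # Hypothetical exports from asset management and the BCDR plan.
    inventory = {"billing", "crm", "data-lake"}
    plan = {"billing", "crm", "legacy-fax"}
    for finding, apps in reconcile_plan(inventory, plan).items():
        print(f"{finding}: {apps}")
```

A check like this is only as good as the inventory behind it, which is one reason strong asset management matters to the BIA process.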
Cloud Testing KPIs and Metrics
Each business is different, so companies should decide which metrics are important to them. It is understandable that companies like to compare themselves to others, but each business should decide what it cares about most and then validate the results from testing. Some KPIs that can be used in testing exercises include:
- RTO: This is the target time to recover critical systems and applications from the point of failure. How long can an application be unavailable without causing significant business damage?
- RPO: This is the maximum acceptable data loss, expressed as a period of time. How much data can be lost during an incident (if any)?
- Mean time to recovery: What is the average amount of time to recover from a failure? This metric can help determine the efficiency of a recovery process.
- Data integrity validation: This measures whether data is recovered, accurate and ready for use.
- Business acceptance: It’s important to score each business unit’s acceptance of recovery and continuity. Does the process satisfy stakeholders? If not, repeat the BIA and determine the best recovery process, cost involved and potential impact to other systems and applications if changes are made.
- Testing success rate: This measures the rate of success when testing applications and systems.
- Employee training results: This scores the level of BCDR knowledge for technical and non-technical employees. They should know what to do and how to perform their job during recovery.
- Frequency of incidents: Downtime can point to a design problem that must be addressed. While the application may become available, the frequency of the event should trigger a review of infrastructure and configuration.
- Cost and efficiency scoring: What is the cost to operate a successful BCDR program, and are costs increasing or decreasing? What is the cost and financial impact to the business during recovery? How efficient is the recovery process? How many employees are involved?
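Several of these KPIs can be computed directly from the records of a testing exercise. The sketch below shows one possible way to derive mean time to recovery, testing success rate and RTO breaches from drill results. The `DrillResult` structure, its field names and the sample figures are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class DrillResult:
    """Hypothetical record of one recovery test for one application."""
    application: str
    rto_minutes: float
    succeeded: bool                    # recovery completed and data was validated
    recovery_minutes: Optional[float]  # measured recovery time; None if never recovered


def mean_time_to_recovery(results: list[DrillResult]) -> float:
    """Average measured recovery time across drills that actually recovered."""
    times = [r.recovery_minutes for r in results
             if r.succeeded and r.recovery_minutes is not None]
    if not times:
        raise ValueError("no successful drills to average")
    return mean(times)


def drill_success_rate(results: list[DrillResult]) -> float:
    """Fraction of drills that recovered successfully."""
    if not results:
        raise ValueError("no drill results")
    return sum(r.succeeded for r in results) / len(results)


def rto_breaches(results: list[DrillResult]) -> list[str]:
    """Applications that recovered, but took longer than their RTO."""
    return [r.application for r in results
            if r.succeeded and r.recovery_minutes is not None
            and r.recovery_minutes > r.rto_minutes]


if __name__ == "__main__":
    # Hypothetical results from a single testing exercise.
    results = [
        DrillResult("billing", rto_minutes=120, succeeded=True, recovery_minutes=90),
        DrillResult("crm", rto_minutes=240, succeeded=True, recovery_minutes=300),
        DrillResult("reports", rto_minutes=480, succeeded=False, recovery_minutes=None),
    ]
    print("MTTR (minutes):", mean_time_to_recovery(results))
    print("Success rate:", round(drill_success_rate(results), 2))
    print("RTO breaches:", rto_breaches(results))
```

Separating failed recoveries from slow ones is a deliberate choice here: a drill that never recovers and a drill that recovers late point to different problems, and averaging them together would hide both.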
BCDR Is a Program, Not a Project
When moving to the cloud, it’s important to make BCDR planning and testing a priority with business units so they know what the critical data is, where it is located and what its specific RTOs/RPOs are. The more critical data and systems there are, and the more frequent the testing, the more time IT and cybersecurity will need to plan. To ensure your BCDR plan for the cloud is successful:
- Align business units, GRC and technical teams: Don’t let this all fall on the agenda of the CISO. Communication, roles and responsibilities and business-level support are all critical to success.
- Review the current BIA: Your transition to the cloud is a good time to make improvements. It may also remind the team about historical decisions so they can revisit their relevance.
- Communicate and set clear expectations: No one likes surprises. Because there isn’t an endless budget, work within the financial constraints and ensure stakeholders know their critical RTOs and RPOs. If they are unacceptable, work with leadership to secure budget or deal with any trade-offs.
- Focus on collaboration: It is unrealistic to account for all the potential pitfalls, but technical, GRC and business stakeholders can collectively make educated decisions for the business. Focus on the fundamentals around communication, ownership, budget and treating BCDR as a process and not just a point-in-time project.