Engineering

How SMBs and startups scale on DigitalOcean Kubernetes: Best Practices Part V - Disaster Recovery

Posted: August 14, 2024 (7 min read)
This article is part of a 6-part series on DigitalOcean Kubernetes best practices.

In Part 4 of the series, we covered “scalability” best practices. We discussed knowing your DigitalOcean account limits, using the horizontal pod autoscaler and the buffer node, optimizing the application start-up time, scaling the DNS, scaling the caching and the database, validating the scalability of your application with network load testing, and adding resilience to Kubernetes API failures.

In Part 5, we focus on disaster recovery. Developers working at small and medium-sized businesses (SMBs) tend to focus on building their applications. It’s easy for them to neglect backing up their cluster and data as they build. SMBs need to be prepared for unforeseen events, as disasters can be especially critical if they run their applications in a single region to optimize their cost, for example, or without any disaster recovery planning in place. We will look at what Disaster Recovery Planning (DRP) involves, explore common challenges, and lastly, provide a comprehensive checklist of best practices to help you prepare for disaster recovery.

Disaster recovery planning

Production-ready cloud environments such as Kubernetes offer self-healing mechanisms for your workloads. Basic primitives such as ReplicaSets and Deployments, plus a good pod distribution strategy as described in Part 4 of this series, can prevent most disruptions for your applications' end users. Given enough resource capacity, it is safe to assume that a crashed workload will be restarted and that pods on a crashed node will be redistributed to healthy nodes. Such events occur day to day and are far from disastrous.

Unfortunately, we cannot rely solely on those Kubernetes mechanisms to save the day. Teams should be able to recover applications or the entire cluster after an unexpected event such as a data center outage, a hardware failure, downtime caused by human error, or even a security breach.

DRP involves planning the recovery of critical infrastructure pieces in the event of a disaster. A well-written and thoroughly tested procedure is the best way to resolve a critical incident with confidence and speed. Here are some of the requirements for a DRP:

1. Understand the backup requirements and implement the recovery strategies

Identify the namespaces/components of your cluster that are critical to business continuity and need to be included in the backup plan. It is important to find a cost-effective way to back up the cluster since it might not be necessary for you to back up all the Kubernetes objects, for example, if there is no persistent volume associated with them. SnapShooter, DigitalOcean’s backup and recovery solution, available on DigitalOcean Marketplace, allows you to back up the data on your cluster per namespace.

Identify the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of the cluster. RPO defines the amount of data loss that your cluster can tolerate, while RTO defines how long your cluster can remain unavailable. With SnapShooter, you can configure how often you want to back up your cluster and how long you want to retain your backup data, depending on how those objectives are defined.
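The relationship between these objectives and a backup schedule can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the example numbers are assumptions, not SnapShooter settings. It assumes that a backup captures the state at its start time and becomes usable only once it has finished, so the worst-case data loss is the backup interval plus the backup duration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Upper bound on lost data if a disaster hits just before a backup completes.

    Assumption: a backup captures the state at its start time and is only
    usable once it has finished.
    """
    return backup_interval + backup_duration


def meets_objectives(
    backup_interval: timedelta,
    backup_duration: timedelta,
    estimated_restore_time: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True if the schedule satisfies the RPO and the restore fits within the RTO."""
    data_loss_ok = worst_case_data_loss(backup_interval, backup_duration) <= rpo
    restore_ok = estimated_restore_time <= rto
    return data_loss_ok and restore_ok


# Hypothetical objectives: lose at most 4 hours of data, be back within 1 hour.
rpo = timedelta(hours=4)
rto = timedelta(hours=1)
duration = timedelta(minutes=30)
restore = timedelta(minutes=45)

# Backing up every 6 hours can lose up to 6h30m of data, which breaks the RPO.
every_six_hours_ok = meets_objectives(timedelta(hours=6), duration, restore, rpo, rto)

# Backing up every 3 hours can lose up to 3h30m of data, which fits.
every_three_hours_ok = meets_objectives(timedelta(hours=3), duration, restore, rpo, rto)
```

The estimated restore time is only trustworthy if it comes from an actual restore test, which is why the next requirement matters.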

2. Regularly test and verify backups

Regularly test and verify backups to ensure that they can be restored successfully. Automate the restore test, or keep a step-by-step document that describes how to run it.

3. Document the disaster recovery plan

Document how to back up and restore the cluster with clear instructions that any team member can follow.

Challenges of disaster recovery

Despite the ease of backing up and restoring cluster data with recovery solutions like SnapShooter, the disaster recovery process can remain challenging due to the distributed and dynamic nature of the Kubernetes infrastructure, which consists of multiple layers of services interconnected with each other. Disaster scenarios are often unique, with a differing set of problems and resolutions each time. Developers working at SMBs must mitigate downtime by preparing for common issues and considering the edge cases in advance.

Here are some challenges with disaster recovery:

1. Direct impact on business continuity

Many business-critical applications run on top of the Kubernetes infrastructure, and any failure can lead to an immediate disruption in their services. Data breaches or losses, as well as extended downtime, can lead to customer churn and potentially to legal consequences.

2. Complexities in recovery due to edge cases

Disaster recovery is sometimes not as simple as deleting and restarting the application. You need to preserve the load balancer and its associated DNS records. Otherwise, you lose the external IP and must recreate the DNS mapping, which can take time to propagate to users and lead to extended downtime. If you use many certificates for various domains with an issuer like Let's Encrypt, you may hit certificate rate limits while recovering your cluster. Secrets that are generated and stored only inside the cluster can be lost along with it during a disaster. They then need to be regenerated manually, which can also extend downtime.

3. Difficulty reproducing, testing, and validating disaster scenarios

The root cause of a disaster is not always obvious. This makes disaster scenarios difficult to simulate and the disaster recovery process difficult to automate.

Kubernetes Best Practices

This section describes a list of practices that could help your organization prevent or recover more quickly from disasters.

Checklist: Use GitOps and secret manager

Instead of manually applying the manifests, you can store the cluster's state in a Git repository as a source of truth and reconcile the cluster state from there. In the GitOps model, a GitOps controller runs on the cluster and is responsible for synchronizing the state of the cluster with the specified Git location. After a critical failure, the controller applies the latest manifests found in the trunk, which returns the cluster to its pre-failure state.
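The reconciliation idea can be illustrated with a small sketch. This is not how any particular GitOps controller is implemented; it is a simplified model, assuming that both the Git repository and the cluster can be reduced to dictionaries keyed by a resource identifier. The function name and the sample resources are hypothetical.

```python
def plan_reconcile(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Compare the desired state (from Git) with the actual cluster state.

    Returns the actions needed to make the cluster match Git, in the order
    the desired resources are declared, followed by deletions.
    """
    actions = []
    for key, manifest in desired.items():
        if key not in actual:
            actions.append(("create", key))
        elif actual[key] != manifest:
            actions.append(("update", key))
    for key in actual:
        if key not in desired:
            actions.append(("delete", key))
    return actions


# Desired state as it would be stored in Git (hypothetical resources).
desired = {
    "Namespace/shop": {"labels": {"team": "web"}},
    "Deployment/shop/web": {"image": "web:1.4.2", "replicas": 3},
    "Service/shop/web": {"port": 80},
}

# After a disaster the rebuilt cluster is empty, so everything is recreated.
recovery_plan = plan_reconcile(desired, {})

# During normal operation the same function detects drift.
drifted = {
    "Namespace/shop": {"labels": {"team": "web"}},
    "Deployment/shop/web": {"image": "web:1.4.2", "replicas": 1},
    "Job/shop/debug": {},
}
drift_plan = plan_reconcile(desired, drifted)
```

Because recovery and day-to-day drift correction use the same code path, the recovery procedure is exercised continuously rather than only during an incident.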

Keep in mind that secrets shouldn't be exposed outside the cluster, particularly in a Git repository. This creates a challenge for the GitOps model, where the manifests are kept in the Git repository. A sealed secrets controller, which lets you store encrypted secrets outside the cluster, is popular in the GitOps world, but it is also vulnerable to disasters: the key needed to decrypt the secrets lives in the cluster and can be lost along with it.

We recommend using a secret manager, such as Vault or 1Password, to keep secrets out of the cluster. To access the secrets from the secret manager, you can set up an external secrets operator on your cluster. This also helps ensure that the secrets are not lost during a disaster and can be easily fetched once the cluster is restored.
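The principle is that Git holds only references to secrets, while the values stay in the secret manager and are fetched when the cluster is (re)built. The sketch below models that separation in plain Python. It does not use the API of Vault, 1Password, or any external secrets operator; the function name, paths, and the in-memory store are hypothetical stand-ins.

```python
from typing import Callable


def resolve_secret_refs(refs: dict[str, str], fetch: Callable[[str], str]) -> dict[str, str]:
    """Turn {secret_key: path_in_secret_manager} into {secret_key: value}.

    Only `refs` would live in Git; the values never leave the external store
    until they are fetched into the cluster.
    """
    return {key: fetch(path) for key, path in refs.items()}


# Stand-in for a real secret manager client (hypothetical data).
fake_store = {"prod/db/password": "s3cr3t", "prod/api/token": "t0k3n"}

# What the Git repository holds: references, never values.
refs_in_git = {"DB_PASSWORD": "prod/db/password", "API_TOKEN": "prod/api/token"}

# After a restore, the same references resolve to the same values again.
resolved = resolve_secret_refs(refs_in_git, fake_store.__getitem__)
```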

Checklist: Keep state outside of the clusters

When an application that uses persistent volumes (PVs) crashes, or its disk data is corrupted by a hardware failure, restoring it requires the last working copy of the application's volume data. Without that copy, the data loss is irreversible and the newly created application starts in an empty state.

Kubernetes offers backup and restore mechanisms based on etcd and VolumeSnapshot, but the process can be simplified with a backup and recovery solution like SnapShooter.

By keeping the state outside of the cluster, you help ensure that it is not lost along with the cluster in a disaster. It can be restored to the cluster promptly and migrated to another cluster if necessary.

Checklist: Schedule backups

SnapShooter allows you to configure the backup schedule, such as how frequently you want to back up your cluster data. You can also configure the retention policy to set how long you want to retain the backup data. Scheduled backups help ensure that the backups are up to date and that a restore recovers recent data.
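To see how a schedule and a retention policy interact, the following sketch splits a list of backup timestamps into those kept and those expired. It is a generic illustration, not SnapShooter's implementation; the function name and the dates are assumptions.

```python
from datetime import datetime, timedelta


def split_by_retention(
    backup_times: list[datetime], retention: timedelta, now: datetime
) -> tuple[list[datetime], list[datetime]]:
    """Return (kept, expired) backups, each sorted oldest first.

    A backup is kept if it is no older than the retention period.
    """
    cutoff = now - retention
    kept = sorted(t for t in backup_times if t >= cutoff)
    expired = sorted(t for t in backup_times if t < cutoff)
    return kept, expired


# Hypothetical schedule: one backup per day for the last ten days.
now = datetime(2024, 8, 14)
daily_backups = [now - timedelta(days=i) for i in range(10)]

# With a 7-day retention policy, anything older than August 7 is expired.
kept, expired = split_by_retention(daily_backups, timedelta(days=7), now)
oldest_restorable = kept[0]
```

The oldest kept backup bounds how far back you can restore, which matters when corruption or a breach is discovered days after it happened.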

Testing and validating your backup recovery procedures is important to confirm that the backup data is consistent and reliable. Restore backups in a new cluster and validate that the restored resources, including but not limited to the configmaps, secrets, and PVs, match the backup source.
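One way to automate that comparison is to fingerprint each resource in the source and in the restored cluster and report the differences. The sketch below assumes that the resources have already been exported as dictionaries keyed by an identifier; the function names and sample data are hypothetical. Server-populated metadata fields are dropped before hashing because they legitimately differ between clusters. Volume contents would need a separate check, for example file checksums.

```python
import hashlib
import json

# Metadata fields set by the API server; they differ between clusters.
SERVER_SET_METADATA = ("uid", "resourceVersion", "creationTimestamp", "generation", "managedFields")


def normalize(resource: dict) -> dict:
    """Copy a resource without server-populated metadata fields."""
    cleaned = dict(resource)
    cleaned["metadata"] = {
        k: v for k, v in resource.get("metadata", {}).items() if k not in SERVER_SET_METADATA
    }
    return cleaned


def fingerprint(resource: dict) -> str:
    """Stable SHA-256 hash of the normalized resource."""
    canonical = json.dumps(normalize(resource), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_restore(source: dict[str, dict], restored: dict[str, dict]) -> dict[str, list[str]]:
    """Report resources that are missing, unexpected, or changed after a restore."""
    return {
        "missing": sorted(k for k in source if k not in restored),
        "unexpected": sorted(k for k in restored if k not in source),
        "changed": sorted(
            k for k in source if k in restored and fingerprint(source[k]) != fingerprint(restored[k])
        ),
    }


source = {
    "ConfigMap/shop/settings": {"metadata": {"name": "settings", "uid": "aaa"}, "data": {"MODE": "prod"}},
    "Secret/shop/db": {"metadata": {"name": "db", "uid": "bbb"}, "data": {"password": "cGFzcw=="}},
}
restored = {
    # Same content, new uid: counts as a match.
    "ConfigMap/shop/settings": {"metadata": {"name": "settings", "uid": "ccc"}, "data": {"MODE": "prod"}},
}
report = compare_restore(source, restored)
```

An empty report for all three categories is a reasonable pass criterion for an automated restore test.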

Checklist: Prefer high availability setups

DigitalOcean Kubernetes provides a high availability (HA) option that increases uptime and comes with a 99.95% uptime SLA for the control plane. The default control plane runs a single replica of each component, so some downtime occurs during unexpected failures while components are restarted. If you enable high availability for a cluster, multiple replicas of each control plane component are created, helping to ensure that a redundant replica is available when a failure occurs.

To enable HA while creating your cluster, select the Add high availability checkbox under the Get extra reliability for critical workloads section. You can also update an existing cluster to enable high availability on the Control Panel or with a doctl command such as doctl kubernetes cluster update example-cluster --ha.

Checklist: Place guardrails

Consider adding an admission webhook to prevent misconfiguration of Kubernetes resources. Kubernetes allows you to define two types of admission webhooks: the validating admission webhook and the mutating admission webhook. To write your own webhook server, you can refer to the admission webhook server implementation that is validated in a Kubernetes e2e test. A webhook adds a layer of validation that helps protect your cluster from unintended changes.
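The core of a validating webhook is a function that receives an AdmissionReview request and answers whether the change is allowed. The sketch below shows only that decision logic, using the `admission.k8s.io/v1` request and response shape. The policy itself (rejecting Deployments with fewer than two replicas) and the names `MIN_REPLICAS` and `review_deployment` are hypothetical. A real webhook also needs an HTTPS server and a ValidatingWebhookConfiguration, which are omitted here.

```python
MIN_REPLICAS = 2  # hypothetical policy threshold


def review_deployment(admission_review: dict) -> dict:
    """Validate a Deployment create/update and build the AdmissionReview response."""
    request = admission_review["request"]
    allowed, message = True, ""

    is_deployment = request["kind"]["kind"] == "Deployment"
    if is_deployment and request["operation"] in ("CREATE", "UPDATE"):
        replicas = request["object"]["spec"].get("replicas", 1)
        if replicas < MIN_REPLICAS:
            allowed = False
            message = f"Deployment must run at least {MIN_REPLICAS} replicas, got {replicas}"

    # The response must echo the uid of the request it answers.
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"code": 403, "message": message}

    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


# A request as the API server would send it (trimmed to the fields used above).
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
        "kind": {"group": "apps", "version": "v1", "kind": "Deployment"},
        "operation": "CREATE",
        "object": {"metadata": {"name": "web"}, "spec": {"replicas": 1}},
    },
}
verdict = review_deployment(incoming)
```

Keeping the decision in a pure function like this makes the policy easy to unit test before it is wired into a server.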

You can also grant permissions to each user through role-based access control (RBAC), or manage permissions with service accounts. Both techniques involve defining a cluster role with the relevant permissions on different resources and binding the user or the service account to it with a cluster role binding. DigitalOcean is adding support for the modifier and the resource viewer roles along with other standard roles in 2024, which you can configure for each team member on the Cloud Control Panel. Instead of giving everyone full access, RBAC allows only the relevant people to modify or delete resources.
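As an illustration of the cluster role and cluster role binding described above, the sketch below builds both objects as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts. This is plain Kubernetes RBAC, not DigitalOcean's team roles. The role name, binding name, and user are hypothetical.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def viewer_cluster_role(name: str, api_groups: list[str], resources: list[str]) -> dict:
    """ClusterRole that grants read-only verbs on the given resources."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {"apiGroups": api_groups, "resources": resources, "verbs": ["get", "list", "watch"]}
        ],
    }


def bind_user(binding_name: str, role_name: str, user: str) -> dict:
    """ClusterRoleBinding that attaches a ClusterRole to a single user."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": binding_name},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_API_GROUP}],
        "roleRef": {"apiGroup": RBAC_API_GROUP, "kind": "ClusterRole", "name": role_name},
    }


# "" is the core API group (pods, configmaps); "apps" covers deployments.
role = viewer_cluster_role("workload-viewer", ["", "apps"], ["pods", "configmaps", "deployments"])
binding = bind_user("workload-viewer-alice", "workload-viewer", "alice@example.com")

print(json.dumps(role, indent=2))
print(json.dumps(binding, indent=2))
```

Because the role contains no `delete`, `update`, or `patch` verbs, a user bound only to it cannot modify or remove resources by accident.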

Next steps

In the final part of our Kubernetes adoption journey series, we will delve into securing your Kubernetes environment, covering best practices for network policies, access controls, and securing application workloads. Enhancing your infrastructure’s security is the last crucial part of navigating the complexities of Kubernetes. Stay tuned for insights to help empower your Kubernetes journey.

Ready to embark on a transformative journey and get the most from Kubernetes on DigitalOcean? Sign up for DigitalOcean Kubernetes here.
