Data is the new oil. We rely on it not only to make decisions but to operate as a business in general. Data loss can lead to significant financial consequences and a damaged reputation. As Werner Vogels says: “Everything fails all the time.” In this article, you can find ten actionable methods to protect your most valuable resource.

1. Backup, backup, backup

This goes without saying, and we all know it: we need a backup strategy and an automated way of regularly taking snapshots of our databases. However, with today’s large amounts of data, implementing a reliable backup plan becomes challenging. Therefore, it is crucial to develop a strategy and implement a solution that can quickly recover your databases and satisfy your Business Continuity Plan.

RPO vs. RTO: Recovery Point Objective and Recovery Time Objective

Recovery Point Objective (RPO) describes how many hours of data loss we can tolerate. An RPO of 10 would entail that your business can afford no more than 10 hours of data loss according to your Business Continuity Plan. You could think of RPO in terms of the “staleness” of your backup, plus the recovery time. With RPO = 10, we allow our data to be 10 hours stale after restoration, i.e., not containing changes made within the last 10 hours.

In contrast, Recovery Time Objective (RTO) describes within which time the database must be up again. An RTO of 3 would mean that, regardless of the backup’s freshness, the database must be up and running within 3 hours after the downtime occurred.

2. Test your recovery scenario

Probably the worst-case scenario is that you developed a backup strategy and you are regularly taking snapshots, but when a failure happens, you notice that those backups aren’t working as intended, or that you can’t find them at all. It’s critical to test the recovery scenario. Netflix pioneered “Chaos Engineering”, a discipline of testing failure scenarios on production systems to be sure that your infrastructure is truly resilient.
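To make the RPO idea concrete, here is a minimal sketch in Python. The function name, the timestamps, and the 10-hour target are purely illustrative, not part of any backup tool or AWS API:

```python
from datetime import datetime, timedelta, timezone

def meets_rpo(last_snapshot_time: datetime, rpo_hours: int, now: datetime) -> bool:
    """Return True if the newest backup is fresh enough: restoring it
    would lose at most `rpo_hours` hours of changes."""
    staleness = now - last_snapshot_time
    return staleness <= timedelta(hours=rpo_hours)

# Example: with RPO = 10, a 6-hour-old snapshot is acceptable.
now = datetime(2021, 1, 2, 12, 0, tzinfo=timezone.utc)
snapshot_taken = datetime(2021, 1, 2, 6, 0, tzinfo=timezone.utc)
print(meets_rpo(snapshot_taken, rpo_hours=10, now=now))  # True
```

A check like this could run on a schedule and page someone whenever the newest usable snapshot drifts past the RPO target.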
Otherwise, you risk ending up with a “cross your fingers and hope for the best” strategy. Don’t count on backups and recovery plans that have never been tested.

Note that if you rely on backups taken by some fully managed service where you don’t have actual access to the snapshots, you risk that restoring your database takes longer than your RTO and RPO strategy allows. Due to time-zone differences and a large volume of data that may need to be transferred over a long distance, the recovery may take longer than you expect. Therefore, it might help to take regular snapshots yourself rather than relying solely on backups from a specific provider.

3. Document processes that rely on that data(base)

If your database goes down, which processes are affected? It’s valuable to have this information documented somewhere so that, when a failure occurs, you can recover quickly by restarting the corresponding processes and mitigating the impact of the downtime.

4. Apply the least-privilege security principle

We all want to trust people, but allowing developers too much access without educating them on how to use production resources may backfire. Only a few trusted people (likely DevOps or senior engineers) should have direct access to modify or terminate production resources. When building any IT solution, it’s best to work on a development database and have read-only permissions to production resources.

On top of that, it’s advisable to check those permissions regularly. If you haven’t done so in a while, take this as a sign. Perhaps somebody who has left the company still has access to production resources?

5. Name your production database as such

What if your production database is not named as a “prod” resource and somebody confuses it for something else?
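One lightweight way to keep the process documentation from point 3 executable is a simple dependency map. The database and process names below are invented for illustration:

```python
# Hypothetical mapping from each data store to the processes that depend on it.
DEPENDENCIES = {
    "orders-db": ["checkout-service", "invoicing-job", "daily-sales-report"],
    "users-db": ["auth-service", "crm-sync"],
}

def affected_processes(failed_database: str) -> list:
    """List the processes to check and restart once the given database is restored."""
    return sorted(DEPENDENCIES.get(failed_database, []))

print(affected_processes("orders-db"))
# ['checkout-service', 'daily-sales-report', 'invoicing-job']
```

Keeping such a map in version control next to your runbooks means that, during an incident, nobody has to reconstruct from memory which jobs to restart.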
It’s best practice to ensure that production resources are named properly so that, just by looking at the name, people know this is a resource that must be treated with great care. It may seem obvious to you, but without proper communication and education, somebody could confuse a poorly named production database for some temporary resource (for instance, a playground cluster) that can be shut down.

6. Don’t trust any manually configured resources

If your resources are configured manually, it becomes more difficult to reproduce the configuration in a failure scenario. Modern DevOps and GitOps culture introduced the highly useful paradigm of Infrastructure as Code, which can significantly help to build an exact copy of a specific resource for development or recovery scenarios.

7. Don’t allow a single person to manage the entire infrastructure

It can be challenging to recover any specific system if the only person who knows how to configure and use it is not available when the failure happens. Knowledge silos are particularly dangerous in such cases. It’s beneficial to have at least one additional person who can take over this responsibility. Often, even a time-zone difference between employees can significantly contribute to fixing production downtime faster and, therefore, to meeting your RTO.

8. Educate your employees about any resource before giving them access to it

This point is related to preventing knowledge silos but is more directed towards educating developers. Anytime we give somebody more than just read-only access to production resources, we should educate them on using this resource properly and on what impact a potential downtime of a single table may have. As always, effective communication is our best friend.

9. Use serverless & monitor your resources

Using data stores such as AWS RDS is great, but it has the downside that, in the end, we are still responsible for ensuring that our database remains healthy.
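A naming convention is most useful when tooling enforces it. Here is a toy guard your scripts could run before any destructive operation; the “prod” name segment and both helper functions are assumptions for the sake of the example:

```python
def is_production_name(resource_name: str) -> bool:
    """Heuristic: treat any resource whose name contains a 'prod' segment
    (e.g. 'prod-orders-db') as production."""
    return "prod" in resource_name.lower().split("-")

def confirm_destroy(resource_name: str) -> str:
    """Refuse to destroy anything that looks like production without extra approval."""
    if is_production_name(resource_name):
        return "blocked: production resource, manual approval required"
    return "ok to destroy"

print(confirm_destroy("prod-orders-db"))
print(confirm_destroy("playground-cluster"))
```

Even a crude check like this turns the naming convention from a social agreement into a safety net.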
When using serverless data stores such as DynamoDB, we can rely on AWS DevOps experts to monitor and keep the underlying servers healthy. This is one of the easiest ways of ensuring that your data store remains healthy and resilient, because AWS takes care of the serverless compute and storage behind the service, ensuring High Availability and Fault Tolerance.

If you leverage an observability platform, you can quickly identify misconfigured resources or failures within your serverless infrastructure. Some platforms continuously scan your resources for anomalies. For instance, Dashbird will alert you about any DynamoDB table that doesn’t have continuous backup and Point-in-Time Recovery enabled. More generally, Dashbird will alert you whenever your architecture deviates from standards defined within the Well-Architected Framework, such as when your resources are not properly configured or lack backups.

In the image below, you can see that Dashbird automatically detected that backup is not enabled:

In addition to recovery information, you can discover many more insights about your serverless resources, as demonstrated in the image below. For instance, you will be informed any time your real-time data streams have write throttles. In the end, you are presented with a score of how well your architecture adheres to the Well-Architected Framework.

Well architected lens (Image: courtesy of Dashbird)

And if the only reason that holds you back from using DynamoDB is that you still want to use SQL, you may have a look at PartiQL. This query language, developed by AWS, allows you to query your DynamoDB tables (and many other data stores) directly from the AWS Management Console, as demonstrated in the image below.

10. Separate your storage from compute if possible

This point is related to analytical databases. It’s a good practice in analytical data stores to keep your compute and storage independent of each other.
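You can also run the Point-in-Time Recovery check yourself. The dictionary below mirrors the response shape of DynamoDB’s `DescribeContinuousBackups` API (in `boto3`: `client.describe_continuous_backups(TableName=...)`); the `pitr_enabled` helper is a hypothetical sketch that inspects such a response:

```python
def pitr_enabled(response: dict) -> bool:
    """Check a DescribeContinuousBackups response for Point-in-Time Recovery."""
    description = response.get("ContinuousBackupsDescription", {})
    pitr = description.get("PointInTimeRecoveryDescription", {})
    return pitr.get("PointInTimeRecoveryStatus") == "ENABLED"

# Sample response shape, per the DynamoDB API reference:
sample = {
    "ContinuousBackupsDescription": {
        "ContinuousBackupsStatus": "ENABLED",
        "PointInTimeRecoveryDescription": {"PointInTimeRecoveryStatus": "DISABLED"},
    }
}
print(pitr_enabled(sample))  # False
```

Looping this check over `list_tables` output gives you a poor man’s version of the continuous scanning that observability platforms provide out of the box.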
Imagine that your data is durably stored in object storage such as S3, and you can query it with a serverless engine such as AWS Athena or Presto. This separation of how your data is stored from how it’s queried makes it easier to ensure the resilience of your analytical infrastructure.

You can establish automatic replication between S3 buckets, enable versioning (allowing you to restore deleted objects), or even prevent anyone from overwriting or deleting anything from S3 by leveraging Object Lock. Then, even if your Athena table definition is deleted, your data persists and can easily be queried again after defining the schema in AWS Glue.

I’m a big fan of storing raw extracted data for ETL purposes in object storage before loading it into any database. This allows using it as a staging area or data lake and allows for more resiliency in analytical pipelines.

Relational database connections are fragile. Imagine that you are loading large amounts of data from some source system directly into a data warehouse. Then, shortly before the ETL job would be finished, it fails because the connection was forcibly closed by the remote host due to some network issue. Having to redo the extraction step can introduce an additional burden on the source system, or may even be impossible due to API request limits.

Conclusion

In this article, we examined ten ways to protect your mission-critical data store. These days, data is such a critical resource that downtime can cause significant financial and reputation losses. Make sure to approach it strategically and test your recovery scenario.

Further reading:

DynamoDB vs Mongo Atlas

DynamoDB continuous backups and Point-in-Time Recovery disabled

Why Serverless apps fail and how to design resilient architectures

Previously published at https://dashbird.io/blog/10-ways-protect-mission-critical-database/
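As a final illustration of the staging-area idea from point 10: if each extracted chunk lands in object storage first, a restarted load can resume from the staged objects instead of re-extracting from the source system. The object key names and the helper below are invented for the sketch:

```python
def keys_to_load(staged_keys, loaded_keys):
    """Return the staged object keys that still need to be loaded into the
    warehouse, so a restarted job skips the chunks that already succeeded."""
    return sorted(set(staged_keys) - set(loaded_keys))

staged = [
    "raw/2021-01-01/part-0.json",
    "raw/2021-01-01/part-1.json",
    "raw/2021-01-01/part-2.json",
]
already_loaded = ["raw/2021-01-01/part-0.json"]
print(keys_to_load(staged, already_loaded))
# ['raw/2021-01-01/part-1.json', 'raw/2021-01-01/part-2.json']
```

Because the extraction and load steps are decoupled by the staging area, a dropped warehouse connection costs you a retry of the load only, not another round of API calls against the source system.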