High availability is premature optimization
Conventional wisdom these days, especially for startups, is to design your software architecture to be horizontally scalable and highly available. Best practices involve Kubernetes, microservices, multiple availability zones at a hyperscaler, zero-downtime deployments, database sharding and replication, with possibly a bit of serverless thrown in for good measure.
While scalability and high availability are certainly required for some systems, these solutions are often recommended as a panacea. Don't you want to be ready for that hockey-stick growth? Do you want to lose millions in sales when your site is down for a few hours? Nobody got fired for choosing Kubernetes¹, so why take the chance?
In the ensuing flame wars on Hacker News and elsewhere, this camp and the opposing You Ain't Gonna Need It (YAGNI) folks talk past each other, exchange anecdata, and blame resume-padding architecture astronauts for building it wrong.
A hidden assumption in many of these discussions is that higher availability is always better. Is it, though?
Everything else being equal, a highly available system is better than one with poor reliability, in the same way that having fewer bugs in your code is better than having more. Everything else is not equal, though.
Approaches to increasing code quality, like PR reviews (maybe from multiple people) and high test coverage (maybe with a combination of statement-level and branch-level unit tests, integration tests, functional tests and manual testing), increase cost in proportion to the effort put in. A balance is reached when the expected cost of a bug (in terms of money, stress, damage, etc.) equals the additional cost incurred trying to avoid it.
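To make that balance concrete, here's a back-of-the-envelope sketch. The effort levels, their yearly costs and the expected yearly bug losses below are made-up numbers for illustration only; the point is simply that you keep adding quality practices while each one removes more expected loss than it costs:

```python
# Back-of-the-envelope: invest in quality only while the next increment of
# effort removes more expected bug cost than it adds in effort cost.
# All numbers are illustrative assumptions, not real data.

effort_levels = [
    # (practice, yearly cost of the practice, expected yearly bug loss)
    ("no reviews, no tests",              0,       100_000),
    ("PR reviews",                        20_000,   60_000),
    ("+ unit tests",                      45_000,   30_000),
    ("+ integration & functional tests",  80_000,   25_000),
    ("+ manual QA on every change",       140_000,  22_000),
]

# The balance point is the level with the lowest total (effort + expected loss).
best = min(effort_levels, key=lambda lvl: lvl[1] + lvl[2])

for level in effort_levels:
    name, effort_cost, expected_loss = level
    total = effort_cost + expected_loss
    marker = "  <-- balance point" if level is best else ""
    print(f"{name:35s} total yearly cost: {total:>9,}{marker}")
```

The nice property of this kind of investment is that it's incremental: you can dial the effort up or down as you learn more.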
You can't add just a bit of Kubernetes, though. Decisions about horizontal scalability and high availability influence the entire application architecture (whichever way you choose) and are hard to change later. The additional architecture and ops complexity, as well as the additional platform cost to support it, goes up much more easily than it comes down.
Faced with this dilemma, it pays to first understand how much availability we really need, and how quickly we would need to scale up if demand grows. This is going to be specific to each project, so let me share two examples from my personal experience:
At my previous startup, AWW, our product was a shared online whiteboard - share a link and draw together. It was used by millions of people worldwide, across all time zones, and the real-time nature meant it had to be up pretty much all the time to be usable. If you have a meeting or a tutoring session at 2PM and are using AWW, it better be working at that time! One stressful episode involved scheduling downtime for early Sunday morning, European time, and getting angry emails from paying customers in India who couldn't tutor their students on the Saturday evening.
Clearly, for AWW, the higher the availability, the better. During COVID we also experienced the proverbial hockey-stick growth: servers were constantly "on fire" and most of the tech work consisted of keeping up with demand. A lot of complexity was introduced, and a lot of time and money was spent on having a system that was as reliable as it could be, and that we could scale.
On the other hand, at API Bakery, the product is a software scaffolding tool - describe your project, click a few buttons to configure it, get the source code, and you're off to the races. It's a low-engagement product with very flexible time constraints. If it's down, no biggie, you can always retry a bit later. It's also not such a high-volume product that we'd lose a bunch of sales if it were down for a few hours. Finally, it's not likely to start growing so fast that it couldn't be scaled up the traditional way (buy a few bigger servers) in a reasonable time frame (days). It would be foolish to spend anywhere near as much effort or money on making it scale as we did at AWW.
When thinking about the high availability and scalability needs of a system, I look at three questions (written here with example answers to better illustrate the thought process):
- How much trouble would you be in if something bad happened:
  - low - nobody would notice
  - minor - mild annoyance to someone, they'd have to retry later; small revenue loss
  - major - pretty annoying to a lot of your users, they're likely to complain or ask for a refund; significant revenue loss
  - critical - everything's on fire, you can't even deal with the torrent of questions or complaints, incurring significant revenue and reputation loss
  - catastrophic - you're fired, your company goes under, or both
- How often are you prepared to experience these events:
  - low - daily or weekly
  - minor - once per month
  - major - once or twice per year at most
  - critical - hopefully never?
  - catastrophic - definitely never!
- How much downtime does each severity correspond to:
  - low - 30s/day (AWW - we had auto-recovery, so this was mostly invisible to users), 5min/day (API Bakery)
  - minor - 5min/day (AWW), 1h/day (API Bakery)
  - major - 1h/day (AWW), several hours of outage (API Bakery)
  - critical - 4h+/day (AWW), several days of outage (API Bakery)
  - catastrophic - 2+ days (AWW), a few weeks of outage (API Bakery)
These are example answers to give you intuition about thinking in terms of expected cost. In this case, it's obvious that the availability and scalability needs of AWW and API Bakery are wildly different.
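One way to sanity-check answers like these is to translate the daily downtime budgets into the availability percentages that SLAs are usually expressed in. Here's a quick sketch, using downtime budgets that roughly mirror the AWW-style example answers above, purely as an illustration:

```python
# Translate a daily downtime budget into the "number of nines" figure
# that SLAs and hosting providers usually quote.
# The budgets below are illustrative, not targets.

SECONDS_PER_DAY = 24 * 60 * 60

budgets_seconds = {
    "low (30s/day)":     30,
    "minor (5min/day)":  5 * 60,
    "major (1h/day)":    60 * 60,
    "critical (4h/day)": 4 * 60 * 60,
}

for label, downtime in budgets_seconds.items():
    availability = 100 * (1 - downtime / SECONDS_PER_DAY)
    print(f"{label:22s} -> {availability:.3f}% availability")
```

Even a 30-second daily budget comes out to about 99.965% availability - a useful reality check before promising anyone 99.99%.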
Quantifying the cost of implementing (or not implementing) an architecture or infrastructure decision is harder, and also depends on the experience and skill set of the people involved. For me, it's much easier to whip up a VPS with a Django app, a PostgreSQL database, a Caddy web server and auto-backups than it is to muck around with Helmfiles, configure K8s ingress and get the autoscaling to work, but I know there are people who feel exactly the opposite.
When quantifying the cost, I think about:
- is this something we already know how to do?
- if not, does it make sense to try and learn it (appreciating the fact that we will certainly do a substandard job while we're learning)?
- can we engage outside experts, and will we be dependent on them if we do?
- what are the infrastructure costs, and how easy is it to scale them up or down?
- how will the added complexity impact the ongoing development, growth and maintenance/ops of the system?
- how far can we push current/planned architecture and what would changing the approach entail?
We might not be able to get perfect answers to all of these questions, but we will be better informed and can base the decision on our specific situation, not on cargo-culted "best practices" invented or promoted by organizations in a wildly different position.
¹ I'm not hating on Kubernetes or containers in general here. Those are just currently the most common solutions people recommend for failover and scaling.