
Building Stable Systems

2016-11-12

It is impossible to build a system that is 100% stable, available and bug-free.
Humans will make mistakes. Machines will break. The internet will go offline.
Sometimes you will break things that are under your control, and other times, things outside of your control will break.

The point is, things will break and you can't avoid it, so instead of denying it, you should design your systems with this in mind and make sure you build stability in.

When a failure occurs, some components of the system start failing before everything else does. The original trigger and the way it spreads to the rest of the system is called a "failure mode".
You can build crack-stoppers to protect your system and keep those failures from propagating.

You need to decouple the different parts of the system as much as possible, to reduce dependencies and keep a trigger in one part from spreading across the whole system.

So the first question is, what can happen anyway? A lot, but here is an overview of the main things that you should keep in mind when designing and building software, from MVP to large scale systems.

Definitions

Impulses: A rapid shock to the system

Stress: A long strain on the system

Stability: Being able to keep processing requests despite impulses and stress

Failure Mode: The way (or 'mode') in which something might fail

Stability anti-patterns, or WTF can happen?

Today, systems are bigger than they used to be, which means new and bigger challenges. Tightly coupled systems are the rule.

Their size and complexity push us toward the "technology frontier" where moving cracks rapidly turn into full-blown failures.

Integration points

Software is a set of systems integrated together: the web frontend, the web server, the CMS, the database they connect to, etc.

Integration points are the number one killer of systems. Every socket, process, pipe or remote procedure call will eventually refuse connections, hang or disconnect.

This is especially true in service-oriented architectures, where you have exponentially more services talking to each other.

Chain reactions

As you scale horizontally, you end up with multiple servers, doing the same thing and sharing the load, behind a load balancer.

If the connection between the load balancer and a server breaks, or if a server fails because of some load related failure, the remaining servers need to handle the additional load.

With each server that breaks, the remaining servers are more likely to also go down.
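
To put numbers on it: with four servers behind the load balancer, each handles roughly 25% of the traffic. If one dies, the remaining three each absorb about 33%, a one-third jump in load, and every further failure raises the per-server load even faster, until the last survivor is carrying everything on its own.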

Cascading failures

Service-oriented architectures comprise a collection of services that are interconnected and form layers, or nodes in a directed graph.

Failures start with a crack. A crack comes from a fundamental problem in one of the layers. A cascading failure happens when a crack in one layer propagates to another layer, and eventually brings the whole system down.

Just as Integration Points are the number one source of cracks, cascading failures are the number one crack accelerator.

Blocked Threads

The majority of system failures do not involve outright crashes. Those are pretty easy to debug and fix.

Usually you see a process running but doing nothing, because every thread is blocked waiting on some operation that never ends or a response that never comes.

Blocked threads can happen anytime you check resources out of a connection pool, deal with cache or make calls to external systems.

Attacks of self-denial

Those happen when the system conspires against itself.

For example, the stress caused by a big marketing campaign bringing a surge of traffic to an e-commerce website.

Unbounded result sets

Those happen when you play with sets of data that are bigger than you expect. Querying all rows from a database can return an unbounded number of items, which will slow your processing quite a bit. Sometimes the amount of data becomes so big that it won't even fit in memory, and breaks your system. Always keep that in mind and make sure you set limits when querying data.
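
As a minimal sketch of bounding a query, here is a paginated read using Python's built-in sqlite3 (the `users` table and the `handle` function are placeholders for your own schema and processing):

```python
import sqlite3

PAGE_SIZE = 500  # upper bound on rows held in memory at once

def process_all_users(db_path):
    conn = sqlite3.connect(db_path)
    offset = 0
    while True:
        # Fetch at most PAGE_SIZE rows per round trip, never the whole table.
        rows = conn.execute(
            "SELECT id, email FROM users ORDER BY id LIMIT ? OFFSET ?",
            (PAGE_SIZE, offset),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            handle(row)  # hypothetical per-row processing
        offset += PAGE_SIZE
    conn.close()
```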

Slow responses

Slow responses generate cascading failures because each process left hanging is blocking a thread. For client-facing assets, such as a website, they also cause a surge in traffic because visitors will likely spam the refresh button if the request is too slow.

SLA Inversion

Every single dependency of your system is an integration point that can break. This means that your SLA can only be as good as the combined SLA of your dependencies. If you have two dependencies that each offer a 99.99% availability SLA, you can't offer more than 99.98% availability.
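
The arithmetic: availabilities multiply, so two dependencies at 99.99% leave you with at best 0.9999 × 0.9999 ≈ 99.98%, and ten such dependencies already cap you at roughly 0.9999^10 ≈ 99.90%, before counting any downtime of your own.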

Stability Patterns, or How to ensure resiliency?

Timeouts

It is essential to have a timeout on anything that can block a thread while waiting on a resource: TCP connections, database queries, etc.

Timeouts can often be coupled with retries, but it's not always a good decision. Make sure to add retries only when it makes sense. Too many retries will also make threads hang longer and clients wait more.
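
A minimal sketch in Python using the standard library, with an explicit timeout and a small, bounded number of retries (the URL, timeout and retry counts are arbitrary placeholders):

```python
import urllib.request

def fetch_with_timeout(url, timeout=2.0, retries=1):
    """Fetch a URL, never blocking more than `timeout` seconds per attempt."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:  # URLError, timeouts and resets all derive from OSError
            last_error = exc    # don't hang the thread; retry or give up
    raise last_error

# Worst case, the caller waits (retries + 1) * timeout seconds instead of forever.
```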

Circuit breaker

A circuit breaker is a wrapper that short-circuits calls when a system is not healthy. It's the opposite of a retry, since it prevents additional calls rather than executing them.

Once the number (or frequency) of failures reaches a threshold, the circuit breaker "opens" and fails all subsequent calls.

It's a very efficient way to automatically degrade functionality when a system is under stress.
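
Here is a minimal, single-threaded sketch of the idea in Python (the failure threshold and reset timeout are arbitrary; a production implementation would also need thread safety and metrics):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors, reject calls while open,
    and let one probe call through after `reset_timeout` seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail immediately instead of calling out.
                raise RuntimeError("circuit open, call rejected")
            # Half-open: allow one call through to probe the dependency.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```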

Bulkheads

A bulkhead is an upright wall within the hull of a ship that partitions it into watertight compartments, preventing the ship from sinking or flooding if part of the hull is broken open.

The same technique can be employed with your software, so that when part of it is under stress and breaks, the rest of the system continues to function.
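
One way to sketch bulkheads in Python is to give each dependency its own bounded thread pool, so a hanging dependency can only exhaust its own workers (the pool sizes and the `call_*` functions are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Separate, bounded pools: if the recommendation service hangs and ties up
# all 4 of its workers, checkout still has its own 8 workers available.
checkout_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="checkout")
recommendation_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="reco")

def place_order(order):
    # Work for the checkout dependency only ever queues on its own pool.
    return checkout_pool.submit(call_checkout_service, order)

def fetch_recommendations(user_id):
    # A slow recommendation service can exhaust this pool, and nothing else.
    return recommendation_pool.submit(call_recommendation_service, user_id)
```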

Steady state

Every time a human touches a server is an opportunity for unforced errors.

Keep people off the production servers as much as possible by automating regular maintenance tasks.

Any mechanism that accumulates resources must be drained at some point, and at a faster pace than it accumulates those resources, or it will eventually overflow.

The Steady State pattern says that for every mechanism that accumulates resources, some other mechanism must recycle those resources.

Data purging

Sounds like a high-class problem, but at some point your database will start having issues, such as increased I/O rates, latencies, etc.

Being able to purge data from it while keeping your systems running is hard and you need to be prepared for it.

Log Files

Logs accumulate very quickly and take up disk space. Last week's log files are already not very interesting, so anything older than this is pure garbage.

At some point they'll fill up their containing file system and jeopardize the whole app with ENOSPC errors.

If you need to keep logs (to stay compliant with financial regulations, for example), then back them up to a separate machine meant for this.
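
With Python's standard logging module, for instance, a rotating handler caps how much disk space logs can ever consume (the path, file size and backup count below are placeholders):

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)

# Keep at most ~5 files of 10 MB each: old logs are recycled automatically,
# so log files alone can never fill the file system.
handler = RotatingFileHandler("/var/log/app/app.log",
                              maxBytes=10 * 1024 * 1024,
                              backupCount=5)
logger.addHandler(handler)
```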

In-memory caching

Same as logs and databases, caching takes up valuable memory on the server. Make sure to set a correct TTL on your cache entries so they get purged regularly and stale entries don't hold memory needed for anything else.
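
A minimal sketch of a TTL-bounded in-process cache in plain Python (in practice you would more likely lean on a cache with built-in expiry, such as Redis or memcached):

```python
import time

class TTLCache:
    """Dictionary-like cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: free the memory
            return default
        return value
```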

Fail fast

Just like slow responses, slow failures are very bad, because you end up using resources for nothing.
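
The idea in a small sketch: do the cheap checks up front and reject immediately, instead of discovering the problem halfway through expensive work (the order, inventory and payment gateway objects here are hypothetical):

```python
def handle_order(order, inventory, payment_gateway):
    # Fail fast: cheap precondition checks before any expensive work starts.
    if not order.items:
        raise ValueError("empty order")
    if not payment_gateway.is_available():
        raise RuntimeError("payment gateway down, rejecting immediately")
    # Only now do the expensive part.
    inventory.reserve(order.items)
    return payment_gateway.charge(order)
```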

Handshaking

Handshaking is all about letting two devices that communicate know about each other's state and readiness. It creates cooperative demand control.

It may add an additional call to check a dependency's health, which costs extra time and resources, but that is usually cheaper than a failing call.
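
A sketch of handshaking over HTTP, where the client asks about readiness before sending real work (the `/health` and `/orders` endpoints and the service URL are assumptions for illustration):

```python
import urllib.request

def call_if_ready(service_url, payload, timeout=1.0):
    # Handshake: ask the dependency whether it can accept work right now.
    try:
        with urllib.request.urlopen(service_url + "/health", timeout=timeout) as resp:
            if resp.status != 200:
                return None  # dependency says it is not ready, back off
    except OSError:
        return None  # unreachable: don't pile more work onto it
    # The dependency said yes: make the real (more expensive) call.
    req = urllib.request.Request(service_url + "/orders", data=payload,
                                 method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```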

Integration testing

Unit tests are good and you should strive for 100% coverage, but even then, you won't be testing everything, because unit tests are meant to test what is expected from your services. They do "in spec" testing.

An integration testing environment is meant to replicate the production environment, and break it in unexpected ways that unit tests do not cover. It helps test failures in network transport and protocol, application protocol and application logic.

Middleware

Middleware occupies the essential interstices between systems that were not meant to work together.

Done well, it simultaneously integrates and decouples systems.

Some middleware is synchronous and forces the systems to hang and wait. It can amplify shocks to the system, but is sometimes necessary (such as authorizing a credit card during a transaction).

Other, less tightly coupled middleware allows calling and receiving to happen at a different time and place, such as a pub/sub messaging system.

But those are sometimes less useful or harder to deal with.

Summary

No matter what you do, shit will happen in the most unexpected ways.

Avoiding stability anti-patterns will help minimize bad things happening, but never fully prevent them.

Apply stability patterns as you need to protect your systems from going completely down when bad things happen.

Be cynical, be paranoid; in software development, this is a good thing.

Those were my notes for Part I: Stability, of "Release It!".