A newsletter about complex systems and how they can fail.
Every human being on earth depends on complex systems for survival. From the basics like food and fresh water to luxuries like Wi-Fi and air travel, we couldn't live without complex systems. This is as true for isolated island communities with no contact with the modern world as it is for rich industrialized nations.
When these systems break down or misbehave, even for short periods of time, the result can be more than just an inconvenience: it can cost human lives. Power outages. Slow Wi-Fi. Pandemics. Cold takeout. Famines. Clogged toilets. Wars.
The good thing is, the more complex you make a system, the less likely it is to fail, right? Right? RIGHT??
Well, not always. The more complexity you add to a system, the more failure modes it will accrue. Each function you add is another thing that can fail, and more often than not a single feature can fail in a number of fun and interesting ways.
Engineering these complex systems usually involves the tedious exercise of listing out all the fun and interesting ways your system can fail, along with an analysis of what would happen if and when it does. That's what an FMEA (Failure Mode and Effects Analysis) is: a list of all the ways your system can fail, with some analysis of what the consequences of each failure might be.
As part of completing an FMEA, you'll typically rank each failure mode on 3 criteria:
- Severity — if it happens, how bad is it?
- Occurrence — how often does it happen?
- Detection — if it happens, how easily can we detect it?
Each criterion is given a ranking from 1 to 9, where a higher number is worse (more severe, more frequent, or harder to detect). When you multiply the three rankings together you get what's called an RPN (Risk Priority Number). Often there are so many failure modes that you won't be able to prevent every single one, so the RPN provides a way to prioritize the failure modes that are the most consequential.
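If it helps to see the arithmetic, here is a minimal sketch of the RPN calculation in Python. Everything in it is made up for illustration: the `FailureMode` class, the `prioritize` helper, and especially the rankings, which are hypothetical numbers rather than the result of a real analysis. It assumes the 1 to 9 scale described above, and that a higher detection ranking means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a (hypothetical) FMEA spreadsheet."""
    description: str
    severity: int    # if it happens, how bad is it?
    occurrence: int  # how often does it happen?
    detection: int   # assumed: higher means harder to detect

    def __post_init__(self):
        # Enforce the 1 to 9 ranking scale on every criterion.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 9:
                raise ValueError(f"{name} must be between 1 and 9, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three rankings."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes sorted from highest RPN to lowest."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative rankings only; these are not from a real FMEA.
modes = [
    FailureMode("Power outage", severity=8, occurrence=3, detection=1),    # RPN 24
    FailureMode("Slow Wi-Fi", severity=2, occurrence=7, detection=2),      # RPN 28
    FailureMode("Clogged toilet", severity=4, occurrence=4, detection=5),  # RPN 80
]

for mode in prioritize(modes):
    print(f"{mode.rpn:>3}  {mode.description}")
```

Notice that in this toy example the most severe failure ends up at the bottom of the list, because it is rare and easy to spot. That is the kind of trade-off the RPN is meant to surface, though a real analysis would not rely on the number alone.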
Each edition of this newsletter will pick a complex system to do a mini FMEA on. These documents (usually taking the form of a spreadsheet) are normally quite large, with potentially thousands of failure modes. For the sake of the newsletter, I'll likely abbreviate this format a bit by focusing on a single failure mode.