Security of complex systems

As I write this (September 2014), the Internet is in panic over a catastrophic remote code execution bug in which bash, a commonly-used shell on many of today’s servers, can be exploited to run arbitrary code.

Let’s backtrack a bit: how is it possible that a bug in a command-line shell is exploitable remotely? And why is it a problem if a shell, designed to help its user run arbitrary code, allows the user to run the code? It’s complicated.

Arguably, bash is just a scapegoat. Yes, it does have a real bug that causes environment variables with certain values to be executed automatically, without them being invoked manually[0]. But that seems like a minor issue, considering it doesn’t accept input from anyone else but the local user and the code runs as the local user.
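To make the "certain values" concrete: the publicly circulated test for this bug puts a function definition followed by an extra command into an environment variable and then starts bash. Below is a minimal sketch of that probe in Python. The variable name `probe`, the marker string and the helper name `bash_runs_trailing_code` are arbitrary choices for illustration; a patched bash should simply yield False.

```python
import shutil
import subprocess

# A function definition followed by an extra command. An affected bash
# runs the extra command while importing the variable at startup.
PAYLOAD = "() { :; }; echo VULNERABLE"


def bash_runs_trailing_code(bash_path=None):
    """Return True/False for an affected/unaffected bash, None if bash is missing."""
    bash_path = bash_path or shutil.which("bash")
    if bash_path is None:
        return None
    env = {"PATH": "/usr/bin:/bin", "probe": PAYLOAD}
    # The command itself (":") prints nothing, so any output on stdout
    # can only come from the environment import.
    result = subprocess.run(
        [bash_path, "-c", ":"],
        env=env, capture_output=True, text=True, timeout=10,
    )
    return "VULNERABLE" in result.stdout


if __name__ == "__main__":
    print(bash_runs_trailing_code())
```

Note that nobody ever calls the function or even mentions the variable; starting the shell is enough.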

Of course, there’s a catch. Certain network servers store some information from the network (headers from web requests) in an environment variable to pass it on (to the web application). This is also not a bug by itself, though it can be argued it’s not the best possible way to pass this information around.
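A rough sketch of that hand-off, assuming the CGI naming convention (header name upper-cased, hyphens turned into underscores, prefixed with `HTTP_`). The function name `headers_to_env` is made up for illustration and real servers do more than this:

```python
def headers_to_env(headers):
    """Map request headers to CGI-style environment variables."""
    env = {}
    for name, value in headers.items():
        key = "HTTP_" + name.upper().replace("-", "_")
        # The value is copied verbatim: whatever the client sent
        # ends up in the environment of the web application.
        env[key] = value
    return env


if __name__ == "__main__":
    print(headers_to_env({"User-Agent": "() { :; }; echo pwned"}))
```

The point is that the server has no reason to inspect the value; it is only passing it along.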

But sometimes, web applications need to execute other programs. In theory, they should do so directly by forking and executing another program, but they often use a shortcut and call a standard system function, which calls the application indirectly, via the shell[1]. As an example, that’s how PHP invokes the sendmail program when the developer calls the mail function.
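The difference between the two routes is easy to show in Python, used here only as an illustration (the PHP case mentioned above works along the same lines). In this sketch the child program is the Python interpreter itself so the example does not depend on any particular binary being installed; on POSIX systems `shell=True` runs the command line through `/bin/sh -c`, much like the C `system` function does.

```python
import shlex
import subprocess
import sys

# The program to run and its arguments, as a list.
ARGV = [sys.executable, "-c", 'print("hello")']


def run_direct():
    """Fork and exec the program directly; no shell is started."""
    return subprocess.run(ARGV, capture_output=True, text=True).stdout


def run_via_shell():
    """Build a command line and hand it to the shell, system()-style."""
    command_line = " ".join(shlex.quote(arg) for arg in ARGV)
    return subprocess.run(
        command_line, shell=True, capture_output=True, text=True
    ).stdout


if __name__ == "__main__":
    print(repr(run_direct()), repr(run_via_shell()))
```

Both calls produce the same output, which is exactly why the shortcut is so tempting: the extra shell process in the middle is invisible until it misbehaves.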

Any one of the above, when taken separately, though not ideal, doesn’t seem like a serious problem. It is the compound effect that’s terrifying:

  1. Web visitor sets a cookie or a header with the malicious value;
  2. Web server sets the environment variable for the header to this value;
  3. Web server calls the application;
  4. Application calls anything else the easy way, via the shell which happens to be bash;
  5. Bug in bash is triggered and the code in the environment variable executed.
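The data flow in steps 1 to 4 can be sketched in a few lines. This is a harmless simulation, not an exploit: it only shows that a string chosen by the visitor arrives, untouched, in the environment of the shell. The function name `handle_request` and the `printf` helper command are stand-ins invented for this example.

```python
import subprocess


def handle_request(user_agent):
    """Simulate server -> application -> shell for one request header."""
    # Step 2: the server puts the header value into the environment.
    env = {"PATH": "/usr/bin:/bin", "HTTP_USER_AGENT": user_agent}
    # Steps 3-4: the application runs a helper "the easy way", through
    # the system shell. The helper just echoes back what the shell sees.
    result = subprocess.run(
        'printf "%s" "$HTTP_USER_AGENT"',
        shell=True, env=env, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(handle_request("Mozilla/5.0 (compatible; example)"))
```

Step 5 is the only part that needs the bug; everything before it is components doing what they were designed to do.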

(This is an example with web servers, but other servers may be equally vulnerable: there are proof-of-concept attacks against certain DHCP and SIP servers as well.)

So who’s to blame? Everybody and nobody. The system is so complex that unwanted behaviours like these emerge by themselves, as a result of the way the components are connected and interact[2]. There is no single master architect who could’ve anticipated and guarded against this.

The insight about this emergent behaviour is nothing new, and was in fact described in detail in the research paper How Complex Systems Fail, required reading for ops engineers at Google, Facebook, Amazon and other companies deploying huge computer systems. Although the paper doesn’t talk about security specifically, as Bruce Schneier puts it, it’s all fundamentally about security.

There is no cure. There’s no way we can design systems of such complexity, including security systems, so that they don’t fail (or can’t be exploited).

The best that we can do is to be well-equipped to handle the failures.


[0] Curiously enough, bash accepts the -r option to activate restricted mode, in which this, and a host of other potentially problematic features, are turned off. The system function doesn’t use it though, because that’s not a standard POSIX shell option, it’s an addition from bash. Arguably, bash should detect it’s being called as a system shell and run in POSIX compatibility mode, but compatibility doesn’t necessarily forbid adding new features. In fact, bash has the same behavior even when running in POSIX compatibility mode with --posix. Turtles all the way down.

[1] There are valid reasons to invoke sub-processes via the shell beyond the convenience of the system function: environment variable expansion (ironic, isn’t it?) or shell globbing come to mind.

[2] Note that only this specific combination of components is vulnerable. If the shell used is not bash, there is no problem. For example, dash is the default on newer Debian and Ubuntu systems. These systems may still be vulnerable if the user under which the server is running uses bash instead of the system shell, so the threat is still very real.