These thoughts are mostly speculation, but it seems to me that it is becoming harder and harder to scale computer systems beyond what we have now. I am noticing this in HPC, but also in datacenters. The roadblock isn’t really technical; it has more to do with the way humans think.
When you reach single clusters of up to a quarter of a million machines, you have to start thinking of systems administration statistically. What do I mean by that? Let’s take an example:
Say I have 100,000 identical machines, and their MTBF is 4 years, counting every type of failure that takes a machine offline (network cables, OS, etc.). That means that on average you will have 25,000 failures a year.
That translates to a failure every: 86400*365 / 25000 = 1261.44 s. That’s just over 21 minutes!
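The same arithmetic as a small Python sketch, so you can plug in your own numbers. The function name and parameters are my own for illustration, and it assumes failures are spread evenly over the year:

```python
SECONDS_PER_YEAR = 86400 * 365

def failure_interval_seconds(machines: int, mtbf_years: float) -> float:
    """Average time between failures across the whole fleet, in seconds."""
    # Each machine fails on average once per MTBF, so the fleet sees
    # machines / mtbf_years failures per year.
    failures_per_year = machines / mtbf_years
    return SECONDS_PER_YEAR / failures_per_year

interval = failure_interval_seconds(100_000, 4.0)
print(f"{interval:.2f} s, or about {interval / 60:.1f} minutes between failures")
```

Double the fleet or halve the MTBF and the interval halves with it.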
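If the MTTR can be assumed to be about 10 minutes (normally by reprovisioning, assuming diskless nodes, or by changing a component or cable), you can see that it is unsafe to assume that all the machines will be up at the same time. On average about half a machine is down at any instant (10 minutes of repair for every 21 minutes between failures), so, treating failures as independent, there is only about a 62% chance that every machine is up at a given moment. If repairs take hours rather than minutes, that chance drops to practically zero.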
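To put a number on that, here is a rough estimate in Python. It assumes machines fail independently and that each one is unavailable for a fraction MTTR / (MTBF + MTTR) of the time. Both are simplifications, and the function names are mine:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

def unavailability(mtbf_years: float, mttr_minutes: float) -> float:
    """Steady-state fraction of time a single machine is down."""
    mtbf_minutes = mtbf_years * MINUTES_PER_YEAR
    return mttr_minutes / (mtbf_minutes + mttr_minutes)

def prob_all_up(machines: int, mtbf_years: float, mttr_minutes: float) -> float:
    """Probability that every machine is up at a random instant,
    assuming independent failures."""
    q = unavailability(mtbf_years, mttr_minutes)
    # (1 - q) ** machines, computed via logs to stay accurate for tiny q.
    return math.exp(machines * math.log1p(-q))

for mttr in (10, 60, 240):
    p = prob_all_up(100_000, 4.0, mttr)
    down = 100_000 * unavailability(4.0, mttr)
    print(f"MTTR {mttr:>3} min: {down:5.2f} machines down on average, P(all up) = {p:.5f}")
```

With the 10 minute MTTR this comes out at roughly 0.62, and it falls off very quickly as repairs get slower. Real failures are often correlated (a switch or a rack going down), which makes things worse than this estimate.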
This assumption will affect several things, like capacity planning. You should always take into account the fact that whatever system it is, it will never be 100% online. I mean, it might be, but it definitely won’t be a regular thing.
With so many failures you need a fast failover system, a lot of redundancy, and a quick way of turning machines around. Let’s say that you have 10,000 MySQL frontends and 60,000 Apache servers in that cluster (we are including “traditional” hosting in the argument now!), and all of a sudden you have a surge in queries for whatever reason (memcached fails?). You need to respond to that quickly and rebalance the system on the fly, so you need to automate server provisioning and the joining of machines to the cluster (and to the load balancers). You also need to do this for failures. Well, that throws DNS round robin out the window!
Also, it cannot be just automated. It has to be quick. And I mean lightning fast. Hard provisioning might not be an option, so you might end up using diskless nodes. You will need to scale the provisioning system as a cluster itself. And even add automation to that. So provision the provisioning.
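As an illustration of what statistical capacity planning could look like, here is a minimal sketch that asks: how many spare machines do I need so that, with a given confidence, enough of them are up? It assumes the number of machines down at any instant follows a Poisson distribution (reasonable when failures are independent and rare); the function names and the 99.9% target are made up for the example:

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam). Fine for the small lam used here."""
    term = math.exp(-lam)   # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        total += term
    return total

def spares_needed(expected_down: float, target: float = 0.999) -> int:
    """Smallest number of spares that covers the machines currently down
    with probability >= target."""
    k = 0
    while poisson_cdf(k, expected_down) < target:
        k += 1
    return k

# 100,000 machines, MTBF 4 years, MTTR 10 minutes (assumed values)
mtbf_minutes = 4 * 365 * 24 * 60
expected_down = 100_000 * 10 / (mtbf_minutes + 10)
print(spares_needed(expected_down, target=0.999))
```

The point is not the exact figure but the shape of the question: you plan for a distribution of outcomes instead of a fixed machine count.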
Now piece all this together and what you are left with is a big datacenter that’s constantly evolving and changing, so you can no longer even think of a machine being part of one cluster or the other. It’s just a big blob of varying things. A moving target. A probabilistic, non-deterministic system.
That is where statistics comes into play. Statistical methods are very good at describing this type of system, and soon we will be seeing a big change in systems administration and planning to match this. It’s already starting to happen.