Even Worse Than It Appears…

Written by mao
June 10th, 2008 | Posted in Hyperic HQ

Two things today.

First, thanks to Cote, Matt, Javier and others for their kind words. I am tremendously excited to be working with Hyperic. I’ve liked the company for a long time, and I’m even more impressed by the team and the strategy now that I am spending a couple of days a week here. I haven’t abandoned retirement altogether, but I have allowed it to erode a little bit because I like this opportunity so much.

Second, I want to pile on Javier’s post on the availability issues at Amazon over the past several days.

It’s worth pointing out that it had to be a pretty lousy weekend for the people responsible for running Amazon’s infrastructure. If you take a step back, the only reason the downtime is remarkable is that it’s so rare. Nobody blinks when Twitter goes off-line. When the retailer that has been my favorite since 1997 disappears, I notice, because it simply never happens. I bet that things settle down quickly and we get back to the fast and reliable storefront that we’ve all come to expect.

Javier and others have touched on a likely cause of the outage: Complexity. As systems get more moving parts, they become harder to monitor and maintain. Many hope that the move to cloud computing will make things better; as you use infrastructure in the cloud, the thinking goes, you’ll be able to rely on the cloud service provider to keep it running.

As the downtime at Amazon’s storefront demonstrates, that’s a false hope. If you rely on computing services anywhere, you need to monitor them, and you need to understand how their availability affects your operations. IT shops are running more applications than ever: JBoss, Tomcat, MySQL, home-grown software to run their businesses, along with the laundry list of proprietary and legacy applications they’ve installed over the years. These interact with one another. Every one of these software programs, and every connection among them, is a new potential source of failure.
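To put rough numbers on that last point, here is a back-of-the-envelope sketch in Python. It assumes that a request needs every component in the chain to be up and that components fail independently; both are simplifications. The component names and the 99.9% figures are made up for illustration and are not measurements from Amazon or anyone else.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical stack: each piece is individually "three nines".
stack = {
    "load balancer": 0.999,
    "Tomcat": 0.999,
    "JBoss": 0.999,
    "MySQL": 0.999,
    "hosted storage service": 0.999,
}

combined = serial_availability(stack.values())
print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {expected_downtime_hours(combined):.1f} hours/year")
```

Under these assumptions, the chain as a whole is less available than any single piece of it, and each component you add pushes the combined figure lower. Redundancy changes the arithmetic, but only if you are monitoring closely enough to know which pieces are actually down.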

We hardly ever abandon old systems and infrastructure. We only add new ones. Increased complexity is an irresistible force of nature, and managing it requires new techniques and new tools.

Back to the Amazon outage specifically: I’ve seen a couple of quotes in the media from people who have said, more or less, “Gee, whatever they changed that messed things up, they should have changed it during off hours.”

The fact of the matter is that there is no longer any such thing as “off hours.” Amazon’s storefront, certainly, runs around the clock. It may be nighttime in North America, but it’s daytime in Eastern Europe and Asia. More and more businesses, especially those that deliver services over the Internet, simply never get to shut their computers down for maintenance. Their operations infrastructure has to take that into account.

More software running on more hardware in more places equals more complexity. At the same time, users all over the globe expect instantaneous access to data and services from anywhere, at any time. That combination means that IT professionals are staring at some pretty serious problems. The situation is even worse than it appears, though: for many businesses, as for Amazon, if the computers go down, the money stops flowing.

I’m glad to be at Hyperic because we’re working on the hard problems. These days, manageability of core infrastructure is the iceberg in front of most businesses.
