Friday, July 10, 2009

My near deathmarch experience

A couple of years ago, I fell asleep standing up in a server room. It was around my 30th birthday, on a January day in cold-as-hell Ohio. We had externally committed to a delivery date for on-site installation of our new system, and we were fighting hard to make the target. During the day, we were on site: migrating data, installing the client software and more computers, and training the staff on using the new system. At night, we were locked up in our hotel suite, trying to fix bugs we had run into during the day. No one got more than three hours of sleep a day, including our manager (who was not programming, but did training, kept the customer happy, or simply closed certain feature gaps himself by doing the most amazing things with Crystal Reports and data snapshots in an Access database).

On the day the system went live and was exposed to real load, we ran into things we had not seen before. One of them was what can only be called random server death: every now and then, without any warning in advance, a server would decide to completely freeze. All the king's horses, and all the king's men, couldn't convince the machine to unfreeze again. Only hitting the power button and restarting the machine could.

As we could not simply abort our launch, we took turns to guard those devilish devices of doom. Two of us would stay with the customer, while the third would sneak into the server room and wait for Dell to freeze over (pun intended -- although it really was not the hardware's fault!). And so it was my turn; I found myself surrounded by the soothing humming of power devices and the playful chirp of disks being accessed. It seemed so quiet and peaceful. All I remember is standing in front of a rack and leaning my head against the cool steel of the mount, just for a brief moment...

...and suddenly, I heard someone yelling: "the servers are down! the servers are down!" Had I been in worse physical shape, the shock would have given me a heart attack. Torn back into reality, I spun around towards the origin of the voice -- and stared into the grinning face of my boss, who had found me standing asleep in the server room and couldn't resist giving me the scare of my life. For what it's worth: the servers were fine.

Fast forward a couple of years, somewhere around October. I had just been through another launch; one that was possibly amongst the least stressful in my life. A bunch of new services went live, and one of them was my responsibility. Unlike in previous situations, I did not have any colleagues to "take shifts in the server room." If something broke, it was up to me to fix it. If I wanted a new feature, it was up to me to code it. If there was a bug report from a user, it was up to me to investigate it. Granted, the application was not as complex as what we tried to launch that cold winter in Ohio, but it was going to scale to many users, be up all the time, and not break beyond my control (or even worse -- randomly).

Lucky me, the new service was written in Google App Engine (well, what did you expect? This blog happens to be called "App Engine Fan" ;-) ). So, what did that mean for me?
  • In a previous life, I was on a pager rotation (actually, a dedicated support cell phone) to react to hard- and software issues that any of our customers might experience. For my new service, automated systems are monitoring the servers for me. As I am sharing them with thousands of other applications, any potential issue affects so many other people that it is quickly discovered and taken care of. Heck, I can even peek at my application's status console from wherever I can get to a web browser. Or if it is a more general heartbeat I want, I can simply look at the overall system status.
  • In a previous life, I had to contact all my customers to perform critical updates. Anyone remember the daylight saving time change in the US in 2007? Well, we still had a lot of JDK 1.3 based systems out there, and upgrading them was a major effort (thus pain and cost). Now that my good little App Engine service runs in the Cloud, I can trust Google to take care of that stuff for me :-)
  • In a previous life, when stuff went wrong, I had to comb through operating system Event Logs, hoping to find the culprit hidden in some obscure error message that I could only decipher through a ton of Web searches. Nowadays, on the rare occasion that bad stuff happens, I can rely on a team of system experts to do the digging for me and supply me with detailed reports on what went wrong, why it went wrong, and what will be done to prevent it in the future.
I am not going to draw any conclusions here on whether coding for App Engine is the best thing that ever happened to me. I actually believe that the winter in Ohio was well spent. It formed a strong bond within the team (yes, even with that guy who almost scared me to death), and we took many lessons away that eventually made our software better, stronger, and more useful to our clients. One thing I would like to mention though: I am glad that I won't be falling asleep in a server room anytime soon...
