On the day the system went live and was exposed to real load, we ran into things we had not seen before. One of them was what can only be called random server death: every now and then, without any warning, a server would decide to freeze completely. All the king's horses, and all the king's men, couldn't convince the machine to unfreeze again. Only hitting the power button and restarting the machine could.
As we could not simply abort our launch, we took turns to guard those devilish devices of doom. Two of us would stay with the customer, while the third would sneak into the server room and wait for Dell to freeze over (pun intended -- although it really was not the hardware's fault!). And so it was my turn; I found myself surrounded by the soothing humming of power devices and the playful chirp of disks being accessed. It seemed so quiet and peaceful. All I remember is standing in front of a rack and leaning my head against the cool steel of the mount, just for a brief moment...
...and suddenly, I heard someone yelling: "the servers are down! the servers are down!" Had I been in worse physical shape, the shock would have given me a heart attack. Torn back into reality, I spun around towards the origin of the voice -- and stared into the grinning face of my boss, who had found me standing asleep in the server room and couldn't resist giving me the scare of my life. For what it's worth: the servers were fine.
Fast forward a couple of years, somewhere around October. I had just been through another launch; one that was possibly amongst the least stressful in my life. A bunch of new services went live, and one of them was my responsibility. Unlike in previous situations, I did not have any colleagues to "take shifts in the server room." If something broke, it was up to me to fix it. If I wanted a new feature, it was up to me to code it. If there was a bug report from a user, it was up to me to investigate it. Granted, the application was not as complex as what we tried to launch that cold winter in Ohio, but it was going to scale to many users, be up all the time, and not break beyond my control (or even worse -- randomly).
Lucky me, the new service was built on Google App Engine (well, what did you expect? This blog happens to be called "App Engine Fan" ;-) ). So, what did that mean for me?
- In a previous life, I was on a pager rotation (actually, a dedicated support cell phone) to react to hard- and software issues that any of our customers might experience. For my new service, automated systems are monitoring the servers for me. As I am sharing them with thousands of other applications, any potential issue affects so many other people that it is quickly discovered and taken care of. Heck, I can always peek at my application's status console from wherever I can get to a web browser. Or, if it is a more general heartbeat I want, I can simply look at the overall system status.
- In a previous life, I had to contact all my customers to perform critical updates. Anyone remember the daylight saving time change in the US in 2007? Well, we still had a lot of JDK 1.3-based systems out there, and upgrading them was a major effort (thus pain and cost). Now that my good little App Engine service runs in the Cloud, I can trust Google to take care of that stuff for me :-)
- In a previous life, when stuff went wrong, I had to comb through operating system Event Logs, hoping to find the culprit hidden in some obscure error message that I could only decipher through a ton of Web searches. Nowadays, on the rare occasion that bad stuff happens, I can rely on a team of system experts to do the digging for me and supply me with detailed reports on what went wrong, why it went wrong, and what will be done to prevent it in the future.