Sunday, May 25, 2008

Framework Darwinism on the bleeding edge of technology

I was recently looking for information on how to handle session-related data in Google App Engine. Without too much effort, I found three different approaches to the issue:
  • Appengine Utilities, a simple solution that targeted the simple use case I was looking for
  • Django App Engine Utilities, an extension for the very successful Django framework (though it requires upgrading to a newer version than the one that comes with App Engine out of the box)
  • Beaker, a relatively sophisticated caching library that supposedly ties into multiple frameworks
I was very happy that I would be able to use an existing solution. I was even happier that enough people cared about App Engine to create useful code and put it out for everyone to use. Alas, it also made me a little worried: how would I be able to choose the right solution in the long run? Should I go with something that
  • just works for my simple use case?
  • works for the framework I currently use but requires me to upgrade?
  • does more than I currently need, just because it "might" be useful in the future?
The problem with a relatively young technology is that there is an explosion of libraries, and it is anyone's guess what will work out best in the long term. We have all been in similar situations before. In early 2000, for example, my employer at the time was building a complex Java-based system and had to decide which logging mechanism to use. Back then, the standard was still in its drafting stage and the existing frameworks were only partially compatible. Eventually, we decided to go with a framework from IBM alphaWorks, but to wrap it in a layer that was closer to the draft standard of that time. Did we make the right decision? Hard to say in retrospect -- especially since we had quite a few do-overs in the next couple of years: other frameworks, especially log4j, emerged that offered far more features and better performance than what we had before. At the same time, we tried to play catch-up with changes in the official standard (some of them significant) without having to rework all of our existing logging code. In the end, we were left with a home-brewed wrapper, not unlike Commons Logging, that was not quite like the standard but did its job for us.

So what's my point? Forget about the open source projects out there and reinvent the wheel? Not quite: if there is something well established that does the job, use it by all means. However, for a young technology like App Engine, it can be expected that things will be in flux for a while. The definitive tools of the trade are yet to emerge -- whatever third-party libraries you pick, you can bet that a certain percentage of them will become obsolete during the lifetime of your project. Unless they are very easy to rip out and replace, always consider a thin layer of isolation. At least, that's what I will do for my session cache.
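To make the "thin layer of isolation" idea a bit more concrete, here is a minimal sketch of what such a layer might look like for session data. The names SessionStore and InMemorySessionStore are hypothetical and not taken from any of the libraries mentioned above; the in-memory backend is only a stand-in for whichever library ends up doing the real work.

```python
class SessionStore(object):
    """Hypothetical interface: the only session API the application calls."""

    def get(self, session_id, key, default=None):
        raise NotImplementedError

    def put(self, session_id, key, value):
        raise NotImplementedError

    def clear(self, session_id):
        raise NotImplementedError


class InMemorySessionStore(SessionStore):
    """Stand-in backend. A real backend would delegate to the chosen
    third-party library instead of a plain dictionary."""

    def __init__(self):
        self._data = {}

    def get(self, session_id, key, default=None):
        return self._data.get(session_id, {}).get(key, default)

    def put(self, session_id, key, value):
        self._data.setdefault(session_id, {})[key] = value

    def clear(self, session_id):
        self._data.pop(session_id, None)
```

If the library behind the store becomes obsolete, only the backend class has to be rewritten; the handlers that call get, put and clear stay as they are.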

Saturday, May 10, 2008

The darker side of multiplexing, or how to prevent site hijacking

About a month ago, I mentioned in a post about multiplexing that -- since a single App Engine application can be bound to many domains -- a developer can use this to run several applications from the same app id, or even run the same app within several namespaces (see my shortlinker running at aef.appspot.com and links.appenginefan.com). Unfortunately, I forgot to mention the darker side of this App Engine feature: site hijacking.

Suppose you have a cool application at "mySuperDuperCoolApp.foo" that you expect to make you lots of money once it has a broad fanbase. Unfortunately, a couple of people discover your app id on appspot.com and decide to "link" your app into their own sites. Suddenly, your app is hosted at a multitude of domains that you do not own. Not only does this dilute the strength of the mySuperDuperCoolApp.foo brand -- it also enables the hijackers to point the URL someplace else at a later point in time and steal your users (once they have cloned all your features). That's unacceptable!


The following script (I call it "linksteal.py") is an example of how this can be prevented. The script assumes that you are using the webapp framework, but it could easily be adapted to the framework of your choice. Just import it in all your handler scripts and you should be fine:
from google.appengine.ext import webapp

# A list of domains that are permissible
ALLOWED_DOMAINS = ('localhost', 'aef.appspot.com')

# The main domain of this application
MAIN_DOMAIN = 'aef.appspot.com'

# The original __call__ method, which we replace with our own
original_call = webapp.WSGIApplication.__call__

# The new call method that checks the domain first
def new_call(self, environ, start_response):
    # Requests for a permitted domain go to the original handler
    if environ['SERVER_NAME'] in ALLOWED_DOMAINS:
        return original_call(self, environ, start_response)
    # Everything else gets a 403 and a link to the main domain
    start_response('403 Invalid URL (content stolen?)',
                   [('Content-type', 'text/html')])
    return ["""<html><body>
The URL requested belongs to a site that should not
be accessible through this domain. Please go to
<a href="http://%s/">%s</a> instead.
</body></html>""" % (MAIN_DOMAIN, MAIN_DOMAIN)]

webapp.WSGIApplication.__call__ = new_call


So how exactly does this work? In ALLOWED_DOMAINS, we specify the domains that are "legal" hosts of our application (localhost is included to make the local dev server work). MAIN_DOMAIN is the name of the main domain to which we refer users who clicked on a "stolen link". We monkey-patch WSGIApplication to wrap the original __call__ method with a check of the server name. If the server name is in the permitted list, we call the original handler (and thus the code we wrote). Otherwise, we return a 403 and refer the user to MAIN_DOMAIN.
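For frameworks other than webapp, the same check can also be written as plain WSGI middleware instead of a monkey-patch. The following is only a sketch under that assumption: restrict_domains and demo_app are hypothetical names, and the response body is shortened for brevity.

```python
ALLOWED_DOMAINS = ('localhost', 'aef.appspot.com')
MAIN_DOMAIN = 'aef.appspot.com'


def restrict_domains(app, allowed=ALLOWED_DOMAINS, main_domain=MAIN_DOMAIN):
    """Wrap a WSGI app so that only requests for allowed hosts reach it."""
    def wrapped(environ, start_response):
        # Permitted server name: pass the request through untouched.
        if environ.get('SERVER_NAME') in allowed:
            return app(environ, start_response)
        # Anything else: answer with a 403 that points at the main domain.
        start_response('403 Forbidden', [('Content-Type', 'text/html')])
        body = ('<html><body>Please go to '
                '<a href="http://%s/">%s</a> instead.</body></html>'
                % (main_domain, main_domain))
        return [body.encode('utf-8')]
    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in application, used only to illustrate the wrapper."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


application = restrict_domains(demo_app)
```

The wrapper leaves the framework's classes alone, which makes it easier to remove later, but it has to be applied explicitly to every application object, whereas the monkey-patch covers them all with a single import.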

By the way: this script can also be used to disable hosting of the app at "appspot.com". Just take the appspot.com domain out of ALLOWED_DOMAINS.

Saturday, May 3, 2008

PHP sucks, and other flame wars

Don't you love it when people take a simple innocent question and turn it into a language war? There are variations, but here is how it usually begins:
  • Person A: "I'd like to work in language XYZ"
  • Person B: "XYZ sucks and so do people using XYZ"
  • Person A: ???
Paul Graham wrote an interesting classification of ways to disagree, and I am happy to see that, in this particular case, people have actually been too smart to simply take the bait and start calling each other names ;-) There is a big community of PHP developers out there, and I am glad that they are looking for constructive ways of making their case, rather than getting tricked into some artificial dispute. Personally, I have not signed the petition. It's not that I don't like the language -- I simply don't care enough. Languages are tools -- some work better than others for a particular use case, but they will all get the job done eventually.

Instead of focusing on new languages, I think people should try working with what's out there (Python) and request new features and APIs instead. Having said that: if I could wave a magic wand and have a new programming environment supported, it would probably be the JVM. Not Java per se, but the virtual machine its bytecode runs on. Why? Well, it has the potential of giving us not one new language, but an entire set, such as
  • Java
  • PHP (through Quercus)
  • Groovy
  • Scala
  • JavaScript (through Rhino)
What do you guys think? Let the flame wars begin...