Monday, January 19, 2009

Hacking Google App Engine (part 1)

(or how to dig into lower level APIs and customize them to your needs)
This is the first half of a talk I will give at a meetup on January 20th. If you were there and there is anything I forgot to jot down in this post, please let me know :-)

Google App Engine is a relatively young phenomenon -- it is still almost three months until the first anniversary of its launch to the public at a Campfire One. While there are already a lot of successful applications, tools and frameworks available, there is no way to tell in what creative and astounding ways developers will use the platform in the future.

Let us fast forward a couple of years and assume that we were just hired at theBestAppEngineSiteInTheWorld.com (or Tbaesitw for short, since I hate typing long words). Tbaesitw is one of the most successful, high-traffic sites in the world. It is built on top of App Engine, using a hybrid of Django and Pylons that has been branched, modified and mutated so much that looking at it for more than 10 seconds would make Adrian Holovaty cry. It contains tens of thousands of lines of business logic. Models are defined all over the place; some even seem to be duplicated (but with inconsistent property definitions). Unit tests barely exist, and certain sections of the code are so full of Voodoo that the last two engineers trying to modify them are said to have mysteriously disappeared.

On our first day of work, the manager takes us aside and explains the assignment: "We are starting to get a little bit worried about the load on our datastore. Everything performs fine, but recent bills indicate that we put more load on the store than budgeted for. We think some queries must have gone rogue, but we cannot find them. Dig yourself into the code and track that down, will ya? Oh, and while you're at it, can you make some extensions to our models? We have had a hard time investigating customer complaints: they told us that somebody else must have gotten into their private data and made modifications. Is there any way you could add some fields that show when and by whom data was changed last? Oh... and don't break anything!"

Don't break anything... that is sometimes easier said than done. As code grows in the wild, it sometimes becomes hard to maintain and modify. Profiling datastore usage or adding "audit data" (who changed that row?) at all places in a program can be tough. How can we be sure that we did not miss any place in our codebase? How can we prevent another developer from extending the code afterwards without paying attention?

The situation described above is one of the cases where it makes more sense to "hack" ourselves into the underlying App Engine than to try to modify our business logic. By hacking, I mean in this context changing the way App Engine itself behaves instead of changing the way the business logic behaves. Typical use cases are to
  • Change the behavior of an existing application without having to modify a lot of sources,
  • Exchange a backend or service (different datastore, OpenID instead of Google accounts) but keep the existing API, or
  • Apply a crosscutting concern (like logging) to an entire application.
In other words, we are trying to make a big change with very little risk and code.

How do we hack?
To understand how to modify App Engine's general behavior, we have to get a better understanding of how the guts of the runtime work. Let's take a look at the following code snippet and trace through it in a debugger:
# Setup code
from google.appengine.api import apiproxy_stub_map
from google.appengine.api import datastore_file_stub
import os
os.environ['APPLICATION_ID'] = 'test'
stub = datastore_file_stub.DatastoreFileStub('test', None, None)
apiproxy_stub_map.apiproxy.RegisterStub('datastore_v3', stub)

# Model definition
from google.appengine.ext import db
class TestModel(db.Model):
    text = db.StringProperty(default='some text')

save_this = TestModel()

# Step through this!!!
save_this.put()


The code instantiates an in-memory datastore (the magic incantation on how to do this is something I found in one forum post ;-), creates a simple model object and calls its put method. What we are interested in is what put actually does.

Stepping into the code reveals the implementation in the Model class (found in ext/db/__init__.py):
def put(self):
  """Writes this model instance to the datastore.

  If this instance is new, we add an entity to the datastore.
  Otherwise, we update this instance, and the key will remain the
  same.

  Returns:
    The key of the instance (either the existing key or a new key).

  Raises:
    TransactionFailedError if the data could not be committed.
  """
  self._populate_internal_entity()
  return datastore.Put(self._entity)

It looks like our Model class converts our properties into a different data structure (self._entity) and hands this over to a function named datastore.Put. Let's step into that very function and see what happens next:
def Put(entities):
  """Store one or more entities in the datastore.
  ...
  """
  ...
  req = datastore_pb.PutRequest()
  req.entity_list().extend([e._ToPb() for e in entities])
  ...

  resp = datastore_pb.PutResponse()
  try:
    apiproxy_stub_map.MakeSyncCall('datastore_v3', 'Put', req, resp)
  except apiproxy_errors.ApplicationError, err:
    raise _ToDatastoreError(err)
  ...

It looks as if another transformation is going on. Using the method _ToPb, our intermediate data structure gets converted into yet another object. This object gets stored in a datastore_pb.PutRequest. The PutRequest object and a PutResponse object then get handed to yet another module, apiproxy_stub_map:
def MakeSyncCall(service, call, request, response):
  """The APIProxy entry point.

  Args:
    service: string representing which service to call
    call: string representing which function to call
    request: protocol buffer for the request
    response: protocol buffer for the response

  Raises:
    apiproxy_errors.Error or a subclass.
  """
  stub = apiproxy.GetStub(service)
  assert stub, 'No api proxy found for service "%s"' % service
  stub.MakeSyncCall(service, call, request, response)


The following diagram summarizes the layers of framework that we peeled away from our little onion of code:

Our test code (or, more generally speaking, the application or business logic) uses App Engine's Model objects (high-level APIs) to access the datastore. These models use a lower-level, lesser-known datastore API to convert these high-level calls into a request/response format that is defined in a module called datastore_pb (the "pb" stands for "protocol buffer", a language-independent data format). The request/response pair gets handed off to apiproxy_stub_map, a module that performs RPC operations to lower layers inaccessible to the programmer. The responses are then handed back up the framework stack and interpreted accordingly.
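
The layering can be imitated in a few lines of plain Python. The sketch below is not App Engine code: ToyRequest, ToyResponse, InMemoryStub, ToyStubMap, make_sync_call and put are all made-up names that only mimic the shape of the call chain we just traced. The point it tries to show is that the caller creates both the request and an empty response, and that whatever stub is registered under the service name fills in the response in place.

```python
# Toy stand-ins for the layers described above. None of these classes are
# part of the App Engine SDK; they only mimic the shape of the call chain.


class ToyRequest(object):
    """Plays the role of a request protocol buffer (e.g. a PutRequest)."""

    def __init__(self):
        self.entities = []


class ToyResponse(object):
    """Plays the role of a response protocol buffer (e.g. a PutResponse)."""

    def __init__(self):
        self.keys = []


class InMemoryStub(object):
    """Plays the role of a service stub; keeps its rows in a dict."""

    def __init__(self):
        self.rows = {}

    def MakeSyncCall(self, service, call, request, response):
        # Dispatch on the call name; only 'Put' is modelled here.
        if call != 'Put':
            raise NotImplementedError(call)
        for entity in request.entities:
            key = len(self.rows) + 1
            self.rows[key] = dict(entity)
            response.keys.append(key)  # the response is filled in place


class ToyStubMap(object):
    """Plays the role of the service-name -> stub registry."""

    def __init__(self):
        self._stubs = {}

    def RegisterStub(self, service, stub):
        self._stubs[service] = stub

    def GetStub(self, service):
        return self._stubs.get(service)


toy_apiproxy = ToyStubMap()


def make_sync_call(service, call, request, response):
    """Look up the stub for `service` and let it fill in `response`."""
    stub = toy_apiproxy.GetStub(service)
    assert stub, 'No api proxy found for service "%s"' % service
    stub.MakeSyncCall(service, call, request, response)


def put(properties):
    """Low-level API layer: wrap the data in a request, return the new key."""
    request = ToyRequest()
    request.entities.append(properties)
    response = ToyResponse()
    make_sync_call('datastore_v3', 'Put', request, response)
    return response.keys[0]


toy_apiproxy.RegisterStub('datastore_v3', InMemoryStub())
first_key = put({'text': 'some text'})
```

In this toy version, registering a different stub under 'datastore_v3' would change where the data ends up without touching put or any of its callers.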

If we intend to create a "hack" that is as effective as possible, it behoves us to place it as low in our stack of APIs as possible. The higher the layer, the better the chance that some of the code may accidentally bypass it (another programmer might make use of the lower-level APIs in a way we did not anticipate). Since the lowest level available to us is right before or after a remote procedure call is executed, that is the place we are going to focus on. The next post in this series will demonstrate how we can use a recently released feature in apiproxy_stub_map to achieve this goal.
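
To make the argument a bit more concrete, here is a rough sketch of what intercepting calls at that boundary could look like. It is plain Python and does not use the apiproxy_stub_map feature that the next post covers. EchoStub and CountingStub are hypothetical names, and the only thing assumed about a stub is that it exposes MakeSyncCall(service, call, request, response), as seen in the trace above.

```python
import collections


class EchoStub(object):
    """Hypothetical backend stub: copies the request payload to the response."""

    def MakeSyncCall(self, service, call, request, response):
        response['echo'] = request['payload']


class CountingStub(object):
    """Hypothetical wrapper: counts every call, then delegates unchanged."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.counts = collections.defaultdict(int)

    def MakeSyncCall(self, service, call, request, response):
        # Runs right before the (pretend) remote procedure call.
        self.counts[(service, call)] += 1
        self._wrapped.MakeSyncCall(service, call, request, response)


stub = CountingStub(EchoStub())
response = {}
stub.MakeSyncCall('datastore_v3', 'Put', {'payload': 'some text'}, response)
stub.MakeSyncCall('datastore_v3', 'Get', {'payload': 'a key'}, response)
stub.MakeSyncCall('datastore_v3', 'Put', {'payload': 'more text'}, response)
```

Because every higher layer ends up in MakeSyncCall, a wrapper like this sees each call no matter which API the business logic used to issue it.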

2 comments:

Matthew Casperson said...

If you're looking for a quick introduction to the Google App Engine check out http://www.squidoo.com/Google-App-Engine

mdirolf said...

Check out the MongoDB AppEngine Connector. It works by replacing the datastore_file_stub with a stub that uses MongoDB as a backend. Not exactly what this post is getting at, but still a good example of hacking at AppEngine internals.