Wednesday, January 21, 2009

Hacking Google App Engine (part 2)

(or how to dig into lower level APIs and customize them to your needs)
This is the second half of a talk I gave at a meetup on January 20th. If you were there and there is anything I forgot to jot down in this post, please let me know :-) I will put the slides and sources online soon. Click here for the first half of this article.

As I concluded in my previous post,
If we intend to create a "hack" that is as effective as possible, it behoves us to place it as low-level in our layer of APIs as possible. The higher the layer, the better the chance that some of the code may accidentally bypass it (another programmer might make use of the lower level APIs in a way we did not anticipate). Since the lowest level available to us is right before or after a remote procedure call is executed, that is the place we are going to focus on.

Obviously, I am not the first one to come up with that thought. The App Engine cookbook contains recipes that utilize the very same strategy, such as the "profiling datastore access" recipe. The general mode of operation in these recipes is to monkeypatch: replace apiproxy_stub_map.MakeSyncCall with a wrapper that performs additional logic before or after the rpc call is made (see this file for an example). While this approach will probably work in many cases, it is not without its dangers. Take for example a look at the following code snippet out of the memcache api in the sdk:

class Client(object):
"""Memcache client object, through which one invokes all memcache operations.
...
"""

def __init__(self, servers=None, debug=0,
pickleProtocol=pickle.HIGHEST_PROTOCOL,
pickler=pickle.Pickler,
unpickler=pickle.Unpickler,
pload=None,
pid=None,
make_sync_call=apiproxy_stub_map.MakeSyncCall):
"""Create a new Client object....
"""
...
self._make_sync_call = make_sync_call


As you can see, the client class stores a local reference to MakeSyncCall instead of invoking apiproxy_stub_map.MakeSyncCall directly. This is not an uncommon technique, since it makes replacing the method for unit tests significantly easier. However, this also means that whatever memcache-Client is created before we apply our patch will bypass our modifications alltogether. This could be for example the case if the memcache module was imported by a script before our patch was applied (static helper methods like memcache.get actually defer to such an internal client object). Tracking down and fixing all these references is complex, and there is no guarantee that a new version of an SDK (or any tool library that we might use for some other purpose) will not introduce new static dependencies later on.

Luckily, the App Engine team is aware of that danger, as well. Version 1.1.8 of the SDK introduces a new mechanism into api_proxy_map: the concept of hooks. Any developer can now define a method of the same signature as an rpc-call,

  def hook(service, call, request, response):
...


and register it with the runtime using one of the following methods:

apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
'unique_name', hook, 'optional_api_identifier')

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append(
'unique_name', hook, 'optional_api_identifier')


There are two different types of hooks. A PreCallHook is executed before an RPC call is made, a PostCallHook is executed after the RPC call. It is also possible to specify an optional api identifier that will make sure that a hook only gets invoked for a particular type of rpc (such as datastore or memcache). The rest this post is going to demonstrate the power of hooks by solving two common (simplified) problems within an application:
  • Collect information about datastore use for profiling purposes.
  • Augment models with additional data before they are submitted to the datastore.
Hack #1: Collecting profiling data
Remember the example from the first post?
On our first day of work, the manager takes us aside and explains the assignment: "we are starting to get a little bit worried about the load on our datastore. Everything performs fine, but recent bills indicate that we put more load on the store as budgeted for. We think some queries must have gone rogue, but we cannot find them. Dig yourself into the code and track that down, will ya?[...]"

The following hack (or should I say "hook"?) is going to help us fix that. The idea is to automatically collect statistics for each model type in our database. We'd like separate counters for put/get/query/delete -- but without having to modify the model classes or handlers that access our data (after all, our fictitious app is large and brittle). For performance reasons, we decide not to write any of our results into the datastore. Instead, we collect some quick and dirty (not accurate or long-lived) stats in memcache, and put more detailed information into logging statements: each datastore operation creates a log entry, which can be retrieved and analyzed offline.

The first step is to create a helper that can collect data in memcache and log more verbose information. The following db_log achieves that objective

def db_log(model, call, details=''):
"""Call this method whenever the database is invoked.

Args:
model: the model name (aka kind) that the operation is on
call: the kind of operation (Get/Put/...)
details: any text that should be added to the detailed log entry.
"""

# First, let's update memcache
if model:
stats = memcache.get('DB_TMP_STATS')
if stats is None: stats = {}
key = '%s_%s' % (call, model)
stats[key] = stats.get(key, 0) + 1
memcache.set('DB_TMP_STATS', stats)

# Next, let's log for some more detailed analysis
logging.debug('DB_LOG: %s @ %s (%s)', call, model, details)


It is worth noting that the counts in memcache are only approximations, since we have no way of locking that data and performing increments in a transactional way. That's ok though, since it is only meant as a rough overview, compared to the more detailed data in the logs. We can create a very simple script that displays our statistics in a browser:

def main():
"""A very simple handler that will print the temporary statistics."""
print 'Content-Type: text/plain'
print ''
print 'Mini stats'
print '----------'
stats = memcache.get('DB_TMP_STATS')
if stats is None: stats = {}
for name, count in sorted(stats.items(), key=operator.itemgetter(0)):
print '%s : %s' % (name, count)
print '----------'


if __name__ == "__main__":
main()


Now, all we need is to actually write our hook. We use a PreCallHook that evaluates our data before the rpc call is made (a PostCallHook would work just as fine in this particular case, though):

def patch_appengine():
"""Apply a hook to app engine that logs db statistics."""
def model_name_from_key(key):
return key.path().element_list()[0].type()

def hook(service, call, request, response):
assert service == 'datastore_v3'
if call == 'Put':
for entity in request.entity_list():
db_log(model_name_from_key(entity.key()), call)
elif call in ('Get', 'Delete'):
for key in request.key_list():
db_log(model_name_from_key(key), call)
elif call == 'RunQuery':
kind = datastore_index.CompositeIndexForQuery(request)[1]
db_log(kind, call)
else:
db_log(None, call)

apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
'db_log', hook, 'datastore_v3')


Most of the effort spent in this code is about determining what the name of the model is that we are working on. If you are interested in how I obtained that knowledge, I recommend one of my earlier articles from the sqlite series. To make a long story short: take a look into the modules entity_pb.py and datastore_pb.py of the SDK. These contain the protocol buffers used for datastore RPCs.

So far, we have created
  • a helper that compiles datastore information in memcache (and logging)
  • a hook that executes the helper and
  • a simple handler to display the results.
All three components can be stored in a single module (let's call it db_log.py). We can activate the hack in our application by adding the following two lines of code to an application:

import db_log
db_log.patch_appengine()


This snippet should be placed somewhere where it is guaranteed to be executed, no matter what handler gets invoked. For example, if there is a common main method (in case of a django application), adding this before that main method gets defined would be a good location. Also, do not forget to add db_log.py to your app.yaml to get access to the quick-and-dirty memcache stats.

Hack #2: Per entity audit data
A couple of months ago, I posted an article about how to keep user specific data private. The idea was to define a new property type that would automagically store the "owner" of a Model object upon persisting it in the store (see also this second post on different ways to code this). While this approach is still a valid way of doing things, it also comes with a couple of disadvantages. For example, if you are using Django's ModelForm class to auto-generate html forms, you might need to comb through your code to make sure that your newly added field does not show up in your html. Wouldn't it be much easier, if your form would not even need to know that this user information was there?

The following hack addresses the first half of the problem: for every entity put into the datastore, it will augment the data by adding the name of the user (last_changed_by) and the timestamp of the last change (last_changed_at). Since this happens without the need to add properties to the model, our business logic would not need to be aware of that change.

The first component of our hack is a helper method that scans an entity protocol buffer for the existence of a particular property. If the property does not exist, a new property is added to the entity (for more information how entities and properties are connected, check out this post).

def get_or_create_property(entity, property_name):
"""Finds a property with a certain name in the entity.

Args:
entity: the entity protocol buffer to inspect
property_name: the name of the property to look for
Returns:
the existing property (or a newly created, if nothing was found)
"""
for prop in entity.property_list():
if prop.name() == property_name:
return prop
result = entity.add_property()
result.set_name(property_name)
result.set_multiple(False)
return result


Using this helper, augmenting an entity with a new property is a straightforward process:

  • Extract (or add) the property you would like to change using get_or_create_property.

  • Determine what value you would like to store.

  • Populate the property using one of the "pack" methods from datastore_types (PackString and PackDatetime in this example).


The following method shows how it's done:

def augment_entity(entity):
"""Adds when and by whom an entity was changed last.

Args:
entity: the entity protocol buffer to modify
"""
change_this = get_or_create_property(entity, 'last_changed_by')
new_value = os.environ.get('USER_EMAIL')
datastore_types.PackString(
'last_changed_by', new_value, change_this.mutable_value())
change_this = get_or_create_property(entity, 'last_changed_at')
new_value = datetime.datetime.now()
datastore_types.PackDatetime(
'last_changed_at', new_value, change_this.mutable_value())


Now that we have all the logic implemented, all that remains is to put this into a hook that we can register with api_proxy_stub_map:

def patch_appengine():
"""Apply a hook to app engine that stores audit-data with an entity."""
def hook(service, call, request, response):
assert service == 'datastore_v3'
if call == 'Put':
for entity in request.entity_list():
augment_entity(entity)

apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
'entity_audit', hook, 'datastore_v3')


If we wanted to make sure that only data gets loaded that belongs to a particular user, we could write a second hook (a PostCallHook actually) that evaluates the created_by property for any get or query to the store and raises an exception if there is a mismatch. This way, no call to the datastore (whether it is done through models or lower level APIs) could bypass our security check.

To be done
While this concludes the content presented in my talk, there will still be one more followup post. I have yet to put the slides online, and I also want to compile a little bit more information about the sample application that I used to demonstrate these hacks to the audience. This will follow in the next couple of days, together with submissions of both hooks to the cookbook. I believe someone also taped the talk -- if you are putting that online, please mail me the link, so that I can include this in the next post.

5 comments:

Antonin Hildebrand said...

Really interesting reading. Thanks for the article.

Jeff said...

I'm using this method to handle sharded counters and it's working well except for one problem I can't seem to solve. The "Put" call applies to both Add and Update but I only want my counters incremented on Add. Do you know of any way in these hooks to tell if it's being added or updated?

Thanks.

The App Engine Fan said...

Hi Jeff,

Unfortunately, I do not :-(. I would assume that there is no way to preemptively know whether a put is an add or an update, unless one is several layers deeper, much closer to the store (which we do not really have access to). I assume you are using a postread hook?

Cheers,

Jens

We Love Drupal said...

Thank you for your very helpful article. I was wondering if you could elaborate more about how to write the second hook you talk about at the end of your post. I am trying to do almost exactly what you describe but am having trouble gaining access to the request and response object. Basically i am trying to intercept a get or post request anywhere in my application, examine the user against some permission tables, and alter the actual get or post request based on the results of my query against a permissions table. When I look at the class type for the request passed in by the stub it reports itself as google.appengine.datastore.datastore_pb.Query or NextRequest or QueryResult. Can you help? Thank you.

The App Engine Fan said...

Hi,

> I was wondering if you could elaborate more about how to write the second hook you talk about at the end of your post.

I have links to the sources and more information here:

http://blog.appenginefan.com/2009/01/hacking-google-app-engine-follow-up.html

Is that what you are looking for? Amongst other things, it contains this link for a cookbook entry (with source code):

http://appengine-cookbook.appspot.com/recipe/per-entity-audit-data-last_changed_by-and-last_changed_at

Cheers,

Jens