Sunday, January 25, 2009

Hacking Google App Engine (follow-up)

As mentioned last Tuesday at the meetup, I want to make the sources for this presentation easy to access. This brief post contains all the data necessary.

First, let's start with the basics:
As far as the program I used to demonstrate the hacks:
To try out the different hacks in a local dev environment,
  • go to http://localhost:8080/ on your local machine and enter a couple of chats
  • then, go to http://localhost:8080/stats to see the mini statistics
  • then, go to http://localhost:8080/shell and execute the following script (hitting Ctrl-Enter after each line):
    from google.appengine.api import datastore
    datastore.Query('Greeting').Get(100)
Happy Hacking :-)

Wednesday, January 21, 2009

Hacking Google App Engine (part 2)

(or how to dig into lower level APIs and customize them to your needs)
This is the second half of a talk I gave at a meetup on January 20th. If you were there and there is anything I forgot to jot down in this post, please let me know :-) I will put the slides and sources online soon. Click here for the first half of this article.

As I concluded in my previous post,
If we intend to create a "hack" that is as effective as possible, it behoves us to place it as low-level in our layer of APIs as possible. The higher the layer, the better the chance that some of the code may accidentally bypass it (another programmer might make use of the lower level APIs in a way we did not anticipate). Since the lowest level available to us is right before or after a remote procedure call is executed, that is the place we are going to focus on.

Obviously, I am not the first one to come up with that thought. The App Engine cookbook contains recipes that utilize the very same strategy, such as the "profiling datastore access" recipe. The general mode of operation in these recipes is to monkeypatch: replace apiproxy_stub_map.MakeSyncCall with a wrapper that performs additional logic before or after the RPC call is made (see this file for an example). While this approach will probably work in many cases, it is not without its dangers. Take, for example, the following code snippet from the memcache API in the SDK:

class Client(object):
    """Memcache client object, through which one invokes all memcache operations.
    ...
    """

    def __init__(self, servers=None, debug=0,
                 pickleProtocol=pickle.HIGHEST_PROTOCOL,
                 pickler=pickle.Pickler,
                 unpickler=pickle.Unpickler,
                 pload=None,
                 pid=None,
                 make_sync_call=apiproxy_stub_map.MakeSyncCall):
        """Create a new Client object....
        """
        ...
        self._make_sync_call = make_sync_call


As you can see, the client class stores a local reference to MakeSyncCall instead of invoking apiproxy_stub_map.MakeSyncCall directly. This is not an uncommon technique, since it makes replacing the method for unit tests significantly easier. However, this also means that any memcache Client created before we apply our patch will bypass our modifications altogether. This could be the case, for example, if the memcache module was imported by a script before our patch was applied (static helper methods like memcache.get actually defer to such an internal client object). Tracking down and fixing all these references is complex, and there is no guarantee that a new version of the SDK (or any tool library that we might use for some other purpose) will not introduce new static dependencies later on.
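The effect is easy to reproduce without the SDK. The following is a minimal, self-contained sketch in plain Python; the names rpc, make_sync_call and Client are stand-ins I made up for illustration, not the SDK's own objects. Note that the default argument is bound when the class is defined, so in this sketch even a client created after the patch still holds the old function.

```python
class FakeModule(object):
    """Stand-in for a module such as apiproxy_stub_map (hypothetical)."""


rpc = FakeModule()
rpc.calls = []


def make_sync_call(service, call):
    """The 'real' low-level call; here it only records that it ran."""
    rpc.calls.append((service, call))
    return 'original'


rpc.make_sync_call = make_sync_call


class Client(object):
    """Mimics the memcache client: the default is bound at definition time."""

    def __init__(self, make_sync_call=rpc.make_sync_call):
        self._make_sync_call = make_sync_call

    def get(self):
        return self._make_sync_call('memcache', 'Get')


early_client = Client()            # created before the patch


def patched(service, call):
    """Our monkeypatch: pretends to add logic around the call."""
    return 'patched'


rpc.make_sync_call = patched       # apply the monkeypatch

late_client = Client()             # created after the patch

early_result = early_client.get()                       # 'original'
late_result = late_client.get()                         # 'original' as well
direct_result = rpc.make_sync_call('memcache', 'Get')   # 'patched'
```

Only code that looks up rpc.make_sync_call at call time sees the patch; everything that captured the function earlier keeps calling the original.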

Luckily, the App Engine team is aware of that danger as well. Version 1.1.8 of the SDK introduces a new mechanism into apiproxy_stub_map: the concept of hooks. Any developer can now define a method with the same signature as an RPC call,

def hook(service, call, request, response):
    ...


and register it with the runtime using one of the following methods:

apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
    'unique_name', hook, 'optional_api_identifier')

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append(
    'unique_name', hook, 'optional_api_identifier')


There are two different types of hooks: a PreCallHook is executed before an RPC call is made, and a PostCallHook is executed after the RPC call. It is also possible to specify an optional API identifier that makes sure a hook only gets invoked for a particular type of RPC (such as datastore or memcache). The rest of this post is going to demonstrate the power of hooks by solving two common (simplified) problems within an application:
  • Collect information about datastore use for profiling purposes.
  • Augment models with additional data before they are submitted to the datastore.
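Before diving into the two hacks, it may help to picture how such a hook mechanism can work. The following is a rough sketch in plain Python and not the SDK's implementation: HookList, Proxy and the plain-dict request/response objects are all made up for illustration, and rejecting duplicate names is an assumption of this sketch.

```python
class HookList(object):
    """Ordered list of (name, function, service) hooks; a stand-in class."""

    def __init__(self):
        self._hooks = []

    def append(self, name, function, service=None):
        """Registers a hook; returns False if the name is already taken."""
        if any(existing == name for existing, _, _ in self._hooks):
            return False
        self._hooks.append((name, function, service))
        return True

    def call(self, service, call, request, response):
        """Invokes every hook registered for this service (or for all)."""
        for _, function, only_service in self._hooks:
            if only_service is None or only_service == service:
                function(service, call, request, response)


class Proxy(object):
    """Runs pre-call hooks, then the backend call, then post-call hooks."""

    def __init__(self, backend):
        self._backend = backend
        self.pre_call_hooks = HookList()
        self.post_call_hooks = HookList()

    def make_sync_call(self, service, call, request, response):
        self.pre_call_hooks.call(service, call, request, response)
        self._backend(service, call, request, response)
        self.post_call_hooks.call(service, call, request, response)


events = []


def backend(service, call, request, response):
    events.append('rpc %s.%s' % (service, call))
    response['ok'] = True


def pre(service, call, request, response):
    events.append('pre %s.%s' % (service, call))


def post(service, call, request, response):
    events.append('post %s.%s ok=%s' % (service, call, response.get('ok')))


proxy = Proxy(backend)
proxy.pre_call_hooks.append('trace_pre', pre, 'datastore_v3')  # filtered
proxy.post_call_hooks.append('trace_post', post)               # all services

proxy.make_sync_call('datastore_v3', 'Put', {}, {})
proxy.make_sync_call('memcache', 'Get', {}, {})
```

The pre-call hook only fires for the datastore call because of its service filter, while the unfiltered post-call hook fires for both calls and can already see the response.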
Hack #1: Collecting profiling data
Remember the example from the first post?
On our first day of work, the manager takes us aside and explains the assignment: "we are starting to get a little bit worried about the load on our datastore. Everything performs fine, but recent bills indicate that we put more load on the store than budgeted for. We think some queries must have gone rogue, but we cannot find them. Dig yourself into the code and track that down, will ya?[...]"

The following hack (or should I say "hook"?) is going to help us fix that. The idea is to automatically collect statistics for each model type in our database. We'd like separate counters for put/get/query/delete -- but without having to modify the model classes or handlers that access our data (after all, our fictitious app is large and brittle). For performance reasons, we decide not to write any of our results into the datastore. Instead, we collect some quick and dirty (not accurate or long-lived) stats in memcache, and put more detailed information into logging statements: each datastore operation creates a log entry, which can be retrieved and analyzed offline.

The first step is to create a helper that can collect data in memcache and log more verbose information. The following db_log function achieves that objective:

import logging

from google.appengine.api import memcache


def db_log(model, call, details=''):
    """Call this method whenever the database is invoked.

    Args:
      model: the model name (aka kind) that the operation is on
      call: the kind of operation (Get/Put/...)
      details: any text that should be added to the detailed log entry.
    """

    # First, let's update memcache
    if model:
        stats = memcache.get('DB_TMP_STATS')
        if stats is None: stats = {}
        key = '%s_%s' % (call, model)
        stats[key] = stats.get(key, 0) + 1
        memcache.set('DB_TMP_STATS', stats)

    # Next, let's log for some more detailed analysis
    logging.debug('DB_LOG: %s @ %s (%s)', call, model, details)
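For the offline analysis of those log entries, something along the following lines might be a starting point. It is only a sketch: it assumes the log messages have been exported as plain text lines that still contain the 'DB_LOG: call @ model (details)' message produced above. The sample lines are made up, and count_operations is a name I picked for this example.

```python
import collections
import re

# Matches the message format used by db_log:
#   'DB_LOG: <call> @ <model> (<details>)'
DB_LOG_PATTERN = re.compile(
    r'DB_LOG: (?P<call>\w+) @ (?P<model>\w+) \((?P<details>.*)\)')


def count_operations(lines):
    """Counts log entries per (model, call) pair; other lines are skipped."""
    counts = collections.Counter()
    for line in lines:
        match = DB_LOG_PATTERN.search(line)
        if match:
            counts[(match.group('model'), match.group('call'))] += 1
    return counts


# Made-up sample input, for illustration only.
sample_lines = [
    'DB_LOG: Put @ Greeting ()',
    'DB_LOG: RunQuery @ Greeting ()',
    'DB_LOG: Put @ Greeting ()',
    'DB_LOG: Commit @ None ()',
    'some unrelated log line',
]

counts = count_operations(sample_lines)
```

Calls that are logged without a model show up under the model name 'None', since that is what the format string writes for them.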


It is worth noting that the counts in memcache are only approximations, since we have no way of locking that data and performing increments in a transactional way. That's ok though, since it is only meant as a rough overview, compared to the more detailed data in the logs. We can create a very simple script that displays our statistics in a browser:

import operator

from google.appengine.api import memcache


def main():
    """A very simple handler that will print the temporary statistics."""
    print 'Content-Type: text/plain'
    print ''
    print 'Mini stats'
    print '----------'
    stats = memcache.get('DB_TMP_STATS')
    if stats is None: stats = {}
    for name, count in sorted(stats.items(), key=operator.itemgetter(0)):
        print '%s : %s' % (name, count)
    print '----------'


if __name__ == "__main__":
    main()


Now, all we need to do is actually write our hook. We use a PreCallHook that evaluates our data before the RPC call is made (a PostCallHook would work just as well in this particular case, though):

from google.appengine.api import apiproxy_stub_map
from google.appengine.datastore import datastore_index


def patch_appengine():
    """Apply a hook to app engine that logs db statistics."""
    def model_name_from_key(key):
        # The path runs from the root ancestor down to the entity itself,
        # so the last element carries the kind of the entity.
        return key.path().element_list()[-1].type()

    def hook(service, call, request, response):
        assert service == 'datastore_v3'
        if call == 'Put':
            for entity in request.entity_list():
                db_log(model_name_from_key(entity.key()), call)
        elif call in ('Get', 'Delete'):
            for key in request.key_list():
                db_log(model_name_from_key(key), call)
        elif call == 'RunQuery':
            kind = datastore_index.CompositeIndexForQuery(request)[1]
            db_log(kind, call)
        else:
            db_log(None, call)

    apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
        'db_log', hook, 'datastore_v3')


Most of the effort in this code goes into determining the name of the model that we are working on. If you are interested in how I obtained that knowledge, I recommend one of my earlier articles from the sqlite series. To make a long story short: take a look into the modules entity_pb.py and datastore_pb.py of the SDK. These contain the protocol buffers used for datastore RPCs.

So far, we have created
  • a helper that compiles datastore information in memcache (and logging)
  • a hook that executes the helper and
  • a simple handler to display the results.
All three components can be stored in a single module (let's call it db_log.py). We can activate the hack in our application by adding the following two lines of code to an application:

import db_log
db_log.patch_appengine()


This snippet should be placed somewhere where it is guaranteed to be executed, no matter what handler gets invoked. For example, if there is a common main method (as in the case of a Django application), adding this before that main method gets defined would be a good location. Also, do not forget to add db_log.py to your app.yaml to get access to the quick-and-dirty memcache stats.

Hack #2: Per entity audit data
A couple of months ago, I posted an article about how to keep user specific data private. The idea was to define a new property type that would automagically store the "owner" of a Model object upon persisting it in the store (see also this second post on different ways to code this). While this approach is still a valid way of doing things, it also comes with a couple of disadvantages. For example, if you are using Django's ModelForm class to auto-generate HTML forms, you might need to comb through your code to make sure that your newly added field does not show up in your HTML. Wouldn't it be much easier if your form did not even need to know that this user information was there?

The following hack addresses the first half of the problem: for every entity put into the datastore, it will augment the data by adding the name of the user (last_changed_by) and the timestamp of the last change (last_changed_at). Since this happens without the need to add properties to the model, our business logic would not need to be aware of that change.

The first component of our hack is a helper method that scans an entity protocol buffer for the existence of a particular property. If the property does not exist, a new property is added to the entity (for more information on how entities and properties are connected, check out this post).

def get_or_create_property(entity, property_name):
    """Finds a property with a certain name in the entity.

    Args:
      entity: the entity protocol buffer to inspect
      property_name: the name of the property to look for
    Returns:
      the existing property (or a newly created, if nothing was found)
    """
    for prop in entity.property_list():
        if prop.name() == property_name:
            return prop
    result = entity.add_property()
    result.set_name(property_name)
    result.set_multiple(False)
    return result


Using this helper, augmenting an entity with a new property is a straightforward process:

  • Extract (or add) the property you would like to change using get_or_create_property.

  • Determine what value you would like to store.

  • Populate the property using one of the "pack" methods from datastore_types (PackString and PackDatetime in this example).


The following method shows how it's done:

import datetime
import os

from google.appengine.api import datastore_types


def augment_entity(entity):
    """Adds when and by whom an entity was changed last.

    Args:
      entity: the entity protocol buffer to modify
    """
    change_this = get_or_create_property(entity, 'last_changed_by')
    new_value = os.environ.get('USER_EMAIL')
    datastore_types.PackString(
        'last_changed_by', new_value, change_this.mutable_value())
    change_this = get_or_create_property(entity, 'last_changed_at')
    new_value = datetime.datetime.now()
    datastore_types.PackDatetime(
        'last_changed_at', new_value, change_this.mutable_value())


Now that we have all the logic implemented, all that remains is to put this into a hook that we can register with apiproxy_stub_map:

def patch_appengine():
    """Apply a hook to app engine that stores audit-data with an entity."""
    def hook(service, call, request, response):
        assert service == 'datastore_v3'
        if call == 'Put':
            for entity in request.entity_list():
                augment_entity(entity)

    apiproxy_stub_map.apiproxy.GetPreCallHooks().Append(
        'entity_audit', hook, 'datastore_v3')


If we wanted to make sure that only data belonging to a particular user gets loaded, we could write a second hook (a PostCallHook, actually) that evaluates the created_by property for any get or query to the store and raises an exception if there is a mismatch. This way, no call to the datastore (whether it is done through models or lower level APIs) could bypass our security check.
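I have not written that second hook, but its general shape could look like the following sketch. To keep it runnable without the SDK, the response is a plain dict holding a list of dict entities instead of a real protocol buffer; OwnershipError, make_owner_check and the sample e-mail addresses are names invented for this example, and how to read the property out of a real response is not shown here.

```python
class OwnershipError(Exception):
    """Raised when a response contains data that belongs to another user."""


def make_owner_check(get_current_user, owner_property='created_by'):
    """Builds a post-call hook that rejects entities owned by someone else.

    Args:
      get_current_user: zero-argument callable returning the current user
      owner_property: name of the property that stores the owner (assumed)
    """
    def post_hook(service, call, request, response):
        if call not in ('Get', 'RunQuery'):
            return
        user = get_current_user()
        # Stand-in format: a dict with a list of plain-dict entities.
        for entity in response.get('entities', []):
            # Entities without an owner count as a mismatch as well.
            if entity.get(owner_property) != user:
                raise OwnershipError('%s may not read this entity' % user)
    return post_hook


check = make_owner_check(lambda: 'alice@example.com')

own_data = {'entities': [{'created_by': 'alice@example.com', 'text': 'hi'}]}
check('datastore_v3', 'Get', {}, own_data)   # passes silently

foreign_data = {'entities': [{'created_by': 'bob@example.com', 'text': 'hi'}]}
try:
    check('datastore_v3', 'RunQuery', {}, foreign_data)
    blocked = False
except OwnershipError:
    blocked = True
```

Treating entities without an owner as a mismatch is the stricter of the two possible choices; for data that predates the audit hook, one might prefer to let those through.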

To be done
While this concludes the content presented in my talk, there will still be one more followup post. I have yet to put the slides online, and I also want to compile a little bit more information about the sample application that I used to demonstrate these hacks to the audience. This will follow in the next couple of days, together with submissions of both hooks to the cookbook. I believe someone also taped the talk -- if you are putting that online, please mail me the link, so that I can include this in the next post.

Monday, January 19, 2009

Hacking Google App Engine (part 1)

(or how to dig into lower level APIs and customize them to your needs)
This is the first half of a talk I will give at a meetup on January 20th. If you were there and there is anything I forgot to jot down in this post, please let me know :-)

Google App Engine is a relatively young phenomenon -- it is still almost three months until the first anniversary of its launch to the public at a Campfire One. While there are already a lot of successful applications, tools and frameworks available, there is no way to tell in what creative and astounding ways developers will use the platform in the future.

Let us fast forward a couple of years and assume that we were just hired at theBestAppEngineSiteInTheWorld.com (or Tbaesitw for short, since I hate typing long words). Tbaesitw is one of the most successful, high traffic sites in the world. It is built on top of App Engine, using a hybrid of Django and Pylons that has been branched, modified and mutated so much that looking at it for more than 10 seconds would make Adrian Holovaty cry. It contains tens of thousands of lines of business logic. Models are defined all over the place; some even seem to be duplicates (but with inconsistent property definitions). Unit tests barely exist, and certain sections of the code are so full of Voodoo that the last two engineers trying to modify them are said to have mysteriously disappeared.

On our first day of work, the manager takes us aside and explains the assignment: "we are starting to get a little bit worried about the load on our datastore. Everything performs fine, but recent bills indicate that we put more load on the store than budgeted for. We think some queries must have gone rogue, but we cannot find them. Dig yourself into the code and track that down, will ya? Oh, and while you're at it, can you make some extensions to our models? We have had a hard time investigating customer complaints: they told us that somebody else must have gotten into their private data and made modifications. Is there any way you could add some fields that show when and by whom data was changed last? Oh... and don't break anything!"

Don't break anything... that is sometimes easier said than done. As code grows in the wild, it sometimes becomes hard to maintain and modify. Profiling datastore usage or adding "audit data" (who changed that row?) at all places in a program can be tough. How can we be sure that we did not miss any place in our codebase? How can we prevent another developer from extending the code afterwards without paying attention?

The situation described above is one of the cases where it makes more sense to "hack" ourselves into the underlying App Engine than trying to modify our business logic. By hacking, I mean in this context to change the way App Engine itself behaves instead of changing the way the business logic behaves. Typical use cases are to
  • Change the behavior of an existing application without having to modify a lot of sources,
  • Exchange a backend or service (different datastore, OpenID instead of Google accounts) but keep the existing API, or
  • Apply a crosscutting concern (like logging) to an entire application.
In other words, we are trying to make a big change with very little risk and code.

How do we hack?
To understand how to modify App Engine's general behavior, we have to get a better understanding of how the guts of the runtime work. Let's take a look at the following code snippet and trace through it in a debugger:
# Setup code
from google.appengine.api import apiproxy_stub_map
from google.appengine.api import datastore_file_stub
import os
os.environ['APPLICATION_ID'] = 'test'
stub = datastore_file_stub.DatastoreFileStub('test', None, None)
apiproxy_stub_map.apiproxy.RegisterStub('datastore_v3', stub)

# Model definition
from google.appengine.ext import db
class TestModel(db.Model):
    text = db.StringProperty(default='some text')
save_this = TestModel()

# Step through this!!!
save_this.put()


The code instantiates an in-memory datastore (the magic incantation on how to do this is something I found in one forum post ;-), creates a simple model object and calls its put method. What we are interested in is what put actually does.

Stepping into the code reveals the implementation in the Model class (found in ext/db/__init__.py):
def put(self):
    """Writes this model instance to the datastore.

    If this instance is new, we add an entity to the datastore.
    Otherwise, we update this instance, and the key will remain the
    same.

    Returns:
      The key of the instance (either the existing key or a new key).

    Raises:
      TransactionFailedError if the data could not be committed.
    """
    self._populate_internal_entity()
    return datastore.Put(self._entity)

It looks like our Model class converts our properties into a different data structure (self._entity) and hands this over to a method named datastore.Put. Let's step into that very method and see what happens next:
def Put(entities):
    """Store one or more entities in the datastore.
    ...
    """
    ...
    req = datastore_pb.PutRequest()
    req.entity_list().extend([e._ToPb() for e in entities])
    ...

    resp = datastore_pb.PutResponse()
    try:
        apiproxy_stub_map.MakeSyncCall('datastore_v3', 'Put', req, resp)
    except apiproxy_errors.ApplicationError, err:
        raise _ToDatastoreError(err)
    ...

It looks as if another transformation is going on. Using the method _ToPb, our intermediate data structure gets converted into yet another object. This object gets stored in a datastore_pb.PutRequest. The PutRequest object and a PutResponse object then get handed to yet another module, apiproxy_stub_map:
def MakeSyncCall(service, call, request, response):
    """The APIProxy entry point.

    Args:
      service: string representing which service to call
      call: string representing which function to call
      request: protocol buffer for the request
      response: protocol buffer for the response

    Raises:
      apiproxy_errors.Error or a subclass.
    """
    stub = apiproxy.GetStub(service)
    assert stub, 'No api proxy found for service "%s"' % service
    stub.MakeSyncCall(service, call, request, response)


The following diagram summarizes the layers of framework that we peeled away from our little onion of code:

Our test code (or, more generally speaking, the application or business logic) uses App Engine's Model objects (high level APIs) to access the datastore. These models use a lower-level, less known datastore API to convert these high-level calls into a request/response format that is defined in a module called datastore_pb (the "pb" stands for "protocol buffer", a language independent data format). The request/response pair gets handed off to apiproxy_stub_map, a module that performs RPC operations to lower layers inaccessible to the programmer. The responses are then handed back up the framework stack and interpreted accordingly.

If we intend to create a "hack" that is as effective as possible, it behoves us to place it as low-level in our layer of APIs as possible. The higher the layer, the better the chance that some of the code may accidentally bypass it (another programmer might make use of the lower level APIs in a way we did not anticipate). Since the lowest level available to us is right before or after a remote procedure call is executed, that is the place we are going to focus on. The next post in this series will demonstrate how we can use a recently released feature in apiproxy_stub_map to achieve this goal.
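To make the argument a bit more tangible, here is a toy version of the three layers in plain Python. It is only a sketch under simplified assumptions: Model, datastore_put and make_sync_call are stand-ins I wrote for this example, and the two lists merely record which layer got to see which operation.

```python
seen_at_model_layer = []
seen_at_rpc_layer = []


def make_sync_call(service, call, request):
    """Lowest layer: every datastore operation ends up here."""
    seen_at_rpc_layer.append((service, call))
    return 'stored %d entities' % len(request)


def datastore_put(entities):
    """Middle layer: turns entities into a request and issues the call."""
    return make_sync_call('datastore_v3', 'Put', list(entities))


class Model(object):
    """Top layer: what business logic normally uses."""

    def __init__(self, **properties):
        self.properties = properties

    def put(self):
        # Logic placed here only runs for callers that go through the model.
        seen_at_model_layer.append('put')
        return datastore_put([self.properties])


Model(text='some text').put()               # passes through all three layers
datastore_put([{'text': 'sneaky write'}])   # skips the model layer entirely
```

After both writes, the model layer has recorded one operation and the lowest layer has recorded two, which is exactly why logic placed at the bottom is so much harder to bypass.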

Sunday, January 11, 2009

Sharding woes

This week, an interesting discussion ran through the blogs. It started with an article called Mr. Moore gets to punt on sharding. That post spawned a lot of discussion, from the discussion thread on the blog to other pieces like Quick thought on sharding and Don't Bet on Moore Saving Your Ass. It seems like an interesting read to me, if one can be detached enough to stay on the sidelines in case another flame war breaks out ;-).

Personally, I think designing with scalability in mind is important -- however, implementing it from the beginning may be hard. There are many factors that influence how much effort one can put into this field at the beginning of a product lifecycle, such as

  • Time to market. (Can I afford the extra effort in the beginning, or will spending time on that put me out of business?)

  • Anticipated growth. (Not every software developer in the world needs to write apps that have to scale to millions of users -- think of niche tools that make money through subscriptions and licenses, not advertising)

  • Skill level of the team. (Sharding is not easy. Having inexperienced people trying to pull it off might result in worse performance than just using a standard three-tier architecture with an RDBMS)

  • Availability of tools. (Early releases often undergo lots of changes that will require data migration or fixing inconsistencies due to bugs. How can I (or my support team) accomplish this? Will I have to reinvent the wheel? See also "Time to market")

  • Contractual restrictions (My contract might require me to use a particular database, which could make sharding even more difficult)



I have not always worked at my current job, and I have been in situations in the past where tradeoffs both for and against sharding have been made. I have been saved by faster machines as well, but I am fully aware that this will not work for many situations. I have seen issues that could not be addressed, no matter how much memory one threw at the problem. For example, I have been in teams where a single architect was overseeing multiple projects at once, which resulted in projects where Java novices and graduates on their first job were making most decisions on data structures. I learned a lot at that time -- as Thomas J Watson said, "So go ahead and make mistakes. Make all you can. Because, remember that's where you'll find success." Still, those mistakes were made on my employer's dime, and it would have been great if that could have been avoided. We got the system out in time, but we had to pay the price for it later, when we struggled to keep the system running smoothly while gradually introducing a more performant infrastructure.

With that in mind, I sometimes look at the restrictions within App Engine's datastore API, and I wonder if one shouldn't apply the same restraint on an RDBMS based project. It would be possible, for example, if enough people helped get gae-sqlite to a point where it works on any RDBMS and has feature parity with the real datastore. Developers could work on an API that encompasses all the best practices of Google's highly scalable architecture but still have the backend tools necessary to deal with the early woes of bugfixes and schema evolution. Then, once the system is mature enough and scalability becomes a concern, simply write a migration script to upload the data into the cloud and benefit from App Engine's scalability and reliability. Unless of course purchasing a faster server works well enough, and Mr. Moore really gets to punt on sharding...

Sunday, January 4, 2009

... and a happy new year :-)

Just wanted to wish everyone out there a great 2009 -- may your code be concise, your unit tests pass and your deadlines be mercifully long.

I'm currently starting to prepare my talk for the January meetup called "Hacking App Engine." If there's anything in particular you'd like covered, let me know. As long as it fits the theme, I'll do my best to put it in. If you can't make it, no worries: I'll make sure that slides or code snippets are available eventually.

Take care, everyone. Happy new year.