One of his examples was a more efficient way of building a global counter. The naive approach I have been guilty of in the past (store the count in a single entity and update it in a transaction) turns out to be a performance bottleneck, because every transaction has to wait on a single lock for that entity. Instead, he suggested splitting the counter into several shards, thus distributing the locking over more than one entity group. I don't recall the exact implementation, but it went something like this:
from google.appengine.ext import db
import random

SHARDS_PER_COUNTER = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def GetCount(nameOfCounter):
    # Sum the counts of all shards that belong to this counter.
    result = 0
    for shard in CounterShard.gql('WHERE name=:1', nameOfCounter):
        result += shard.count
    return result

def ChangeCount(nameOfCounter, delta):
    # Pick a random shard so that concurrent writers rarely hit the same entity.
    shard_id = '/%s/%s' % (
        nameOfCounter, random.randint(1, SHARDS_PER_COUNTER))
    def update():
        # Create the shard on first use, otherwise add to its count.
        shard = CounterShard.get_by_key_name(shard_id)
        if shard:
            shard.count += delta
        else:
            shard = CounterShard(
                key_name=shard_id, name=nameOfCounter, count=delta)
        shard.put()
    db.run_in_transaction(update)
Pretty nice, and it gets even better: using the newly announced Memcache API, we can keep the counter in memory and only update it once in a while (which is acceptable for many use cases). The following modification keeps the counter in memory for a minute before reloading it:
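from google.appengine.api import memcache

def GetCount(nameOfCounter):
    memcache_id = '/CounterShard/%s' % nameOfCounter
    # Serve the cached total if it has not expired yet.
    result = memcache.get(memcache_id)
    if result is not None:
        return result
    # Cache miss: re-read all shards and cache the sum for 60 seconds.
    result = 0
    for shard in CounterShard.gql('WHERE name=:1', nameOfCounter):
        result += shard.count
    memcache.set(memcache_id, result, 60)
    return result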
Brett, thanks for giving this talk. I'm looking forward to trying this out in my applications!
4 comments:
Cool, thanks for posting this. I was in that talk and you must have remembered it better than I.
Did you make Guido's talk on making Django work with App Engine? I watched it, but as a newbie to both Python and Django I definitely don't remember all the steps he went through, and as far as I can tell they have not posted it yet.
I don't understand why it would be ok in your counter example to only update the data store "once in a while". After all, nothing stored in memcache is guaranteed to be there the next time you check... right?
> Did you make Guido's talk on making Django work with App Engine?
No, I was on a different "track" at that time. GWT, if I remember correctly :-)
> I don't understand why it would be ok in your counter example to only update the data store "once in a while".
Sorry, I might have miscommunicated that. I meant that we only update the memcache once in a while. In other words, changes to the counter in the database will not be reflected in the get-method for a little while (60 seconds in my code).
If you look at the code sample in the blog post, you will see that I only modified the GetCount method, not ChangeCount(). GetCount stores the summed value in the cache and sets an expiration time on it. Thus, all changes to the overall counter are persisted immediately, but the value is only re-read from the persistent storage once in a while.
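To make the timing concrete, here is a minimal sketch that runs without App Engine. Everything in it is a stand-in I made up for illustration: `FakeCache` imitates a cache with expiry, the `shards` dict plays the role of the datastore, and the `now` parameter replaces the real clock so the behaviour is easy to follow. It is not how memcache is implemented, just a model of the read path described above.

```python
import random

SHARDS_PER_COUNTER = 20
CACHE_SECONDS = 60

class FakeCache(object):
    """Hypothetical in-memory stand-in for a cache with expiry."""
    def __init__(self):
        self._data = {}  # key -> (value, expiry time)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)

shards = {}          # stand-in for the datastore: shard key name -> count
cache = FakeCache()

def change_count(name, delta):
    # Writes always go straight to the (fake) datastore.
    shard_id = '/%s/%s' % (name, random.randint(1, SHARDS_PER_COUNTER))
    shards[shard_id] = shards.get(shard_id, 0) + delta

def get_count(name, now):
    # Reads are served from the cache until the entry expires.
    cache_id = '/CounterShard/%s' % name
    result = cache.get(cache_id, now)
    if result is not None:
        return result
    prefix = '/%s/' % name
    result = sum(c for key, c in shards.items() if key.startswith(prefix))
    cache.set(cache_id, result, CACHE_SECONDS, now)
    return result
```

If you increment the counter, read it, and increment it again, a second read within the 60 seconds still returns the old total, while a read after the entry has expired picks up both increments. Nothing is lost in between, because the writes never depended on the cache.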
A question:
You're setting up your key_names in the form '/CounterName/ShardNumber' -- is this a short-hand method of creating parent/child relationships? Ryan Barrett showed a similar syntax in his Google IO talk when talking about how rows are stored in Bigtable but I cannot find documentation about this anywhere.
In your code, run_in_transaction works so I'm assuming that this is somehow creating an entity group but when I call .parent() on a counter shard, I get None.
Just trying to get my head around this -- would appreciate a clarification on how this works.
Thanks! :)
> You're setting up your key_names in the form '/CounterName/ShardNumber' -- is this a short-hand method of creating parent/child relationships?
No, it is just a way to name the persistent entities. Check out Brett's talk (http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine) for more of the rationale behind it. The idea is to avoid GQL and instead look up all entities by their primary key, which is significantly faster.
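As a rough sketch of what such a key-based lookup could look like for the counter: because the key names follow a fixed pattern, all of them can be generated up front. The two helper functions below are hypothetical names of mine, not something from the talk, and they are plain Python so they run anywhere. The commented lines at the end assume the CounterShard model from the post and rely on `get_by_key_name` accepting a list of key names and returning None for shards that do not exist yet.

```python
SHARDS_PER_COUNTER = 20

def shard_key_names(name_of_counter, num_shards=SHARDS_PER_COUNTER):
    """Return every key name ChangeCount could have used for this counter."""
    return ['/%s/%s' % (name_of_counter, i)
            for i in range(1, num_shards + 1)]

def sum_shard_counts(shards):
    """Add up the counts, skipping shards that were never created (None)."""
    return sum(shard.count for shard in shards if shard is not None)

# Inside an App Engine app, the helpers would combine roughly like this
# (assumption: CounterShard is the model defined in the blog post):
#   shards = CounterShard.get_by_key_name(shard_key_names('hits'))
#   total = sum_shard_counts(shards)
```

This replaces the query with one batch fetch by key. It only works as long as SHARDS_PER_COUNTER does not shrink after shards have been written.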
> Ryan Barrett showed a similar syntax in his Google IO talk when talking about how rows are stored in Bigtable but I cannot find documentation about this anywhere
You can find links to all AppEngine related I/O sessions here: http://googleappengine.blogspot.com/2008/06/google-io-session-videos-posted-with.html