While we are waiting with bated breath for the Java version, it is worth taking a closer look at the Python implementation. AppStats is the poster child of an extremely useful library that, if done wrong, could have made life very uncomfortable for its users. First of all, it can collect statistics on every RPC within every HTTP request in the app -- that is a lot of data! Where will this data be stored? If the library used the datastore, all those statistics would count against one's quota. Also, if the library happened to use an entity name that a user's application also used, data might get corrupted. An alternative might be to just use memcache -- but wouldn't that potentially collide with cached entries of the user's app? And depending on how big that data is, won't it fill up the entire memcache?
Looking into recording.py from the SDK reveals a couple of very interesting implementation details. The first two can be found in the method _save, which stores recorded data for later analysis:

    def _save(self):
        part, full = self.get_both_protos_encoded()
        key = make_key(self.start_timestamp)
        errors = memcache.set_multi({config.PART_SUFFIX: part,
                                     config.FULL_SUFFIX: full},
                                    time=36*3600, key_prefix=key,
                                    namespace=config.KEY_NAMESPACE)
        if errors:
            logging.warn('Memcache set_multi() error: %s', errors)
        return key, len(part), len(full)

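To make the set_multi call a bit more concrete, here is a toy in-memory stand-in that I put together for illustration. It is not the App Engine memcache API: the class FakeCache and its methods are hypothetical, and the only thing it tries to show is how a key prefix and a namespace can keep two writers from stepping on each other's entries.

```python
# Toy in-memory stand-in for a namespaced cache. This is NOT the App Engine
# memcache API; FakeCache and its methods are hypothetical and exist only to
# illustrate how key_prefix and namespace keep entries apart.
class FakeCache(object):
    def __init__(self):
        self._store = {}

    def set_multi(self, mapping, key_prefix='', namespace=None):
        # Each entry is stored under (namespace, key_prefix + key).
        for suffix, value in mapping.items():
            self._store[(namespace, key_prefix + suffix)] = value
        return []  # this toy version never fails to store a key

    def get(self, key, namespace=None):
        return self._store.get((namespace, key))


cache = FakeCache()

# The application caches something under the key 'k' in the default namespace.
cache.set_multi({'k': 'user data'})

# A stats recorder uses the same key name, but inside its own namespace.
cache.set_multi({'k': 'stats data'}, namespace='__appstats__')

# With a key prefix, one call stores two related entries: 'k:part' and 'k:full'.
cache.set_multi({':part': 'summary', ':full': 'details'},
                key_prefix='k', namespace='__appstats__')

user_value = cache.get('k')                              # the app's own entry
stats_value = cache.get('k', namespace='__appstats__')   # the recorder's entry
part_value = cache.get('k:part', namespace='__appstats__')
```

Both writers use the key 'k', yet neither overwrites the other, because the namespace is part of the lookup.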
First of all, the data is apparently encoded using protocol buffers. Protocol buffers are not only language-neutral (one could read the binary data from C++ or Java, for example); their encoding is also designed to produce a very small binary representation, which minimizes the storage requirements. On top of that, when storing the data to memcache, the programmer sets a namespace for the keys. Thus, as long as nobody uses the same namespace in their code (according to recording.py, it is '__appstats__'), stats written to memcache will not overwrite user data.

The other interesting aspect is in the make_key function that produces the memcache key:

    def make_key(timestamp):
        distance = config.KEY_DISTANCE
        modulus = config.KEY_MODULUS
        tmpl = config.KEY_PREFIX + config.KEY_TEMPLATE
        msecs = int(timestamp * 1000)
        index = ((msecs // distance) % modulus) * distance
        return tmpl % index

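The arithmetic in make_key is easy to try outside App Engine. The sketch below copies the function but replaces the config module with plain constants. Only KEY_MODULUS = 1000 is taken from recording.py as quoted in this post; the values for KEY_DISTANCE, KEY_PREFIX and KEY_TEMPLATE are assumptions chosen for illustration, so check the SDK's own config before relying on them.

```python
# Stand-alone sketch of the bucketing in make_key. KEY_MODULUS = 1000 is the
# value quoted in the post; the other three constants are assumed values.
KEY_DISTANCE = 100           # assumed: bucket width in milliseconds
KEY_MODULUS = 1000           # from the post: number of distinct buckets
KEY_PREFIX = '__appstats__'  # assumed
KEY_TEMPLATE = ':%06d'       # assumed


def make_key(timestamp):
    """Map a timestamp in seconds to one of KEY_MODULUS cache keys."""
    msecs = int(timestamp * 1000)
    index = ((msecs // KEY_DISTANCE) % KEY_MODULUS) * KEY_DISTANCE
    return (KEY_PREFIX + KEY_TEMPLATE) % index


# Two requests 40 ms apart fall into the same bucket and share a key.
same_a = make_key(1000.010)
same_b = make_key(1000.050)

# A request 100 ms after the first one lands in the next bucket.
next_key = make_key(1000.110)

# After KEY_DISTANCE * KEY_MODULUS milliseconds the keys wrap around,
# so a later request reuses (and would overwrite) an old slot.
wrap_seconds = KEY_DISTANCE * KEY_MODULUS / 1000.0
wrapped = make_key(1000.010 + wrap_seconds)

# However many requests arrive, at most KEY_MODULUS keys are ever used.
distinct = set(make_key(t / 1000.0) for t in range(0, 500000, 7))
print(same_a, next_key, len(distinct))
```

Under these assumed values, the keys cycle after 100 seconds, and any two requests inside the same 100 ms window compete for a single slot.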
It turns out that the amount of data held in memcache is very well controlled. make_key uses the timestamp to compute a bucket index through a series of division and modulo operations. This way, only a fixed number of keys (KEY_MODULUS is 1000) is ever in use, and a new record overwrites whatever was stored earlier under the same key. Also, to prevent a burst of activity over a short period of time from crowding out samples from a longer time period, requests that come in too close to each other map to the same key.

If you have ever used a profiler to find bottlenecks in a desktop app, you know the performance penalty one usually pays for the extra information. Thanks to a couple of very smart design decisions, the penalty for AppStats is very small. I am looking forward to reading testimonials from people using it in the real world :-)