Saturday, August 30, 2008

App Engine and SQLite

Wow. I'd like to thank everyone who filled out the form and left me feedback on what to write about next. They are very good, and I am glad to say that I now have a little backlog of things to talk about :-) Keep the ideas coming, and thank you so much!

Many suggestions fell into one of the following categories:

  • Beginner's advice: what's a recommended setup? What are "best practices"? What's the best way to transition if you know Rails or insert-your-favorite-framework-here?

  • What cool applications out there exist? How do people put app engine to use?

  • Can you shine light on parts of the framework that are not so well documented?


After some consideration, I have decided (for now) to focus on the third aspect. I do not feel qualified yet to give good advice on IDEs or best practices (I still use jedit and a commandline, although I'd really love to get my Eclipse setup running). As far as cool apps is concerned, I am too lazy to hunt them down (although I won't hesitate to point them out if I ever run into them, see the yuil-article earlier this month). Thus, framework archeology (cue Indiana Jones theme here ;-) it is.

My task for the next couple of weeks is going to be to connect a local (aka running on the devserver) App Engine application to a relational database. For simplicity (I really don't want to set up Oracle or MySQL on my puny little laptop), I have chosen SQLite. My mission is going to be:


  • Find out how the current, file-based connector for the dev appserver ticks.

  • Replace each of its features (saving data, retrieving data, queries, transactions...) with the SQL-based analogon.

  • Take a regular App Engine app and run it against the database.



Now, the reader of this blog might ask him- or herself, why would I do this? After all, even if I get this to run in the dev-appserver, there is no way SQLite could work for a deployed application. While that is certainly true, I would like to point out that there are still a couple of reasons that it's worth the effort. First of all, trying to do this should create some valuable insights on the way the lower-level persistence API happens to work. This knowledge is transferrable to other applications, like decorators around the existing datastore to add additional features, or a more complete integration for other python-based web frameworks. Furthermore, groundwork like this will hopefully show that people do not need to be concerned about lock-in when developing for App Engine. Maybe someday something like this could be integrated with other containers like AppDrop -- who knows... Last but not least, I expect it to be a lot of fun!

So, where do I start? My first seam is an old forum post on how to make an in-memory datastore instance for unit tests. Guess what? Not only happens that class to be Apache open sourced (which my final solution will also be, should I manage to pull this off!), it also ships as part of the standard SDK. I took the liberty of opening the file (datastore_file_stub.py) in a text editor and gut it to build the skeleton for my new implementation:

class DatastoreSqliteStub(object):
""" Datastore stub implementation that uses a sqlite instance."""

def __init__(self, database_name):
"""Constructor.

Initializes and loads the datastore from the backing files, if they exist.

Args:
database_name: the name of the sqlite instance
"""
#TODO: initialize sqlite instance
pass


def MakeSyncCall(self, service, call, request, response):
""" Taken pretty much verbatim from the original datastore_file_stub."""
assert service == 'datastore_v3'
explanation = []
assert request.IsInitialized(explanation), explanation
(getattr(self, "_Dynamic_" + call))(request, response)
assert response.IsInitialized(explanation), explanation


def _Dynamic_Put(self, put_request, put_response):
#TODO: find out what this is good for
pass


def _Dynamic_Get(self, get_request, get_response):
#TODO: find out what this is good for
pass


def _Dynamic_Delete(self, delete_request, delete_response):
#TODO: find out what this is good for
pass


def _Dynamic_RunQuery(self, query, query_result):
#TODO: find out what this is good for
pass


def _Dynamic_Next(self, next_request, query_result):
#TODO: find out what this is good for
pass


def _Dynamic_Count(self, query, integer64proto):
#TODO: find out what this is good for
pass


def _Dynamic_BeginTransaction(self, request, transaction):
#TODO: find out what this is good for
pass


def _Dynamic_Commit(self, transaction, transaction_response):
#TODO: find out what this is good for
pass


def _Dynamic_Rollback(self, transaction, transaction_response):
#TODO: find out what this is good for
pass


def _Dynamic_GetSchema(self, app_str, schema):
#TODO: find out what this is good for
pass


def _Dynamic_CreateIndex(self, index, id_response):
#TODO: find out what this is good for
pass


def _Dynamic_GetIndices(self, app_str, composite_indices):
#TODO: find out what this is good for
pass


def _Dynamic_UpdateIndex(self, index, void):
#TODO: find out what this is good for
pass


def _Dynamic_DeleteIndex(self, index, void):
#TODO: find out what this is good for
pass



So what does this code actually mean? It turns out that there are three layers of datastore APIs in our SDK:
  • the high-level api from google.appengine.ext.db is what we are all used to by now. It has Model classes with data validation, GQL-queries and a lot of other goodness.
  • beneath the high-level api, there is the module google.appengine.api.datastore. This module is what all the high-level stuff gets translated to. Datastore entities are mostly sets of values, and queries are hierarchies of objects rather than an easily human-readable language. Some of the early samples published when App Engine went live (see the wiki example mentioned in this old post) use the datastore. Personally, I think the Models are much nicer :-)
  • the lowest-level is the API shown in the skeleton. All requests to the datastore are compiled into a tuple of objects: a "request" object containing all the input and a "response" object with all the output. Together with details on what method to call, these objects are shoved into a method called MakeSyncCall. Depending on a verb (call) that is passed along with the request, this method will then call one of the "dynamic"-method stubs shown in this source.
Over the next couple of weeks, I will do my best to analyze these methods one by one and translate them into the SQL-equivalents. At the end of this series of articles, I hope to have the ability to run a local app engine app against a database that can also be accessed with other db/reporting tools, populated through sql scripts, easily backed up and manipulated. Keep your fingers crossed...

Sunday, August 24, 2008

Writer's block?

I hate it when this happens: I'm fresh out of ideas what to write about. Well, maybe not so "fresh" -- it's not like I had any clue what to post last week, either. Either way, I guess I currently have no material, and I'm hating it.

Sitting here and staring at my monitor, I figure I have a couple of options. I could simply sit this one out, wait until I have a new idea or something new happens in the app engine world (and yes, I know that the team just extended the datastore API, but everything that could be said about this has been said before). I could also simply extend the scope, go onto miscellaneous rants on languages, programming techniques and such. The problem is, I'm not a very good ranter: like I mentioned before, I do not really give a rat's patoot if Java is superior to Rails or strongly typed languages should be preferred to duck typing. If it was the best tool for the job, I'd go back and do stuff in Turbo Pascal if I had to.

In the end, I decided to let the reader decide: if you have any suggestion what I should be doing in this blog, post it to this thread or fill out the form below. Let me know what interests you. Feedback is always appreciated.

Sunday, August 10, 2008

Syncing your data

In the spirit of openness, I'd like to discuss a topic that pops up every now and then with regards to building app engine apps: how do I get my data out of the cloud? As an application developer, I would like to make sure that I can access my data outside the Google datacenters in any shape or form I like. There are quite a few valid use cases, such as the following examples:

  • I build an online shopping site that takes sales orders and stores them in the database. In order to process the orders later for invoicing, bookkeeping, shipping and handling, I need to get the data out of the cloud and into my ERP system.

  • I build an online blogging site. Once a day, I would like to run a report on all newly submitted articles and send email notifications to subscribed readers.

  • I like the cloud, but I don't trust it. I want a local copy of my data in case I ever decide to host the service in my own datacenter.



The following code shows one technique to achieve these goals. It is not very new or original, and if it has already been described somewhere else, I apologize in advance. Also, since I do not want to reinvent the wheel, I will not come up with an new sample application but reuse the SSN model from the "keeping private data private" and for ways to insult a nose articles. Now on to the new code: the idea is to simply log all modifications made to the store in a separate model (let's call it DataLog):

class DataLog(db.Model):
"""Model represents a modification made to the datastore."""
action = db.StringProperty(required=True)
timestamp = db.DateTimeProperty(auto_now_add=True)
data = db.TextProperty()



The model has three properties: an "action"-field that describes what action has been performed (put for saved data, delete for removed data), a timestamp of when the change was made (warning: these timestamps are not guaranteed to be entirely exact!), and a textual representation of the change to the store. The latter is easy to obtain, since the base Model class conveniently comes with a to_xml method. We create a method patch_for_logging that takes a model-class as an argument and replaces put and delete with decorators that log any change:

def patch_for_logging(modelclass):
"""Patches a model class to log puts and deletes."""
old_put = modelclass.put
old_delete = modelclass.delete
def new_put(self):
result = old_put(self)
DataLog(
action='put',
parent=self.parent(),
data=self.to_xml()).put()
return result
def new_delete(self):
result = old_delete(self)
DataLog(
action='delete',
parent=self.parent(),
data=self.to_xml()).put()
return result
if not getattr(modelclass, 'old_put', None):
setattr(modelclass, 'old_put', old_put)
setattr(modelclass, 'put', new_put)
if not getattr(modelclass, 'old_delete', None):
setattr(modelclass, 'old_delete', old_delete)
setattr(modelclass, 'delete', new_delete)



It should be noted that we set the parent of the log-entry to self.parent(). This is necessary to avoid breaking code that runs in a transaction. Assumed that we save this code in a file called datalog.py, then modifying our original application is as easy as the following lines:

class SSN(db.Model):
user = db.UserProperty()
ssn = db.StringProperty(required=True)
import datalog
datalog.patch_for_logging(SSN)


This will ensure that any change is logged. The only thing that's left is to is to provide a way to access the data. The following sample handler does the trick:

# Very simplified handler that prints out database changes
class MainPage(webapp.RequestHandler):

def get(self):
self.response.out.write('<changes>\n')
for change in DataLog.all():
self.response.out.write(
'<change action="%s" timestamp="%s">\n' % (
change.action, change.timestamp))
self.response.out.write(change.data)
self.response.out.write('</change>\n')
self.response.out.write('</changes>\n')



The following XML is an example how the changelog would look like when requested through the handler:

<changes>
<change action="put" timestamp="2008-08-10 22:03:09.503340">
<entity kind="SSN" key="agNhZWZyCQsSA1NTThgBDA">
<key>tag:aef.gmail.com,2008-08-10:SSN[agNhZWZyCQsSA1NTThgBDA]</key>
<property name="ssn" type="string">12345679</property>
<property name="user" type="user">test@example.com</property>
</entity>
</change>
<change action="put" timestamp="2008-08-10 22:03:34.460330">
<entity kind="SSN" key="agNhZWZyCQsSA1NTThgBDA">
<key>tag:aef.gmail.com,2008-08-10:SSN[agNhZWZyCQsSA1NTThgBDA]</key>

<property name="ssn" type="string">098765432</property>
<property name="user" type="user">test@example.com</property>
</entity>
</change>
</changes>

That was easy, wasn't it? :-)


A couple of disclaimers:
Obviously, the code mentioned above only works when you decorate each model that should be logged. It also requires that the developer sticks to models and does not modify the database by using the underlying lower-level datastore API or the developer console.

The handler that I am using in this example should not be used for production code. It has a couple of serious flaws:
  • it is not secure -- anyone who can guess the URL can get to it. You might want to use a uri signing scheme, similar to the "matter of trust" example.
  • it does not scale, since it always loads all log entries from the database. For production code, you'd want some api that tells the server that certain log entries have been processed and can be deleted from storage (but that's beyond the scope of this post).
In addition you might consider extending the log to have an additional ever-increasing id, possibly created through a global counter. This way, you could use the "ORDER BY" statement in GQL to sort the entries and only retrieve the earliest or latest elements. Like mentioned before, the timestamp is not reliable enough to guarantee elements to stay in exactly the same order -- don't rely on it too much.

If all this seems like too much effort, then another alternative might also be to host your own storage layer, like CouchDB, and let your application connect to it using urlfetch. If you decide to do so however, be sure to use a really fast datacenter to avoid latency. Also, uptime of the overall app will depend on the uptime of your hosted data storage, so you still might want to use the App Engine memcache and/or database for performance and reliability reasons.

Sunday, August 3, 2008

Rumble in the cloud - the web is open!

If you haven't been living under a rock for the last week or so, you have probably heard about the search engine Cuil and its rocky start into the online world. You might have even heard of Yuil -- a visual lookalike that used the Yahoo BOSS APIs to deliver significantly better results with the same layout. But were you aware that this whole thing was written on Google App Engine?

Before everyone out there starts crying HuddleChat, I'd like to point out that the authors have undertaken a major effort to redesign the site to make it look less "cuil-ish". They even thought of the branding and renamed it from "yuil" to "4hoursearch". Why 4hoursearch? Well, let's ask the authors:

It took 4 hours to write the initial code, 4 hours for it to go from unknown to 20 hits / second, 4 hours looking for a domain name and 4 hours to build the brand new UI. Fortunately, it won’t take 4 hours to find something with it
So, why would I (who doesn't even personally work in search) care enough to blog about this? For a couple of simple reasons:
  1. Not only did the authors "claim" that doing the port was easy, they also provide the proof: you can download the source code and study it. It is a beautiful little piece of code. I probably couldn't have done it in this timeframe, but I can see how someone familiar with the APIs could. Kudos to the engineering work!
  2. From the first launch of App Engine on, there has been a lot of discussion on the matter of lock in and whether using this environment would force developers to do it the "Google way" and never be able to switch to another platform. 4hoursearch is a great example on how pieces in the cloud can be woven together, interact and enhance each other. This service is using Google infrastructure to perform Yahoo searches -- how more open could the cloud be?!?
  3. When I originally started this blog, I wanted to focus on how App Engine can be used to develop commercially successful apps with little rampup time or cost. Now, I might have strayed from that path a little bit, but this opportunity is just too good to pass up. If this would not have been a Yahoo executive but a regular garage startup guy, he or she could have promoted the heck out of this and make good money just through ad revenue. Instead, lots of little cookiecutter enterpreneurs now have a perfect template for building their own search engines -- it will be interesting to see how they will make use of it.
Well, so much for this week's post -- time to get out of the virtual cloud and catch a few more rays of the warm California sun! If you are interested to learn more about open standards and how they are be used to keep the web open, you might want to check out the Open Web Foundation.