Sunday, September 7, 2008

AppEngine and SQLite: storing values in the database (part 2)

Time to implement the put operation! By the end of this article, we will be able to write very simple models into the database. The code for the put-method will not be complete: we will only support integer values and strings for now, no transactions, no lists. Still, it will be a basic skeleton that we can build on.

Now that we know what we are going to do, how are we going to do it? First, let's take a look how datastore_file_stub implements the put method:

def _Dynamic_Put(self, put_request, put_response):
clones = []
for entity in put_request.entity_list():
clone = entity_pb.EntityProto()
clone.CopyFrom(entity)
clones.append(clone)

assert clone.has_key()
assert clone.key().path().element_size() > 0

app = self.ResolveAppId(clone.key().app())
clone.mutable_key().set_app(app)

last_path = clone.key().path().element_list()[-1]
if last_path.id() == 0 and not last_path.has_name():
self.__id_lock.acquire()
last_path.set_id(self.__next_id)
self.__next_id += 1
self.__id_lock.release()

assert clone.entity_group().element_size() == 0
group = clone.mutable_entity_group()
root = clone.key().path().element(0)
group.add_element().CopyFrom(root)

else:
assert (clone.has_entity_group() and
clone.entity_group().element_size() > 0)

self.__entities_lock.acquire()

try:
for clone in clones:
last_path = clone.key().path().element_list()[-1]
kind_dict = self.__entities.setdefault((app, last_path.type()), {})
kind_dict[clone.key()] = clone
finally:
self.__entities_lock.release()

if not put_request.has_transaction():
self.__WriteDatastore()

put_response.key_list().extend([c.key() for c in clones])



Hmmm... looks mighty complicated, doesn't it? Actually, it's not so bad when you summarize it in pseudo-code. It seems that the file-based datastore keeps its entire content in memory, and just writes it to disk to provide persistence in case the program dies or gets killed and restarted. Let's look at it from a higher level:

def _Dynamic_Put(self, put_request, put_response):
for each entity from the put_request:
create a defensive copy called "clone"
make sure that the clone has a valid key
copy the app's name into the key of the clone

if entity has no existing or custom_key
(in other words last_path.id() == 0
and not last_path.has_name()):
create a new numeric primary key for the entity
copy the entity-group from the key's root
else:
make some more data consistency checks

for each of the clones:
put the clone into an
in-memory represenation of the datastore

if we are in a transaction:
write the in-memory representation to disc
else: # (implicit, not in the original code)
do nothing, something else will persist it later

put all the keys from the clones into put_response



While this gives us a basic framework on how to proceed, it also means we do not get a lot of help on how to mode our data store. However, this is the first decision we have to make before we can actually write anything to the database.


Our relational schema


Generally speaking, there are two schools of thought on how the database could be set up. Option number one would be a one-table-fits-all approach:


  • We create one table, let's say "MODELS" that contains one row for each model put into the datastore (with a column describing its type and a primary key).

  • A second table, "PROPERTIES", contains all the properties of a model instance. Each potential data type (integer, string, blob...) would be a separate column in the schema. A separate column would contain the primary key of the model, possibly with a foreign key constraint towards the MODELS-table. When we load a model from the table, we would query for all rows with the same primary key.



This approach has many advantages. It is simple, robust towards schema migration, requires no changes to the datastore when a new model is introducedl. There is only one problem: I don't like it and this is my project, so I'll do something else instead ;-).

I won't go too much into specifics here, but suffice it to say that in a previous life, long ago at another job in a galaxy far, far away, I had my share of trouble with a generic schema like this. Getting it to perform well with external tools, e.g. for reporting, was quite painful. I tend to think that if somebody goes through all the trouble and connect App Engine to a relational database, he or she won't mind the little extra inconvenience of managing a schema.

So, here's the plan: for every different model, I will create a new table (with the same name that the Model has). For every field in the Model, I will create a new row of the proper database type (TEXT for string, INTEGER for numbers and so on). Since I want to make life easier for myself in the future however, I would like to have a tool that creates the initial schema for me, similar to Active Record Migration in Rails. The following unit test will validate that our TestModel is converted into a nice little create-table statement:

class TestModel(db.Model):
text = db.StringProperty(default='some text')
number = db.IntegerProperty(default=42)

#...

def testCreateTabledef(self):
"""Checks if our TestModel can be converted into a valid SQL table."""
helpers.create_tables([TestModel()], self.connection)
cursor = self.connection.cursor()
cursor.execute(
"INSERT INTO TestModel(string_text,int64_number) "
"VALUES ('test', 13)")




So, how does our create_tables method look like? Well, while I was clicking through the debugger to make sense of the low-level APIs, I ran into some helpful methods. _populate_internal_entity() converts a Model-instance into an Entity object (not the protocol buffer, but a class of the same name in the lower-level API). That entity has a method _ToPb() that will finish the conversion into a protocol-buffer. Once we have this protocol buffer, the following code can transform it into a map of key/value pairs:

def entityToDict(pb):
"""Helper, converts an entity protocol buffer to a dictionary.
This method will only deal with the properties of the entity,
not its primary key.

Args:
pb: the entity protocol buffer

Returns:
a dictionary with appropriate kev/value pairs that can
be stored in a SQLite database.
"""
result = {}
#TODO: what about raw properties?
for property in pb.property_list():
#TODO: what about multiple properties?
assert not (property.has_multiple() and property.multiple())
if not property.has_value():
continue
key = property.name()
value = property.value()
if value.has_int64value():
result['int64_' + key] = value.int64value()
elif value.has_stringvalue():
result['string_' + key] = value.stringvalue()
else:
raise 'Not supported yet: %s' % value
return result


As the reader can see, I take the name of the property and prepend its type to create a column name. An integer-property named "foo" would thus be transformed to "int64_foo". As a result, I retrieve a dictionary with the column names as keys and the values for later insertion into the database. The following code takes this information and translates it into a SQL statement that creates a table:

def create_tabledef(model_instance):
"""Create a SQL statement to populate a table from a fully populated Model.

Args:
model_instance: an instance of a model, each field filled with a
non-None value

Returns:
a SQL statement that could be used to create a new table.
"""

# Convert the model into a map of propert key/value pairs
entity = model_instance._populate_internal_entity()
pb = entity._ToPb()
as_dict = datastore_sqlite_stub.entityToDict(pb)

# Translate each key/value pair into a SQL-ish type definition
types = {str: 'TEXT', long: 'INTEGER'}
def convert(value):
key = type(value)
assert 'Cannot convert type %s' % key, key in types
return types[key]
cols = ['%s %s' % (key, convert(val)) for key,val in as_dict.items()]

# Add two more columns for the primary key
cols.append('pk_int INTEGER PRIMARY KEY')
cols.append('pk_string TEXT')

# Concatenate all type definitions into one create statement
return 'CREATE TABLE %s (%s);' % (model_instance.kind(), ','.join(cols))


Notice that I add two additional columns, pk_int and pk_string. Pk_int contains the primary key in the relational database, which is auto-populated if not given in an insert statement. In case a user chooses to use a string-based primary key instead, we store that entry in pk_string. Now that we have this method defined, it is only a matter of applying the statements to the database:

def create_tables(list_of_models, connection):
"""Creates one or more SQL tables from a list of prepopulated models.

Args:
list_of_models: a list of models, each of a different kind
connection: the SQLite connection that should be used
"""
cursor = connection.cursor()
ok = False
try:
for model in list_of_models:
tabledef = create_tabledef(model)
cursor.execute(tabledef)
ok = True
finally:
if ok:
connection.commit()
else:
connection.rollback()



Writing the data


Back to our put-statement. The following unit-tests should verify that the basic operation is successfully implemented:

  def testWriteSingle(self):
"""Writes a single model to the database retrieves it."""
helpers.create_tables([TestModel()], self.connection)
model = TestModel()
key = model.put()
id = key._Key__reference.path().element_list()[-1].id()
cursor = self.connection.cursor()
cursor.execute(
'SELECT string_text, int64_number FROM TestModel WHERE pk_int=%s' % id)
result = cursor.fetchone()
self.assertEquals('some text', result[0])
self.assertEquals(42, result[1])

def testWriteDouble(self):
"""Writes a value into the database twice."""
helpers.create_tables([TestModel()], self.connection)
model = TestModel()
key = model.put()
id = key._Key__reference.path().element_list()[-1].id()
key = model.put()
id2 = key._Key__reference.path().element_list()[-1].id()
self.assertEquals(id, id2)
cursor = self.connection.cursor()
cursor.execute(
'SELECT COUNT(*) FROM TestModel WHERE pk_int=%s' % id)
result = cursor.fetchone()
self.assertEquals(1, result[0])



Now it is time to implement the put. I have done my best to document what I am doing, so the code should be rather self-explanatory. Basically, this is what I do:


  • I iterate through all entities in the request.

  • If the entity has an existing primary key, I execute a DELETE-statement on that key in the database. I do not bother checking if the object already exists; if it doesn't, the delete is simply a NOP.

  • I convert the entity into a dictionary using the previously defined entityToDict.

  • If there was a pre-existing primary key, I add that key to the dictionary.

  • I create an INSERT statement from the dictionary and execute it.

  • If there was no pre-existing primary key, I fetch the newly created row-id from the cursor and consider that the new key.

  • I finish iterating and return the result.



Here is the code that does it all.


  def _Dynamic_Put(self, put_request, put_response):

# Fetch a cursor from the connection
cursor = self.connection.cursor()

# Iterate theough the instances
clones = []
for entity in put_request.entity_list():
# create a defensive copy called "clone"
clone = entity_pb.EntityProto()
clone.CopyFrom(entity)
clones.append(clone)

# make sure that the clone has a valid key
assert clone.has_key()
assert clone.key().path().element_size() > 0
tablename = clone.key().path().element_list()[0].type()

# populate the key's app name with a value
clone.mutable_key().set_app('sql-app')

# Convert the protocol buffer into a dictionary
values = entityToDict(clone)

# If the instance has a primary key, delete the old row
last_path = clone.key().path().element_list()[-1]
if last_path.id() != 0:
cursor.execute('DELETE FROM %s WHERE pk_int=%s' %
(tablename, last_path.id()))
values['pk_int'] = last_path.id()
if last_path.has_name():
cursor.execute('DELETE FROM %s WHERE pk_string=?' %
tablename, [last_path.name()])
values['pk_string'] = last_path.name()

# Using the dictionary of values, create a SQL query
# TODO: how handle list elements
keyval_list = values.items()
key_list = ','.join([first for first, second in keyval_list])
value_list = [second for first, second in keyval_list]
questionmarks = ','.join(['?' for value in value_list])
cursor.execute(
'INSERT INTO %s (%s) VALUES (%s)' %
(tablename, key_list, questionmarks),
value_list)

# Grab the primary key and update the result
if last_path.id() == 0 and not last_path.has_name():
last_path.set_id(cursor.lastrowid)

# Do a database commit
# TODO: how to handle transactions?
if not put_request.has_transaction():
self.connection.commit()

# Populate the response
put_response.key_list().extend([c.key() for c in clones])


Next steps


Well, I've had enough for this weekend, but I might continue this in a few days from now. I'm still looking for a good syntax highlighter, so let me know if you find one. Beyond that, I'd also appreciate some feedback on what to do next. I could either extend the put method by dealing with more data types, list and such, or I can move on to things like get and query and worry about feature completeness later? What's more interesting? Vote by submitting a comment to this post!

3 comments:

Calvin Spealman said...

I fear your approach leaves something to be desired, while your goal is quite desirable, indeed.

Firstly, the layer you are implementing shouldn't know about models at all. Models do not actually exist in the datastore. They exist simply as a helper facility for us to verify the data we're creating. At this level, all that exists are kinds and entities. Kinds are just the name of a type of entity and entities are just a key, the name of their kind, and an arbitrary number of property name to value pairs.

Secondly, I know you are inexperienced with sqlite and I'm afraid to say that it shows, here. A column in sqlite doesn't need a type. The type you provide is actually just a hint for the automated coercion, but any column can be given data of any value type, even if you declared the column of a different type! This should actually mean you can simplify your properties table a lot, because you only need name and value columns.

I would like to help with this project, actually. Would you consider starting a public code repo, maybe on google code?

The App Engine Fan said...

First of all, thanks for the response. I am always happy about feedback. You took your time putting this detailed comment with valuable suggestions together, so I will do my best to answer as good as I can.

I appreciate that you want to help out, although I believe you are not picking the most diplomatic way of bringing yourself into the picture ;-)

A few comments on your points:

> Firstly, the layer you are implementing shouldn't know about models at all.

And it does not. If you re-read the post carefully, you will see that the only place that uses Models is a method called helpers.create_tables. This is in a module called "helpers" for a reason. The datastore implementation itself works entirely on the protocol buffers.

Again, please correct me if you feel I'm missing something here (which could very well be, since, as stated, I'm making all this up as I code along), but be aware that this method is for convenience only. It is a helper that can create a table definition for you. It is useful for the 98% of the people out there who do work on the Models layer and want some help to create their schema, but it is not required. The method create_tabledef can later on easily be refactored to also accept lower-level entities or protocol buffers directly.

> Secondly, I know you are inexperienced with sqlite and I'm afraid to say that it shows, here.

Well, I might not be very experienced with SQLite (which I never claimed I was), but I'm always willing to learn ;-) Heck, before I started working with App Engine, I hardly knew any python (some may argue I still don't ;-).

>A column in sqlite doesn't need a type.

You might very well be right that SQLite does not need types, but other databases such as Oracle or SQL Server do. While this series of articles uses SQLite, it does so mostly because it is by far the simplest relational-ish database to use for these kind of posts. The end result however should not be a one-off for this particular DB; it should be as close as possible to a recipe to do the same kind of thing for any other relational database. Just because SQLite allows me to cut a couple of corners does not mean I should take them. See also the comment below.

> Would you consider starting a public code repo, maybe on google code?

Totally! My plan is to finish this series over the next couple of weeks and then release the code under apache license. The longer term goal for this project should be to connect App Engine to any relational database, not only SQLite, but that's not only out of scope for this blog but also beyond what this full-time-employed and happily married (and intending to stay that way) developer could do.

Please be patient for the next couple of weeks, though. It was hard enough to come up with this idea for the blog series, so I'm going to milk this as long as I can ;-). Seriously though; I'm doing this for fun right now and just want to make some more progress without having to document my "architecture" or whatnot in advance. People have been able to live without this open-source project in the past; they can easily wait another month or so...

Dirk said...

Hi, you are looking for a good syntax highlighter. Take a look at this one http://pygments.org/

I like your approach very much. It would be great for my project named Pyxer http://groups.google.de/group/pyxer

Are you planning to publish your code? You may also join my project on google code if you like. Would be great to have a fast GAE compatible database solution for running without GAE.

Dirk