Saturday, November 29, 2008

Implementing schema evolution

Today, I am going to write about the implementation of schema evolution. Two blog posts ago, I posted a couple of unit tests that demonstrated what happened if a new model got added to the database (or a new field to an existing model).

The problem stems from a design decision I made early on when deciding on how to model entities in an RDBMS:

Generally speaking, there are two schools of thought on how the database could be set up. Option number one would be a one-table-fits-all approach:

  • We create one table, let's say "MODELS" that contains one row for each model put into the datastore (with a column describing its type and a primary key).

  • A second table, "PROPERTIES", contains all the properties of a model instance. Each potential data type (integer, string, blob...) would be a separate column in the schema. A separate column would contain the primary key of the model, possibly with a foreign key constraint towards the MODELS-table. When we load a model from the table, we would query for all rows with the same primary key.

  • This approach has many advantages. It is simple, robust towards schema migration, requires no changes to the datastore when a new model is introduced. There is only one problem: I don't like it and this is my project, so I'll do something else instead ;-).


The reason I chose not to use this approach is that I consider it very hard to do non-appengine operations on such a store, such as visualizing entity relationships or creating reports. Therefore, I chose to create a new table for each type of entity. If, for example, we stored an entity called Foo, I would create a table with the same name. For every property I intended to store, I would create a new column. Columns were named as a combination of the type that would be stored within and the name of the property. So, if I was to store integer values for a property called Bar, this would go into the column named int64_Bar. I would also create two additional columns pk_string and pk_int to store the primary key of an entity.

While this approach created a relatively nice database schema, it came with a couple of issues. Mostly, it forced the user to define his or her database schema in advance. Some of the beauty of the App Engine datastore is that this is not really necessary -- properties may even be added on the fly, as shown in the Expando models. To mimic this flexibiliy, I wanted the sql connector to

  • Create a new table for an entity that did not exist yet.

  • Add new column definitions if required.



Fortunately, SQL provides the possibility to do so, using the CREATE TABLE and ALTER TABLE statements. I even had most of the logic already implemented for my unit tests; I just needed to refactor and generalize it in the PRMHelper.

First, I created a helper that would decide what SQL data type should be used for a particular python type:

  def __init__(
self, get_connection, release_connection):
"""Constructor.

Args:
get_connection: a parameterless function that provides a new
connection from the pool
release_connection: a function that accepts a connection
and puts it back into the pool
"""
self.get_connection = get_connection
self.release_connection = release_connection
self.__types = { str: 'TEXT', long: 'INTEGER',
int: 'INTEGER', float: 'DOUBLE'}

def __suggestRows(self, sample_dict):
"""Suggests a list of row definitions for this SQL dialect.

Args:
sample_dict: a dictionary of column names and sample
entries (required to determine the sql column type
that should be used.)
Returns:
a list of strings of form "column_name type" that
could be used for a sql create table statement.
"""
result = []
for key, val in sample_dict.items():
value_type = type(val)
assert value_type in self.__types, ('Cannot '
'convert type %s' % value_type)
result.append('%s %s' % (key, self.__types[value_type]))
return result


This method would take a key/value pair like (int64_Bar, 3) and convert it into a string int64_Bar INTEGER. These strings are the base component that make up the schema definition of a particular table. There was already another helper called getSchema that would produce a set of existing columns for a particular table. All I now had to do is compare the existing schema with the rows that should be written, compute the difference and convert that either into a CREATE TABLE or ALTER TABLE statement:

  def suggestMutation(self, connection, table_name, sample_dict, add_rowz=True):
"""Suggests an sql statement that mutates a db schema.

Given a table name, this method will check if the table exists
and if a given set of column names exist. If the table does
not exist, it will return a CREATE TABLE statement that
can be executed. If the table does exist but one or more
column names do not exist, it will return a statement to
modify the table instead.

Args:
connection: the sql connection that should be used to
gather any metadata
table_name: the name of the table that should be validated.
sample_dict: a dictionary of column names and sample
entries (required to determine the sql column type
that should be used.)
add_rowz: if set to true, it will create any "standard" columns
(for primary keys and such) upon table creation.
True is default.
Returns:
a list sql statements that can be executed to perform the mutation
"""

# Check if the table exists and if so, eliminate duplicate rows
new_rows = dict(sample_dict)
new_table = False
current_schema = self.getSchema(connection, table_name)
if current_schema is None:
new_table = True
else:
for columns in current_schema.values():
for column in columns:
new_rows.pop(column, None)
new_rows.pop('pk_string', None)
new_rows.pop('pk_int', None)

# If there are no new rows, we are done
if not new_table and not len(new_rows):
return []

# Translate each key/value pair into a SQL-ish type definition
# and two more columns for the primary key, if necessary
snippets = self.__suggestRows(new_rows)
if new_table and add_rowz:
snippets.append('pk_int INTEGER PRIMARY KEY')
snippets.append('pk_string TEXT')

# Convert the snippets into a proper statement
if new_table:
return ['CREATE TABLE %s (%s);' %
(table_name, ','.join(snippets))]
else:
return ['ALTER TABLE %s ADD COLUMN %s' %
(table_name, snippet) for snippet in snippets]


All that was now left to do was to extend the database put to check for schema modifications and apply them if necessary. For the concrete implementation, you can check out the source code on code.google.com.

How well does this approach work? It passes all the unit tests, but it is certainly not ideal. Since it is running before every database put, this means that every entry into the database needs to check the schema for necessary modifications. Since we do not cache the schema definition anywhere, the result is additional database load and a slight performance hit (I have not really measured this, though). I am not to worried about it for the moment, but once the connector is feature complete, this is something that should definitely be tweaked. If anyone would like to take a shot at it, please do not hesitate to let me know.

Saturday, November 22, 2008

Sneezing Panda

Not a lot going on this week -- I am mostly out with a nasty cold.

My original plan of implementing schema evolution this weekend will have to wait a little longer -- right now, I'm happy if I don't sneeze out parts of my brain into a hankerchief ;-) Until I have something more interesting to blog, let me just mention a couple of interesting tidbits:



Have a great week everyone, and don't forget your flu shots!

Sunday, November 16, 2008

Wanted: Intelligent Design for Schema Evolution

A few days ago, I published a snapshot of the "SQLite bindings for Google App Engine." Now that the code is out, I had to decide what feature to work on next. I started with supporting a couple of more primitive types in the schema, but that didn't really seem to interesting from a blogging perspective. Thus, I have decided to tackle the idea of Schema Evolution next.

Right now, one cannot simply connect to database and start putting models in -- one has to define a table for each kind with the right columns to support our current primitives. The method "create_tables" in helpers.py does that job for us. Unfortunately, this is not very convenient compared to the real thing in App Engine, because
  • I cannot simply create a new Model and store it in the database
  • I cannot simply add a new Property to and existing model
  • I cannot store different types under the same property name
This is something I would like to change, so I created a couple of unit tests that describe the very behavior I am looking for:

  def testNewModel(self):
"""Tests what happens if a new kind of Model gets stored."""
class UnknownKind(TestModel):
pass
model = UnknownKind()
model.put()

def testAddedField(self):
"""Tests what happens if a field gets added to a model."""
helpers.create_tables([TestModel()], self.connection)
class Mutation(TestModel):
@classmethod
def kind(cls):
return 'TestModel'
text2 = db.StringProperty(default='some more text')
model = Mutation(text='Text 1', text2='Text 2')
model.put()

def testTypeChange(self):
"""Tests what happens if a field changes its type."""
helpers.create_tables([TestModel()], self.connection)
class Mutation(db.Model):
@classmethod
def kind(cls):
return 'TestModel'
text = db.IntegerProperty(default=42)
model = Mutation(text=23)
model.put()


Right now, each of these tests is going to fail, as shown here:

======================================================================
ERROR: Tests what happens if a field gets added to a model.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jens/appengine/sqlite/src/unittests.py", line 176, in testAddedField
model.put()
File "/home/jens/appengine/google_appengine/google/appengine/ext/db/__init__.py", line 618, in put
return datastore.Put(self._entity)
File "/home/jens/appengine/google_appengine/google/appengine/api/datastore.py", line 160, in Put
apiproxy_stub_map.MakeSyncCall('datastore_v3', 'Put', req, resp)
File "/home/jens/appengine/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 46, in MakeSyncCall
stub.MakeSyncCall(service, call, request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 104, in MakeSyncCall
(getattr(self, "_Dynamic_" + call))(request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 154, in _Dynamic_Put
value_list)
OperationalError: table TestModel has no column named string_text2

======================================================================
ERROR: Tests what happens if a new kind of Model gets stored.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jens/appengine/sqlite/src/unittests.py", line 165, in testNewModel
model.put()
File "/home/jens/appengine/google_appengine/google/appengine/ext/db/__init__.py", line 618, in put
return datastore.Put(self._entity)
File "/home/jens/appengine/google_appengine/google/appengine/api/datastore.py", line 160, in Put
apiproxy_stub_map.MakeSyncCall('datastore_v3', 'Put', req, resp)
File "/home/jens/appengine/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 46, in MakeSyncCall
stub.MakeSyncCall(service, call, request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 104, in MakeSyncCall
(getattr(self, "_Dynamic_" + call))(request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 154, in _Dynamic_Put
value_list)
OperationalError: no such table: UnknownKind

======================================================================
ERROR: Tests what happens if a field changes its type.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jens/appengine/sqlite/src/unittests.py", line 187, in testTypeChange
model.put()
File "/home/jens/appengine/google_appengine/google/appengine/ext/db/__init__.py", line 618, in put
return datastore.Put(self._entity)
File "/home/jens/appengine/google_appengine/google/appengine/api/datastore.py", line 160, in Put
apiproxy_stub_map.MakeSyncCall('datastore_v3', 'Put', req, resp)
File "/home/jens/appengine/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 46, in MakeSyncCall
stub.MakeSyncCall(service, call, request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 104, in MakeSyncCall
(getattr(self, "_Dynamic_" + call))(request, response)
File "/home/jens/appengine/sqlite/src/datastore_base.py", line 154, in _Dynamic_Put
value_list)
OperationalError: table TestModel has no column named int64_text

----------------------------------------------------------------------


My goal for the next post in this blog is going to be to make these tests pass! That is, of course, unless someone out there gets it right first!

Does anyone have an idea how to get the job done? Code it and post the code for review. Put a comment at the end of this post, and I'll be more than happy to take a look at your result. I cannot guarantee fortune and fame, but it might be a fun project to work on :-) Just give it a shot!

(P.S: The unit tests from this blogpost are already submitted to the project)

Wednesday, November 12, 2008

It's alive

Where should we be if nobody tried to find out what lies beyond? Have you never wanted to look beyond the clouds and the stars or to know what causes the trees to bud and what changes the darkness into light? But if you talk like that, people call you crazy. But if I could discover just one of these things, what eternity is, for example, I wouldn't care if they did think I was crazy!
Frankenstein


Last month, I wrote I would not post the sql connector's source code until it was a little more functional. Guess what? I changed my mind: you can find all the code at http://code.google.com/p/gae-sqlite/.

I have not posted in a while (four weeks, to be precise). I'd like to claim that this was because I was busy implementing all the missing features in this project, but that would be a lie ;-). I was involved in launching Labs for Google Apps, a set of App Engine based services for enterprise users. The last weeks before a launch are a great experience, but they also require focus -- in other words, I put my hobbies (including this blog) on hold for a while.

After the launch, I took some time off to recharge the batteries. Now that I'm back from vacation, I will continue to work on this. However, I also suddenly realized that a month had passed without any updates, while some people are waiting to see how this is doing. That wasn't very fair, and I apologize. To prevent this from happening in the future, I uploaded the sources to Google Code. This way, you guys can always get to the latest version. Also, if you find any bugs or would like to make improvements, feel free to let me know.