The problem stems from a design decision I made early on when deciding on how to model entities in an RDBMS:
Generally speaking, there are two schools of thought on how the database could be set up. Option number one would be a one-table-fits-all approach:
- We create one table, let's say "MODELS" that contains one row for each model put into the datastore (with a column describing its type and a primary key).
- A second table, "PROPERTIES", contains all the properties of a model instance. Each potential data type (integer, string, blob...) would be a separate column in the schema. A separate column would contain the primary key of the model, possibly with a foreign key constraint towards the MODELS-table. When we load a model from the table, we would query for all rows with the same primary key.
This approach has many advantages. It is simple, robust towards schema migration, requires no changes to the datastore when a new model is introduced. There is only one problem: I don't like it and this is my project, so I'll do something else instead ;-).
The reason I chose not to use this approach is that I consider it very hard to do non-appengine operations on such a store, such as visualizing entity relationships or creating reports. Therefore, I chose to create a new table for each type of entity. If, for example, we stored an entity called Foo, I would create a table with the same name. For every property I intended to store, I would create a new column. Columns were named as a combination of the type that would be stored within and the name of the property. So, if I was to store integer values for a property called Bar, this would go into the column named int64_Bar. I would also create two additional columns pk_string and pk_int to store the primary key of an entity.
While this approach created a relatively nice database schema, it came with a couple of issues. Mostly, it forced the user to define his or her database schema in advance. Some of the beauty of the App Engine datastore is that this is not really necessary -- properties may even be added on the fly, as shown in the Expando models. To mimic this flexibiliy, I wanted the sql connector to
- Create a new table for an entity that did not exist yet.
- Add new column definitions if required.
Fortunately, SQL provides the possibility to do so, using the CREATE TABLE and ALTER TABLE statements. I even had most of the logic already implemented for my unit tests; I just needed to refactor and generalize it in the PRMHelper.
First, I created a helper that would decide what SQL data type should be used for a particular python type:
def __init__(
self, get_connection, release_connection):
"""Constructor.
Args:
get_connection: a parameterless function that provides a new
connection from the pool
release_connection: a function that accepts a connection
and puts it back into the pool
"""
self.get_connection = get_connection
self.release_connection = release_connection
self.__types = { str: 'TEXT', long: 'INTEGER',
int: 'INTEGER', float: 'DOUBLE'}
def __suggestRows(self, sample_dict):
"""Suggests a list of row definitions for this SQL dialect.
Args:
sample_dict: a dictionary of column names and sample
entries (required to determine the sql column type
that should be used.)
Returns:
a list of strings of form "column_name type" that
could be used for a sql create table statement.
"""
result = []
for key, val in sample_dict.items():
value_type = type(val)
assert value_type in self.__types, ('Cannot '
'convert type %s' % value_type)
result.append('%s %s' % (key, self.__types[value_type]))
return result
This method would take a key/value pair like (int64_Bar, 3) and convert it into a string int64_Bar INTEGER. These strings are the base component that make up the schema definition of a particular table. There was already another helper called getSchema that would produce a set of existing columns for a particular table. All I now had to do is compare the existing schema with the rows that should be written, compute the difference and convert that either into a CREATE TABLE or ALTER TABLE statement:
def suggestMutation(self, connection, table_name, sample_dict, add_rowz=True):
"""Suggests an sql statement that mutates a db schema.
Given a table name, this method will check if the table exists
and if a given set of column names exist. If the table does
not exist, it will return a CREATE TABLE statement that
can be executed. If the table does exist but one or more
column names do not exist, it will return a statement to
modify the table instead.
Args:
connection: the sql connection that should be used to
gather any metadata
table_name: the name of the table that should be validated.
sample_dict: a dictionary of column names and sample
entries (required to determine the sql column type
that should be used.)
add_rowz: if set to true, it will create any "standard" columns
(for primary keys and such) upon table creation.
True is default.
Returns:
a list sql statements that can be executed to perform the mutation
"""
# Check if the table exists and if so, eliminate duplicate rows
new_rows = dict(sample_dict)
new_table = False
current_schema = self.getSchema(connection, table_name)
if current_schema is None:
new_table = True
else:
for columns in current_schema.values():
for column in columns:
new_rows.pop(column, None)
new_rows.pop('pk_string', None)
new_rows.pop('pk_int', None)
# If there are no new rows, we are done
if not new_table and not len(new_rows):
return []
# Translate each key/value pair into a SQL-ish type definition
# and two more columns for the primary key, if necessary
snippets = self.__suggestRows(new_rows)
if new_table and add_rowz:
snippets.append('pk_int INTEGER PRIMARY KEY')
snippets.append('pk_string TEXT')
# Convert the snippets into a proper statement
if new_table:
return ['CREATE TABLE %s (%s);' %
(table_name, ','.join(snippets))]
else:
return ['ALTER TABLE %s ADD COLUMN %s' %
(table_name, snippet) for snippet in snippets]
All that was now left to do was to extend the database put to check for schema modifications and apply them if necessary. For the concrete implementation, you can check out the source code on code.google.com.
How well does this approach work? It passes all the unit tests, but it is certainly not ideal. Since it is running before every database put, this means that every entry into the database needs to check the schema for necessary modifications. Since we do not cache the schema definition anywhere, the result is additional database load and a slight performance hit (I have not really measured this, though). I am not to worried about it for the moment, but once the connector is feature complete, this is something that should definitely be tweaked. If anyone would like to take a shot at it, please do not hesitate to let me know.
3 comments:
This is not the right way to do this. In your design when a property is not set, it will be NULL — on the real datastore it will not exist (not even as None). How are you going to index ListProperties?
The real way to do this, and the way it works under the covers on Google's end, is to store each entity-attribute-value triple as its own 'row'.
> This is not the right way to do this [...] The real way to do this, and the way it works under the covers on Google's end, is to store each entity-attribute-value triple as its own 'row'.
Always depends on your objective. My objective is not to dump data in the rdbms somehow, but to do it in a way that makes the data accessible and easily usable outside the App Engine app. Otherwise, I would have used Couch Db. Have you ever tried to run a reporting tool on a generic schema like the one you recommend? How many table joins would you need to create views that work well with Crystal?
> In your design when a property is not set, it will be NULL — on the real datastore it will not exist (not even as None).
Interesting point -- that's something that should be addressed. Can you post a unit test that demonstrates how this poses a problem (so that we can make sure that the solution actually fixes the issue)? Do you know by any chance if "None" is a permitted value for the datastore? If not, I could simply filter out those fields. As an alternative, I could also do what the protocol buffers do and have a separate column "has_fieldname" for each column.
> How are you going to index ListProperties?
Haven't dug too deep into it. My current thoughts on it are:
introduce a new column "type" fk (like int64_Bar, fk_Bar) that stores a foreign key to a list-table. list elements and their position in a list are then stored in that table. When a GQL query is performed, the connector would need to extract from the schema that a foreign key constraint exists and perform additional database queries to get to that information.
fwiw -- I added unit tests to the project that validate whether the datastores behave consistently regarding "None" (see http://code.google.com/p/gae-sqlite/source/diff?spec=svn8&r=8&format=side&path=/trunk/unittests.py ). Currently, that is not the case. I'll have to fix that at some point.
Post a Comment