Fixing lousy introspection support in Python DB APIs (partially)

For my internship at Think Wize, I’ve been working on a custom Django management command that compares the current model definitions (models.py) and the current database schema, and tries to spot the differences. I’ve called it checkdb. Optionally it can also attempt to fix them, with checkdb --fix.

syncdb is not agile

Django’s syncdb management command, used to synchronise the database with the model definitions, is often criticised because it refuses to alter already existing tables. Django’s development team stands by this. They reason that altering the tables to match changed models is a very complex and ambiguous process, which is too risky to be completely automated. They are probably right. There are so many parameters to take into account. In addition, it is basically impossible to tell whether a field has been changed, or whether a field has been removed and another one added.

This means that, when you change your models during development (and you most likely will, frequently), syncdb is of no use to update your database schema accordingly. Django’s documentation suggests that you use the sql command, which prints out the SQL that you could use to create the entire database schema matching the current model definitions from scratch. It is then up to you to distill some ALTER TABLE statements out of this.

What we like so much about web development frameworks is, amongst other things, the abstraction they offer in terms of persistence: Object-relational mapping, or ORM for short. In layman’s terms, we don’t have to write SQL queries, the framework writes them for us. We can just write our code in an object oriented fashion, where each entity in the application’s domain model is represented by a model object. Usually it suffices to call the save method on these objects to persist them. Most of the time this works great. Some ORM toolkits go pretty far in optimising the generated queries, too. But I digress.

Django - A web development framework

Django - A web development framework

Django has no manners

All of a sudden, though, Django expects us to write DDL queries. From scratch, pretty much. How rude! It is inevitable that a database schema evolves during the development process, and it probably changes quite often too. We would very much like to avoid having to write queries each time. Many people have tried to find more elegant solutions for this, which gave rise to database migrations.

Migrations are cool

In essence, migrations are versioned scripts that define elementary schema changes, preferably one at a time (e.g. “add the column ‘email_address’ to table ‘user’”). They can be SQL scripts, but very often they are written using an additional abstraction layer (which sits in between SQL and ORM toolkits) that makes them database-agnostic. Preferably, a migration also specifies how to revert the changes it makes. The versioned nature of these scripts makes it possible to quickly bring the database schema up to date, or revert it to an earlier version; in most cases without data loss. They can also be used to recreate the schema elsewhere: in another database, one that is used solely for testing for example, or even on another database engine, if they are database-agnostic.

Ruby on Rails comes with migrations baked in. You don’t have to use them, but you’d be crazy not to. It provides a simple API (or DSL, if you’re one of those guys ;) ) to make schema changes independently of the database engine used. Django doesn’t come with anything of the sort. There are quite a few plugins that take care of this however, most notably django-evolution and South. As I have already mentioned in an earlier post, the past few weeks I’ve been seeing quite a bit of South1.

South - intelligent schema migrations for Django

South - intelligent schema migrations for Django

South is awesome

South has one killer feature that sets it apart from most other similar tools: --auto. Whenever you create a migration with South, it “freezes” all your model definitions into the migration file. It isn’t a pretty sight, but it’s very, very useful; it makes it possible to detect changes to models automatically! This means that in most cases, South can write your migrations for you. All you have to do is review them and correct them if necessary. This holds the middle ground between Django’s radical “I ain’t touchin’ that” approach and automatically performing all changes. By introducing the possibility to review the changes, you get the best of both worlds.

checkdb --fix is evil

Automatically performing all changes is basically what checkdb --fix, the script I implemented, does. Usually, this is a very bad idea. In fact, we are already going to use South, so you would think there is no need for such a thing. However, checkdb is intended to be used as a diagnostic tool, and as a “last resort” option if migrations don’t cut it, for whatever reason. One important use case for checkdb is to verify that the database schema is correct before running any tests. This is, as far as I know, primarily what it will be used for. I used South’s excellent database API to implement it, in order for it to be as database-agnostic as possible.

Unfortunately, I am not at liberty to release the code into the wild (at least I think I’m not). I can, however, discuss a an issue I ran into while writing it. One of the biggest problems I encountered was the lousy introspection support in most Python DB API implementations. Database introspection is pretty essential to get a good view of what the schema looks like.

DB API implementations are lousy

The Python DB API is a specification that many Python database APIs adhere to, in order to provide uniform access. Alas, once you start writing queries this becomes a problem again, because each database engine has its own dialect of SQL. Anyway, what a DB API implementation should look like is specified in PEP 249. For introspection purposes, it defines a cursor.description attribute, which describes the structure of the result columns that were returned by the last query performed with the specified cursor. For each column, seven values are supplied: (name, type_code, display_size, internal_size, precision, scale, null_ok). The values in question are set to None if they are irrelevant for the given column, or if they are not implemented.

The latter seems to happen rather a lot. Sometimes the implementations are even plain wrong, like MySQL’s, which returns an invalid internal_size for VARCHAR columns. SQLite’s only implements the first value (name)! postgresql_psycopg2 is actually the “best” implementation, but it doesn’t implement null_ok, which specifies whether the given column is nullable, and which is really quite important for checkdb. Actually, I haven’t run into any implementations that return a sensible value for null_ok. How hard can it be to make this work? Anyway, this meant I had to find a workaround.

Standards are useful…

While googling, I discovered something called INFORMATION_SCHEMA, which is part of the SQL-92 standard. Basically, it’s a “virtual” database that contains metadata about the databases themselves. This kind of feature is called a database catalog, and most database engines support it. But until recently, each had its own way of structuring this information. MySQL supports INFORMATION_SCHEMA since version 5, PostgreSQL has supported it since version 7.4. There are a couple of others, but these two were the most relevant for my work.

I got kind of lucky, because a fix for MySQL’s erroneous internal_size had already been written. What I had to implement was null_ok. The solution was to run the following query:

SELECT is_nullable FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'tbl_name' AND column_name = 'col_name' AND table_catalog = 'db_name';

In PostgreSQL, this returns YES or NO, which I then converted to a boolean and inserted in to the tuple returned by cursor.description, for the specified column.

…if they are unambiguous

In MySQL however, this does not work. So much for standards. You have to use this instead:

SELECT is_nullable FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'tbl_name' AND column_name = 'col_name' AND table_schema = 'db_name';

Note the subtle difference; in MySQL, the database name is stored in table_schema, whereas it is stored in table_catalog in PostgreSQL. Note that both columns exist for both database engines, but the other always contains a worthless value; table_schema is always set to public in PostgreSQL, for example. There is probably a very good reason for this. Right? :-|

I thought I’d do a write-up of this, because database introspection seems to be neglected very often in Python, both in terms of implementation and documentation. Let me know if it helped you!

On a different note, I fixed a slight cosmetic issue with article titles on this blog. They should no longer overlap with the dates, even if your resolution is lower than 1680×1050 8)

Notes

  1. ↑1 While I’m at it, I should mention that Andrew Godwin, South’s creator, has been really helpful and patient with me for the past few weeks. Thanks dude! Check out his personal site at www.aeracode.org.

2 Comments

  1. Posted October 31, 2009 at 12:59 am | Permalink

    You mention that a fix for MySQL’s erroneous internal_size has been written. Would you mind sharing the details of that information? I am tracking down some details to fix a Django bug that affects introspection and I think that information might be useful. I’m assuming it is a patch to MySQLdb, but is it in a release version, or only in trunk?

    Thanks for any help you can give me.

  2. Sander
    Posted October 31, 2009 at 1:53 am | Permalink

    The fix in question is very similar to the one for is_null that is detailed here: an information schema query is used to “correct” the data. Note that my choice of words may have been a bit inappropriate; the API doesn’t really define how the returned data should be interpreted, so one can’t really argue that it is “wrong” in this case. It is, however, unintuitive and not very useful :)

    The query in question is:

    SELECT character_maximum_length FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_schema = 'db_name' AND table_name = 'tbl_name' AND column_name = 'col_name'

    This should return the value that any reasonable person would expect for internal_size on a VARCHAR field.

    In the project I described, I wrote a wrapper function that calls cursor.description and then applies a number of fixes by running queries on INFORMATION_SCHEMA if necessary (for both MySQL and PostgreSQL). It then returns the “corrected” tuple.

    I don’t know if this is the best way to retrieve this kind of information, but it works, at least :) I hope this helps. If it doesn’t, don’t hesitate to tell me and I’ll see what I can do.

Post a Comment

A subset of HTML is supported. Please use <code> and <pre> for code samples. You can enable syntax highlighting like this: <pre lang="ruby">. This also works for Javascript, Python, and probably most other languages you can think of.

Your email is never published nor shared. Required fields are marked *

*
*