Monday, February 21, 2011

Choosing An Embedded Database Upgrade Path

An embedded database is one which is tightly integrated with a software application, with new versions of the database often being delivered on the same release cycle as the application. Unlike centralized corporate applications with databases managed by database administrators at head office, applications with embedded databases are often delivered to multiple external sites and are run without any assistance from head office.

The "without any assistance" doesn't just apply to the initial installation and regular operations, it also applies the process of upgrading an application and its embedded database from one version to another. With a corporate database back at head office, changes can be tested, and applied step-by-step under watchful eyes, and even backed out if something goes wrong. Out in the field, however, it's all got to be automated, often as part of an InstallShield setup.exe, and it all has to work no matter what... nobody at head office has the time or inclination to deal with a thousand different installations.

So, let's consider an existing application with a thousand different remote sites running various versions 1 through n, with each version having a different schema in its embedded database.

Now, version n + 1 is being constructed with more schema changes...

How should the automated database upgrade be designed?


Door Number 1: Ship code that dynamically alters the schema and data in-place, in the existing database, from whatever old version is currently being used to the new version n + 1 schema.

Door Number 2: Leave the existing database alone, install a new empty database with the new schema, and dynamically alter the data as it is being copied from the old database so it fits the schema in the new database.

Door Number 1: The Alter Schema Approach


One advantage of this approach is that it's often faster to execute; for example, if a large table has the same schema in both the old and new versions then nothing needs to be done. In the other approach, all the data must be copied all the time, even where the schema hasn't changed.

Another advantage is that the database space has already been allocated, although that turns into a disadvantage if the alter-in-place process causes rows to split or other bad things to happen.

Both approaches require at least n blocks of code to handle all the possible transformations. With the Alter Schema Approach the schema and data are most likely transformed from version 1 to 2, 2 to 3, ... n to n + 1. With each new version, a new block must be added but the others probably remain unchanged. The disadvantage here is that passing the data through n transformations may introduce huge inefficiencies, resulting in a lot of funky custom code being written for performance.
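
In a SQL script, those blocks might look something like this; a minimal sketch assuming Watcom-SQL batch syntax and a hypothetical one-row schema_version table, with every table and column name made up:

   -- Upgrade version 1 to version 2.
   IF ( SELECT version FROM schema_version ) = 1 THEN
      ALTER TABLE customer ADD middle_name VARCHAR ( 50 );
      UPDATE schema_version SET version = 2;
   END IF;

   -- Upgrade version 2 to version 3.
   IF ( SELECT version FROM schema_version ) = 2 THEN
      ALTER TABLE customer ADD loyalty_points INTEGER DEFAULT 0;
      UPDATE schema_version SET version = 3;
   END IF;

   -- ...one new block per release; an old database falls through
   -- every block it needs, one version at a time.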

Door Number 2: The Copy Data Approach


One advantage of this approach is that it's often easier to code an export-transform-load process than an alter-in-place statement, especially if there are large differences between the old and new schema.
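
For example, a single copy-and-transform step can be little more than an INSERT ... SELECT. Here's a minimal sketch, assuming SQL Anywhere-style remote data access to reach the old database; every name is hypothetical:

   -- Hypothetical proxy table reaching into the old database:
   CREATE SERVER old_db CLASS 'SAODBC' USING 'DSN=old_database';
   CREATE EXISTING TABLE old_customer AT 'old_db...customer';

   -- Copy and transform in one statement; middle_name is new in
   -- version n + 1, so rows from the old database get a default.
   INSERT INTO customer ( id, first_name, middle_name, last_name )
   SELECT id,
          first_name,
          '',           -- column didn't exist in the old schema
          last_name
     FROM old_customer;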

Backing out the upgrade is also easier with this approach: just throw away the new database and go back to using the old one; it hasn't been touched.

The Copy Data Approach uses n blocks of code to transform the data from version 1 to n + 1, 2 to n + 1, ... n to n + 1. With each new version, each block of code must be examined and possibly changed, and a new block added. The disadvantage is all the extra work over time, checking and modifying all the transformations since the target has changed for each one. The advantages are that the resulting code tends to be brain-dead simple, and testing is easier because the data only goes through one transformation (and the schema, none at all). Plus, it can be argued that (in the real world) all of the transformation logic in the Alter Schema Approach must be checked and possibly modified with each new version because of those creeping inefficiencies.
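
Structurally, the top level can be a simple dispatch on the old version number, one block per old version, each going straight to the n + 1 schema. Another sketch with made-up names, assuming the old database's version is visible through a proxy table:

   CREATE VARIABLE old_version INTEGER;
   SET old_version = ( SELECT version FROM old_schema_version );

   -- Each copy_from_version_* is a hypothetical stored procedure
   -- full of INSERT ... SELECT statements like the one above.
   IF old_version = 1 THEN
      CALL copy_from_version_1();   -- transforms 1 straight to n + 1
   ELSEIF old_version = 2 THEN
      CALL copy_from_version_2();   -- transforms 2 straight to n + 1
   ELSEIF old_version = 3 THEN
      CALL copy_from_version_3();   -- transforms 3 straight to n + 1
   END IF;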

The biggest advantage of this approach, however, is that the schema used by the application is exactly the one that was tested and delivered, not some altered version of an earlier schema. For example, ALTER TABLE ADD column statements put the new columns at the end of the table, which isn't a good idea for short, heavily updated columns if earlier columns contain long strings.
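
For instance (hypothetical names again), compare the altered row layout with a fresh one:

   -- Altered in place: the short, hot column lands at the physical
   -- end of the row, after the long strings.
   ALTER TABLE customer ADD status CHAR ( 1 );

   -- Fresh n + 1 schema: the short, hot column sits ahead of them.
   CREATE TABLE customer (
      id         INTEGER PRIMARY KEY,
      status     CHAR ( 1 ),       -- short and heavily updated
      long_notes LONG VARCHAR );   -- long strings last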

That advantage extends into testing: If the target schema remains unchanged regardless of the old version, only the data upgrade process needs to be tested against each old version, not the rest of the application.

On the other hand, if the target schema can be one of n altered versions plus version n + 1 for a fresh installation, the application must be tested against databases that were upgraded from each old version.

Still, It's Hard To Choose


Door Number 2, the Copy Data Approach, was chosen for Foxhound for this main reason: the new schema is delivered and installed as-is, just like the application code; it isn't patched or altered.

So far it's worked out well. Foxhound has reached n = 7, and while there have been some performance problems, the code is simple and virtually bug-free... nothing like the wailing and gnashing of teeth I've heard over the years from clients who use the Alter Schema Approach.

Maybe that's the real reason: copying data is easier.
