Jul 26, 2024
by Jacek Migdal
Have you received a notification that your bank's online services will be down over the weekend? This is likely because some engineers will stay overnight to run a database schema migration. If they're lucky, it’s just a major productivity slowdown. Still, they must schedule changes months in advance and sacrifice their weekend.
However, it might turn into a catastrophe if they encounter issues while making data schema changes. For example, a major Italian bank was down for a week. This issue isn’t limited to large, established corporations. Linear, a savvy startup, had an outage and lost some data. Even GitHub, which has one of the best tooling around MySQL, went through an interruption of service.
Those blackouts have burned many seasoned engineers and DevOps professionals. Most of the time, they are an underreported nightmare. They instill a fear of improving or changing things, which creates chaos, inefficiencies, and performance issues.
The problem is so painful that engineers are willing to change platforms to avoid it. MongoDB won a huge, loyal fan base in the transactional space because it eliminated the need for a schema. In the analytical space, engineers at Netflix created Apache Iceberg to overcome Hive's poor schema evolution story, where, for example, adding a column could bring zombie data into your table.
What is schema migration?
SQL databases enforce the structure of your data: tables, column names, and types. So, when you add a new record such as a purchase_order, the database ensures it references an existing product_id and buyer_id and that all required fields are present and correctly typed.
If you need to make changes (e.g., add a tax code, an AI training consent flag, or an embedding column for LLM RAG), you must change your schema by writing DDL (Data Definition Language) SQL. Applying that change, the schema migration, might rearrange existing data and change how queries run.
[Figures: the schema before and after the migration]
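To make this concrete, here is a minimal sketch of such a DDL change, using an in-memory SQLite database; the table and column names (purchase_orders, tax_code, ai_training_consent) are hypothetical examples, not a real production schema:

```python
import sqlite3

# In-memory SQLite stand-in for a production database; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchase_orders (
        id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        buyer_id INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL
    )
""")
conn.execute("INSERT INTO purchase_orders VALUES (1, 42, 7, 1999)")

# The "migration": DDL statements that extend the schema with new fields.
conn.execute("ALTER TABLE purchase_orders ADD COLUMN tax_code TEXT")
conn.execute(
    "ALTER TABLE purchase_orders ADD COLUMN ai_training_consent INTEGER DEFAULT 0"
)

# Existing rows now carry the new columns with NULL / default values.
print(conn.execute("SELECT * FROM purchase_orders").fetchall())
```

On a small toy table this completes instantly; on a production table with millions of rows and live traffic, the same statement can lock, rewrite, or reorganize data, which is where the trouble described below begins.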
Typical migration problems
1. Hard to test outside of production
The production database usually contains significantly more data and a greater variety of data than any other environment. It’s where you encounter rare corner cases (e.g., exotic long names, forgotten workarounds) that need to be reflected in your tests.
2. Difficulty in making changes
The database sometimes becomes fragile, making it very hard to add any new features, even if necessary. This often leads to over-engineering, such as adding new microservices when simply adding a few columns could suffice.
Additionally, you still need to JOIN data across microservices, but now you have to implement endpoints yourself and perform the JOIN at the application layer—resulting in a lot of spaghetti code with little business value.
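To make that concrete, here is a sketch of what the application-layer JOIN tends to look like; the service endpoints and field names are invented for illustration:

```python
import requests  # third-party HTTP client; the endpoints below are hypothetical


def orders_with_buyers(order_service_url: str, user_service_url: str) -> list[dict]:
    """What used to be a single SQL JOIN, re-implemented in application code."""
    orders = requests.get(f"{order_service_url}/orders").json()
    buyer_ids = sorted({o["buyer_id"] for o in orders})

    # An extra endpoint someone has to build and maintain, just to re-join data.
    buyers = requests.get(
        f"{user_service_url}/users",
        params={"ids": ",".join(str(i) for i in buyer_ids)},
    ).json()
    buyers_by_id = {b["id"]: b for b in buyers}

    return [{**o, "buyer": buyers_by_id.get(o["buyer_id"])} for o in orders]
```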
3. Impact on multiple areas
Schema migration affects the database and all its integrations, such as the ETL job moving data to the warehouse and client applications. All of these places need to be compatible with both versions. If they are not, something will break, and you’ll discover it the hard way. Databases are frequent integration points with undocumented use cases.
4. Lengthy process
A migration might take hours on production datasets. That leaves plenty of time for accidents to occur midway, and taking corrective action is like steering an oil tanker: there is a lot of inertia.
5. No easy revert
In many schema migration frameworks, engineers manually write up() and down() methods. The revert path is rarely tested and may cause even more damage. Restoring from backups is also rarely tested; the backup may be missing data, and a restore can take a long time.
For example, when a dev accidentally deleted a database at GitLab, they discovered that none of their backups actually worked.
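A framework-agnostic sketch of this up()/down() pattern follows; the class and method names are illustrative, not any particular framework's API:

```python
import sqlite3


class AddTaxCodeMigration:
    """Illustrative up()/down() pair; down() is the path that is rarely
    exercised until it is needed in anger."""

    def up(self, conn: sqlite3.Connection) -> None:
        conn.execute("ALTER TABLE purchase_orders ADD COLUMN tax_code TEXT")

    def down(self, conn: sqlite3.Connection) -> None:
        # SQLite only supports DROP COLUMN from version 3.35 onwards; on an
        # older engine this revert fails, which is how untested down() paths bite.
        conn.execute("ALTER TABLE purchase_orders DROP COLUMN tax_code")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY)")
migration = AddTaxCodeMigration()
migration.up(conn)
migration.down(conn)  # the revert you hope you never have to trust
```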
6. Workarounds due to lack of migrations
When changing the schema is difficult, engineers find workarounds. For instance, the “description” field might be reused to contain serialized JSON with settings. This creates technical debt that causes problems down the line.
As a result, data engineers spend most of their time gathering and cleaning data. A consistent “customer_uuid” column in UUID v7 format across systems would make their lives easy, but instead they spend a lot of time fixing mistakes and inconsistencies rather than deriving insights.
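Here is a sketch of that workaround in action; the table, column, and setting names are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, description TEXT)")

# The workaround: settings that deserve their own columns get serialized into
# a free-text field because changing the schema feels too risky.
settings = {"newsletter": True, "ai_training_consent": False}
conn.execute(
    "INSERT INTO customers (id, description) VALUES (?, ?)",
    (1, json.dumps(settings)),
)

# Every consumer now has to know about, parse, and validate this convention.
row = conn.execute("SELECT description FROM customers WHERE id = 1").fetchone()
print(json.loads(row[0])["ai_training_consent"])  # False
```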
Typical risk mitigation tactics
1. Plan and document
Start with a clear plan. Outline the current state of the database, the desired end state, and the steps to get there. Documentation helps everyone stay on the same page and makes troubleshooting easier.
2. Manage schema as code and use tools
Tools like Liquibase and Bytebase support both state-based and migration-based approaches. They offer automated processes, version control, and safety checks.
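Underneath, most of these tools share the same idea: migrations are versioned artifacts in your repository, applied in order and recorded in a bookkeeping table. A minimal, tool-agnostic sketch of that idea (not Liquibase's or Bytebase's actual mechanics):

```python
import sqlite3

# Version-controlled list of migrations; in a real setup each entry lives as a
# file in your repository and goes through code review like any other change.
MIGRATIONS = [
    ("0001_create_orders", "CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY)"),
    ("0002_add_tax_code", "ALTER TABLE purchase_orders ADD COLUMN tax_code TEXT"),
]


def migrate(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, ddl in MIGRATIONS:
        if name not in applied:
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
    conn.commit()


conn = sqlite3.connect(":memory:")
migrate(conn)  # applies both migrations
migrate(conn)  # idempotent: already-applied migrations are skipped
```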
3. Test extensively
Test in environments that closely mimic production. Ideally, also test rollback. Automation can help, but also pay attention to performance metrics during the migration (e.g., query response times).
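For instance, here is a small sketch of measuring a representative query's latency before and after a change on a production-like copy; the SQLite stand-in, synthetic data, index, and query are all illustrative:

```python
import sqlite3
import time


def median_query_ms(conn: sqlite3.Connection, sql: str, runs: int = 50) -> float:
    """Median wall-clock time of a query, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]


# Production-like copy; here just an in-memory stand-in with synthetic rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, buyer_id INTEGER)")
conn.executemany(
    "INSERT INTO purchase_orders (buyer_id) VALUES (?)",
    [(i % 100,) for i in range(10_000)],
)

query = "SELECT COUNT(*) FROM purchase_orders WHERE buyer_id = 7"
before = median_query_ms(conn, query)

# The change under test: adding an index as part of the migration.
conn.execute("CREATE INDEX idx_orders_buyer ON purchase_orders (buyer_id)")
after = median_query_ms(conn, query)

print(f"median latency: {before:.3f} ms before, {after:.3f} ms after")
```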
4. Test backup recovery
In the worst case, you may need to restore your database from a backup. Run this procedure before the migration to verify that your backups actually work. Take another backup shortly before the migration.
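As one possible shape for such a drill on PostgreSQL, using the standard pg_dump, createdb, pg_restore, and psql tools; the database names and the verification query are placeholders for your own environment:

```python
import subprocess

SOURCE_DB = "production_db"   # placeholder
SCRATCH_DB = "restore_drill"  # placeholder

# Dump the source database and restore it into a scratch database.
subprocess.run(["pg_dump", "--format=custom", "--file=backup.dump", SOURCE_DB], check=True)
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--dbname=" + SCRATCH_DB, "backup.dump"], check=True)

# Sanity check: the restored copy should contain roughly what you expect.
result = subprocess.run(
    ["psql", "--dbname=" + SCRATCH_DB, "--tuples-only",
     "--command=SELECT COUNT(*) FROM purchase_orders"],
    capture_output=True, text=True, check=True,
)
print("rows restored:", result.stdout.strip())
```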
5. Perform migrations during off-peak hours
Run them at night, when there is little traffic and you can afford downtime.
6. Review and approve changes
A seasoned engineer with database experience must review all schema migrations. For some changes, a change management board with a detailed plan documenting the steps and a few sign-offs may be required.
7. Monitor and optimize performance
Use monitoring tools to track the impact of schema changes. Even after a long-awaited change, be proactive about performance, as issues may appear days after the migration. Have alerting set up on latencies (p95, p99).
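As a sketch of the alerting side, here is a nearest-rank percentile check over collected latency samples; the thresholds and sample values are invented, and a real setup would page someone through your alerting system instead of printing:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (in ms)."""
    ordered = sorted(samples)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]


def check_latency(samples: list[float], p95_limit_ms: float, p99_limit_ms: float) -> None:
    p95, p99 = percentile(samples, 95), percentile(samples, 99)
    if p95 > p95_limit_ms or p99 > p99_limit_ms:
        # In a real setup this would trigger your alerting system.
        print(f"ALERT: p95={p95:.1f} ms, p99={p99:.1f} ms exceed thresholds")
    else:
        print(f"OK: p95={p95:.1f} ms, p99={p99:.1f} ms")


# Example: post-migration latency samples (ms); thresholds are invented.
check_latency([12.0, 15.5, 14.2, 80.0, 13.1, 250.0, 11.9, 14.8], 50.0, 200.0)
```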
Though these procedures help, they have side effects: they make changes very expensive and slow. A feature that might take a week can take months to get the schema right. This encourages bundling many changes at once, which makes troubleshooting harder. Off-peak hours usually also mean fewer people are available, and they are less efficient at troubleshooting in the middle of the night.
State-of-the-art mitigation tactics
PlanetScale wrote an article comparing Online Schema Change tools. Similar tools exist for PostgreSQL, such as Reshape.
They all rely on creating shadow tables/instances where you perform the actual change and later promote it to primary. This is an excellent step in the right direction.
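A deliberately simplified sketch of the shadow-table pattern these tools build on; real tools also replicate writes that arrive during the backfill and handle the cutover far more carefully, and the table and column names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents TEXT)")
conn.execute("INSERT INTO orders VALUES (1, '1999')")

# 1. Create a shadow table with the desired schema (amount_cents as INTEGER).
conn.execute("CREATE TABLE orders_shadow (id INTEGER PRIMARY KEY, amount_cents INTEGER)")

# 2. Backfill from the live table (real tools also keep it in sync with
#    writes that arrive while the backfill is running).
conn.execute(
    "INSERT INTO orders_shadow (id, amount_cents) "
    "SELECT id, CAST(amount_cents AS INTEGER) FROM orders"
)

# 3. Swap names so the shadow table becomes the primary one.
conn.execute("ALTER TABLE orders RENAME TO orders_old")
conn.execute("ALTER TABLE orders_shadow RENAME TO orders")

print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 1999)]
```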
Why is even state-of-the-art not good enough, and how can we address it?
Almost all are hard to use and still have rough edges (e.g., reported failures). Most were designed for the most prominent companies with relatively clean environments. Many can be used only as part of a larger platform (e.g., Vitess), which requires many changes to your database setup before you can start using it (e.g., Figma wrote its own proxy to avoid that).
The most problematic part is that nothing in the query path controls the query and can rewrite it, so even if you execute your change perfectly in the database, some old clients might still send queries written against the old schema. So, inherently, these tools have many limitations:
Renaming columns and tables (e.g., GoCardless) is impossible.
Many other changes are impossible, and it’s hard to determine which ones are possible.
Even if you spot a potentially easy-to-fix problem, there is nothing you can do about it without upgrading the client, which is another team's responsibility.
Quesma schema migration vision
Changing the schema on the production database, under live traffic, should be easy and involve zero downtime. You should be able to preserve backward compatibility for all clients, even if they have hardcoded assumptions about the schema and can't easily be upgraded.
Just as blue/green deployments are common on the application layer (e.g., Kubernetes, Microservices, serverless), a similar approach should be used with databases.
The best way is to put a proxy in front of your database. Do not apply changes directly to the existing production database. Instead, apply changes to a replica, allowing you to shadow test them with production traffic and estimate the impact on correctness and performance. You can translate in-flight queries on the proxy to match the new schema. Once you verify that the new change is perfect, you promote the replica to the primary instance.
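To illustrate the query-translation part, here is a toy sketch in which a proxy maps an old column name to its renamed counterpart; a real proxy would parse SQL properly rather than use a regex, and all names here are hypothetical, not Quesma's actual implementation:

```python
import re
import sqlite3

# Replica already has the new schema: the "buyer" column was renamed to "buyer_id".
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, buyer_id INTEGER)")
replica.execute("INSERT INTO orders VALUES (1, 7)")

# Mapping the proxy maintains while old clients are still around.
COLUMN_RENAMES = {"buyer": "buyer_id"}


def rewrite(query: str) -> str:
    """Translate identifiers from the old schema to the new one.

    A real proxy parses SQL properly; a word-boundary regex is enough to
    illustrate the idea for this one query shape.
    """
    for old, new in COLUMN_RENAMES.items():
        query = re.sub(rf"\b{old}\b", new, query)
    return query


# An old client still queries the pre-migration column name and keeps working.
old_client_query = "SELECT buyer FROM orders WHERE id = 1"
print(replica.execute(rewrite(old_client_query)).fetchall())  # [(7,)]
```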
Advantages of this approach:
Live experiments: You can run live experiments on your production data and traffic. You can tweak indexes, types, constraints, settings, etc. These experiments can be short or long-running without risking downtime.
Live-testing changes: You can live-test and verify changes on production before making them final. This avoids most mistakes and allows you to revert in seconds, even after the change has gone live.
Single-step changes: You can make changes in one step, such as renaming a column or changing its type. Traditionally, this would require upgrading all clients simultaneously or keeping duplicate columns for a while.
Larger changes: Significant changes are also possible, such as merging or denormalizing two tables.
To learn more, reach out on social media or fill out our contact form.