How to tackle database migrations failure during application initialization on production systems?
My ASP.NET Core application got stuck in Production with a 5xx error. By inspecting the logs, I have noticed that a database error occurred during application initialization, namely a SQL Server access error.
This happened because the application restarted and at that very moment, it tried to apply the database migrations (which basically meant just checking
__EFMigrationsHistory since everything was already up to date) the SQL Server was not accessible.
This is a very unlikely event (happened once in about 1 month), but it still bothers me because I do not have access to the production myself and those who do are not paying much attention to the health monitoring alerts.
Trying to fix the problem
One quick step that I can take in order to minimize the likelihood of this failure happening is to add connection resiliency, which basically retries the SQL commands for certain errors (e.g. server unavailable, timeouts).
I have also realized that I have never deeply dived into applying migration best practices, so I did this time. The aforementioned article says that migrations should be applied manually in production (and I doing the exact opposite):
It's possible for the application itself to apply migrations programmatically, typically during startup. While productive for local development and testing of migrations, this approach is inappropriate for managing production databases, for the following reasons:
- If multiple instances of your application are running, both applications could attempt to apply the migration concurrently and fail (or worse, cause data corruption).
- Similarly, if an application is accessing the database while another application migrates it, this can cause severe issues.
- The application must have elevated access to modify the database schema. It's generally good practice to limit the application's database permissions in production.
- It's important to be able to roll back an applied migration in case of an issue. The other strategies provide this easily and out of the box.
- The SQL commands are applied directly by the program, without giving the developer a chance to inspect or modify them. This can be dangerous in a production environment.
The main reason I chose to rely on automatic database migration for all environments is that there two teams managing a web server and database respectively and upgrading the code and the database at the same time never happens.
I have analyzed the reasons provided by Microsoft for manually applying the migrations:
- currently, I have a single instance, so collisions cannot occur.
- elevated rights - I have created a separate login and the migrations use their own connection string
- migrations are applied as transactions. The first failure rolls back and stops. I do not understand how this point is an issue
- no inspection - my current setup involves a pre-production environment restored each night. PreProd = Copied Production + anonymizations. This means I get the latest schema from Production each day. If I can apply migrations on PreProd, I can apply them on Prod. So far, I have never had any issues due to schema when applying migrations.
As I see it, I have to make a choice between the following:
- manually apply migrations for production (more manual work which is something we want to minimize)
- keep the automatic database migrations run, add the resilience and accept that the application restart has a very small change (maybe once in 1-2 months) of getting stuck (but still know a couple of minutes later due to health check monitoring)
- keep the automatic database migrations run, but allow the application to run even if a migration fails (and maybe sent an e-mail to dev team to analyze the context). The only risk here would be that the app might run with an outdated schema (very unlikely)
How should I approach this?