Communities

Writing
Writing
Codidact Meta
Codidact Meta
The Great Outdoors
The Great Outdoors
Photography & Video
Photography & Video
Scientific Speculation
Scientific Speculation
Cooking
Cooking
Electrical Engineering
Electrical Engineering
Judaism
Judaism
Languages & Linguistics
Languages & Linguistics
Software Development
Software Development
Mathematics
Mathematics
Christianity
Christianity
Code Golf
Code Golf
Music
Music
Physics
Physics
Linux Systems
Linux Systems
Power Users
Power Users
Tabletop RPGs
Tabletop RPGs
Community Proposals
Community Proposals
tag:snake search within a tag
answers:0 unanswered questions
user:xxxx search by author id
score:0.5 posts with 0.5+ score
"snake oil" exact phrase
votes:4 posts with 4+ votes
created:<1w created < 1 week ago
post_type:xxxx type of post
Search help
Notifications
Mark all as read See all your notifications »
Q&A

Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Comments on How to tackle database migrations failure during application initialization on production systems?

Post

How to tackle database migrations failure during application initialization on production systems?

+3
−0

Context

My ASP.NET Core application got stuck in Production with a 5xx error. By inspecting the logs, I have noticed that a database error occurred during application initialization, namely a SQL Server access error.

This happened because the application restarted and at that very moment, it tried to apply the database migrations (which basically meant just checking __EFMigrationsHistory since everything was already up to date) the SQL Server was not accessible.

This is a very unlikely event (happened once in about 1 month), but it still bothers me because I do not have access to the production myself and those who do are not paying much attention to the health monitoring alerts.

Trying to fix the problem

One quick step that I can take in order to minimize the likelihood of this failure happening is to add connection resiliency, which basically retries the SQL commands for certain errors (e.g. server unavailable, timeouts).

I have also realized that I have never deeply dived into applying migration best practices, so I did this time. The aforementioned article says that migrations should be applied manually in production (and I doing the exact opposite):

It's possible for the application itself to apply migrations programmatically, typically during startup. While productive for local development and testing of migrations, this approach is inappropriate for managing production databases, for the following reasons:

  • If multiple instances of your application are running, both applications could attempt to apply the migration concurrently and fail (or worse, cause data corruption).
  • Similarly, if an application is accessing the database while another application migrates it, this can cause severe issues.
  • The application must have elevated access to modify the database schema. It's generally good practice to limit the application's database permissions in production.
  • It's important to be able to roll back an applied migration in case of an issue. The other strategies provide this easily and out of the box.
  • The SQL commands are applied directly by the program, without giving the developer a chance to inspect or modify them. This can be dangerous in a production environment.

The main reason I chose to rely on automatic database migration for all environments is that there two teams managing a web server and database respectively and upgrading the code and the database at the same time never happens.

I have analyzed the reasons provided by Microsoft for manually applying the migrations:

  • currently, I have a single instance, so collisions cannot occur.
  • elevated rights - I have created a separate login and the migrations use their own connection string
  • migrations are applied as transactions. The first failure rolls back and stops. I do not understand how this point is an issue
  • no inspection - my current setup involves a pre-production environment restored each night. PreProd = Copied Production + anonymizations. This means I get the latest schema from Production each day. If I can apply migrations on PreProd, I can apply them on Prod. So far, I have never had any issues due to schema when applying migrations.

As I see it, I have to make a choice between the following:

  • manually apply migrations for production (more manual work which is something we want to minimize)
  • keep the automatic database migrations run, add the resilience and accept that the application restart has a very small change (maybe once in 1-2 months) of getting stuck (but still know a couple of minutes later due to health check monitoring)
  • keep the automatic database migrations run, but allow the application to run even if a migration fails (and maybe sent an e-mail to dev team to analyze the context). The only risk here would be that the app might run with an outdated schema (very unlikely)

How should I approach this?

History
Why does this post require attention from curators or moderators?
You might want to add some details to your flag.
Why should this post be closed?

1 comment thread

General comments (2 comments)
General comments
Derek Elkins‭ wrote over 3 years ago

The issues that might prompt rollback aren't just "the migration failed to complete successfully". Commonly, after deploying (especially database migrations), it's common to do at least some kind of smoke testing to make sure the production environment is behaving as expected. If these tests fail, rolling back is usually the easiest and safest option. For example, incorrect paths/credentials accidentally being hardcoded might be valid for PreProd but not Prod and completely bring down Prod.

Alexei‭ wrote over 3 years ago

@DerekElkins Yes, this can also happen and a smoke is complementary to the deployment using migrations. Also, production has a health monitoring in place which also checks that SELECTs can be performed against the database.