When are stored procedures preferred over application-layer code?
A person I used to work with several years ago was hired to rewrite a product using a modern .NET-based tech stack. One of the things that struck me was his belief that the product should rely mostly on stored procedures and less on the ORM (e.g. Entity Framework). I have also heard this argument from a couple of formerly technical managers where I work.
The main argument is performance, and I think this concern stems from Entity Framework's historically inefficient handling of batch operations, which has mostly been addressed in the latest EF (.NET Core) versions.
This SO answer provides a lot of insight into why using stored procedures can be bad: they are harder to maintain (debugging, configuration, deployment, etc.), violate DRY, are rigid, and are harder to test.
Performance-wise, in my experience, correctly implemented applications should almost never need to rely on stored procedures, since you almost never serve a lot of data to your clients (i.e. you use pagination for lists, grids, and API results).
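For illustration, a minimal EF Core pagination sketch (the `AppDbContext`/`Friends` model here is hypothetical; any model works the same way):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class FriendPages
{
    // Offset pagination: a stable ordering plus Skip/Take keeps every response
    // small, no matter how large the underlying table grows.
    public static Task<List<string>> GetPageAsync(AppDbContext db, int pageIndex, int pageSize) =>
        db.Friends
          .OrderBy(f => f.Id)           // a stable order is required for paging
          .Skip(pageIndex * pageSize)
          .Take(pageSize)
          .Select(f => f.Name)
          .ToListAsync();
}
```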
The only valid case I saw for a stored procedure was heavy computation (i.e. involving a high volume of data) inside the database, where implementing that logic in the application layer would incur DB-to-web round trips that could not possibly be justified.
I am wondering if there are any other cases when one should consider using stored procedures.
2 answers
There are a few reasons for wanting to move computation closer to data. One is performance, which you've mentioned.

Another is security. Databases enforce their own security boundary, and data that never leave the database can't be exfiltrated if your application layer is exploited but the database itself remains accessible only via the application-layer credentials.

A third reason is that it may increase the modularity of your overall system. If your data are being replicated in a complex way, and your application servers are also running multiple versions of your application at once, it may be desirable to standardize on an interface between the database and the application layer that is more sophisticated than 'access all the tables directly'. That interface can be implemented with views and stored procedures, and those entities are then versioned and replicated along with the data, independently of how your application layer is versioned and deployed. Data-layer restores from a replica can then happen after a bad schema change without disrupting the application layer (because, of course, you would roll out the schema change before rolling out the new application-layer components that require any newly exposed interface).

I don't expect this list is exhaustive.
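To make that interface idea concrete, here is a minimal sketch from the application side, assuming a hypothetical published `api` schema of views and stored procedures (none of these object names come from a real system):

```csharp
using System;
using Microsoft.Data.SqlClient;

class Demo
{
    static void Main()
    {
        var connectionString = "Server=.;Database=App;Integrated Security=true;TrustServerCertificate=true";

        using var conn = new SqlConnection(connectionString);
        conn.Open();

        // Mutations go through a versioned stored procedure in the published interface...
        using (var cmd = new SqlCommand("api.DeactivateUser", conn))
        {
            cmd.CommandType = System.Data.CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@UserId", 42);
            cmd.ExecuteNonQuery();
        }

        // ...and reads go through a view, so base tables can be reshaped or
        // re-replicated without the application noticing, as long as the
        // api.* contract is preserved.
        using (var cmd = new SqlCommand("SELECT Name FROM api.ActiveUsers", conn))
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                Console.WriteLine(reader.GetString(0));
    }
}
```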
Much like performance, how important these other reasons are to you will vary a lot based on context. You will always be trading complexity off against capability, and there won't be one right answer for all projects.
But in the abstract, I consider moving computation closer to the data it uses to be generally a good idea—the negatives of using stored procedures are IMO all due to the relatively poor overall developer experience of deploying code to database servers versus deploying code to application servers, and that's just a point-in-time state-of-the-art thing (albeit one that has lasted for a rather long time so far).
0 comment threads
(This answer became more about ORM vs. "direct" SQL. If you're very narrowly focused on just stored procedures, then it's not super important to me that the logic be packaged up in stored procedures. That said, if you aren't using something like an ORM to generate the queries, it's not clear to me what benefit you get over stored procedures. As I elaborate below, many of the engineering arguments against stored procedures are out of date and involve a bit of a double standard.)
(I don't know if this goes without saying, so I'll say it. The "stored procedure approach" is absolutely not to stick the bulk of your application logic into stored procedures. Rather, it is to use stored procedures [and views, user-defined functions, etc.] to make a reasonably generic [though still likely application-specific[1]] abstraction layer upon which your application builds.)
Personally, I think ORMs are a bad idea. They add a rather complicated and opaque layer on top of an already complicated system. I also find that they strongly encourage poor perspectives on data manipulation. For example, they tend to encourage an object-at-a-time processing perspective that's both inefficient and out of step with the underlying data store, and they encourage viewing these objects as mutable. ORMs also generally don't expose much of the powerful querying functionality of relational databases, e.g. recursive queries and window functions. Finally, I find that they encourage a lax attitude toward transaction scoping, easily leading to transactions that are larger than they need to be in some places and smaller than they need to be in others.
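As a sketch of what I mean about querying functionality: a window function like `RANK()` has no LINQ equivalent, so with EF Core you end up escaping to raw SQL anyway (the `FriendRank` keyless entity and `dbo.UserFriendCounts` view here are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

// FriendRank would be configured as a keyless entity:
//   modelBuilder.Entity<FriendRank>().HasNoKey();
public class FriendRank
{
    public int UserId { get; set; }
    public string Name { get; set; } = "";
    public long Rank { get; set; }
}

public static class Rankings
{
    public static List<FriendRank> MostConnected(AppDbContext db) =>
        db.Set<FriendRank>()
          .FromSqlRaw(@"
              SELECT UserId, Name,
                     RANK() OVER (ORDER BY FriendCount DESC) AS Rank
              FROM dbo.UserFriendCounts")
          .ToList();
}
```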
My preference is to use approaches that treat the data from the database as immutable and "plain old data", and that take an accumulate-only approach to updates, e.g. event sourcing patterns or temporal database patterns. I also like to have the granularity of interacting with the database be at natural transactional boundaries[2]. Ideally, the application code would not have to deal with transactions at all.
Indeed, the application code should not know that it's talking to a database at all[3]. It should be structured as if it is (conceptually if not actually) talking to some data access (micro)service. At that point, ORM or stored procedures or something else becomes an implementation detail of that service.
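A rough sketch of that shape, with entirely hypothetical names. The application programs against the interface; whether it's backed by EF, stored procedures, or a remote service is invisible to it:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// (FriendSummary would be mapped as a keyless entity type for query results.)
public sealed record FriendSummary(int UserId, string Name, int FriendCount);

// The application sees only this; it neither knows nor cares what's behind it.
public interface IFriendDirectory
{
    Task<IReadOnlyList<FriendSummary>> GetFriendsPageAsync(int userId, int page, int pageSize);
    Task BefriendAsync(int userId, int otherUserId);
}

// One possible implementation happens to delegate to stored procedures.
public sealed class SqlFriendDirectory : IFriendDirectory
{
    private readonly AppDbContext _db;
    public SqlFriendDirectory(AppDbContext db) => _db = db;

    public async Task<IReadOnlyList<FriendSummary>> GetFriendsPageAsync(int userId, int page, int pageSize) =>
        await _db.Set<FriendSummary>()
            .FromSqlInterpolated($"EXEC dbo.GetFriendsPage {userId}, {page}, {pageSize}")
            .ToListAsync();

    public Task BefriendAsync(int userId, int otherUserId) =>
        _db.Database.ExecuteSqlInterpolatedAsync($"EXEC dbo.Befriend {userId}, {otherUserId}");
}
```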
The fact that you felt the need for the qualifiers "latest" and "mostly" on the inefficiency of using Entity Framework (EF), which is over a decade old and made by one of the largest software companies in existence, speaks to your former colleagues' point. Historically, EF did okay right up until it did catastrophically poorly, e.g. query avalanches, at which point you had little to no recourse. You often just had to avoid doing whatever it was that caused the problem, which may well be non-obvious, and that may well mean rewriting that request directly in SQL anyway. When things go wrong with EF, or go wrong with the queries EF produces (for reasons that may not be EF's fault, e.g. missing indexes), EF makes it far harder to see and understand what's going on.
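To make "query avalanche" concrete, a minimal sketch with hypothetical Blog/Post entities and lazy-loading proxies enabled. The first version silently issues one query per blog; the second translates to a single join:

```csharp
using System.Linq;

static class AvalancheDemo
{
    // db is a hypothetical DbContext with Blogs/Posts navigation properties.
    public static void FetchPostCounts(AppDbContext db)
    {
        // Materializing first and then touching a navigation property issues
        // 1 query for the blogs plus one more per blog: the N+1 avalanche.
        var slow = db.Blogs
            .ToList()                                                // query 1
            .Select(b => new { b.Name, PostCount = b.Posts.Count })  // queries 2..N+1
            .ToList();

        // Keeping everything in the IQueryable lets EF translate the whole
        // expression into a single grouped join on the server instead.
        var fast = db.Blogs
            .Select(b => new { b.Name, PostCount = b.Posts.Count() })
            .ToList();
    }
}
```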
To actually talk about stored procedures: while many of the arguments against them are legitimate, most are not inherent problems of stored procedures, and they can be and have been largely mitigated. Even in 2015, I would say issues around version control, deployment, and testing[4] were all adequately solved in a Microsoft-oriented environment, i.e. Visual Studio and Sql Server. To be clear, I'm not saying there are no issues here, just that they are on par with the issues with code. T-SQL (and others like PL/SQL) are definitely worse programming languages than C# (or other .NET languages). I find this mostly doesn't matter, as the bulk of the complexity of stored procedures (if they are complex at all) is in the SQL, not the T-SQL. Nevertheless, Sql Server nowadays supports writing stored procedures in .NET languages. This feature also eliminates some DRY concerns, which frankly can cut either way: if you're going to do background processing with a stored procedure, then DRY might argue for putting other logic in stored procedures too, so it can share logic with that background processing. Finally, the arguments about e.g. testability struck me as a bit of a double standard, as the testability of database-accessing code is not free.[5]
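For concreteness, here's roughly what a C# (SQLCLR) stored procedure looks like; this is a sketch only, and the `CREATE ASSEMBLY`/`CREATE PROCEDURE ... AS EXTERNAL NAME` registration DDL is omitted:

```csharp
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public static class ClrProcedures
{
    // After the compiled assembly is registered with CREATE ASSEMBLY, this
    // method is exposed via CREATE PROCEDURE and called like any other sproc.
    [SqlProcedure]
    public static void HelloFriend(SqlString name)
    {
        // SqlContext.Pipe sends messages/results back to the calling session.
        SqlContext.Pipe.Send("Hello, " + name.Value + "!");
    }
}
```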
Going forward, I'm going to use "stored procedure" for something broader than a literal `CREATE PROCEDURE`, so it will include things like views and dynamic SQL, and also things like user-defined function mappings in EF. I have no problem with using EF as a quick way to make data transfer objects and talk to the database. When I say "ORM", I mean viewing the database as an object graph, especially a mutable one. "Stored procedure" will refer to working at the relational level, i.e. typically in a manner where you're explicitly writing SQL.
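As an example of the EF end of that spectrum, a user-defined function mapping lets LINQ queries call a database function (the `dbo.FriendCount` function here is hypothetical); the C# body is never executed, it only anchors the translation:

```csharp
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class User
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class AppDbContext : DbContext
{
    public DbSet<User> Users => Set<User>();

    // Never runs in .NET; EF translates calls into dbo.FriendCount(...) in SQL.
    public static int FriendCount(int userId)
        => throw new NotSupportedException("Only translatable inside a query.");

    protected override void OnModelCreating(ModelBuilder modelBuilder) =>
        modelBuilder
            .HasDbFunction(typeof(AppDbContext).GetMethod(nameof(FriendCount))!)
            .HasName("FriendCount")
            .HasSchema("dbo");

    // Usage: db.Users.Select(u => new { u.Name, Count = FriendCount(u.Id) });
}
```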
Your argument about performance is a bit of a non sequitur. A small result set can easily require a large amount of work to produce, and a simple query returning a large result set is likely to be I/O-bound, so it won't matter what you use. Nevertheless, if your 50-item pages take 10 times longer because they've query-avalanched into 51 queries and/or they're doing poorly indexed joins across five tables, then, even if the performance of an individual query is still deemed acceptable, that's still an order of magnitude fewer requests each server can handle. More common but insidious issues are contention and isolation levels, and these aren't just performance concerns. It is very easy to do things with an ORM that you wouldn't do, or that would at least raise flags, if you were manually writing a SQL query: doing a lot of joins, doing very complicated queries for seemingly simple requests, touching a lot of tables, query avalanches. Touching lots of tables or lots of data, especially during updates, can lead to contention across the application, degrading performance everywhere (especially latency) and potentially leading to deadlock issues.
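For instance, a sketch of making the isolation level an explicit, per-operation choice rather than an inherited default (hypothetical context and model; snapshot isolation must be enabled on the database):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;
using Microsoft.EntityFrameworkCore;

static class IsolationDemo
{
    public static List<string> FriendsSnapshot(AppDbContext db, int userId)
    {
        // Reading under SNAPSHOT means this query neither blocks writers nor
        // is blocked by them, at the cost of seeing a (consistent) older version.
        using var tx = db.Database.BeginTransaction(IsolationLevel.Snapshot);
        var names = db.Friends
            .Where(f => f.UserId == userId)
            .Select(f => f.Name)
            .ToList();
        tx.Commit();
        return names;
    }
}
```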
Let's work an example along the lines of yours. Say I'm making a social media site and I want a friends list. Having learned absolutely nothing about the mental health impacts of social media, I want to add to my friends list the count of friends each friend has. In a Code-First EF project, the core query of the list might go from something like `me.Friends.Select(friend => friend.Name)` to something like `me.Friends.Select(friend => new { friend.Name, FriendCount = friend.Friends.Count() })`. As code operating on in-memory data structures, this would likely be pretty reasonable. As a query, this might involve joining a large Friends table to itself and also to a large Users table and doing a grouped aggregate. The win on the ORM side is that this is a fairly simple code change, much simpler than the corresponding change in SQL. Of course, that change in SQL is the difference between a very efficient query (given suitable indexes) and a questionable query[6].

If you do decide there's a performance issue you want to address, it's not really clear what to do on the EF side. At the level of SQL, there are a myriad of easy and natural options to improve the situation without changing the application's interface to the database at all. For example, it would be easy to use different isolation levels for getting the list of friends and getting their friend counts, and/or to split these into two separate transactions. We could (manually) materialize a view of the aggregation[7], which we maintain in the stored procedures that (un)friend someone, or via a background process. Most of these could also be done in the EF version, but they would be less natural and less obvious. They would also pollute the application's view of the database with physical-representation concerns that the stored-procedure layer can hide.
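Sketching one of those options in EF terms (same hypothetical model as the running example): split the page fetch from the aggregate, so each query stays narrow and the two can even run under different isolation levels or in separate transactions:

```csharp
using System.Linq;

static class FriendsPageDemo
{
    public static void FriendsPageWithCounts(AppDbContext db, int myId)
    {
        // Query 1: a narrow, well-indexed page of friend ids and names.
        var page = db.Friends
            .Where(f => f.UserId == myId)
            .OrderBy(f => f.FriendId)
            .Take(50)
            .Select(f => new { f.FriendId, f.FriendName })
            .ToList();

        // Query 2: one grouped aggregate over just those ids.
        var ids = page.Select(p => p.FriendId).ToList();
        var counts = db.Friends
            .Where(f => ids.Contains(f.UserId))
            .GroupBy(f => f.UserId)
            .Select(g => new { g.Key, Count = g.Count() })
            .ToDictionary(x => x.Key, x => x.Count);

        // Or hide the whole decision behind a stored procedure, so changing
        // the strategy later doesn't change the application's interface at all:
        //   db.Set<FriendRow>().FromSqlInterpolated($"EXEC dbo.GetFriendsWithCounts {myId}");
    }
}
```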
Ultimately, one of my biggest issues with ORMs in general is that they feed into developers' code-as-primary, data-secondary mindset, which I think is dangerously wrongheaded. In particular, things like "code-first" EF encourage developers to just not think about data representation and querying concerns, let alone learn about them. In many cases, careful design of the schema and queries is the biggest lever you can pull for performance and scaling. Understanding isolation levels and how your data relates to itself (and caches!) is also crucial for correctness and scaling.
All this said, there are certainly contexts where I wouldn't have a problem with something like code-first EF and might even recommend it. If your database is unlikely to get "large" (say, 1 million rows across all tables, which frankly would be tiny by today's standards) and doesn't involve any complex query logic, then code-first EF is likely an adequate and convenient solution. Some databases associated with microservices may be good examples.
For most of this answer I've focused on a Microsoft environment (.NET, Sql Server, EF), partially because that's the environment I'm most familiar with and partially because that's the environment the OP's question alludes to. Also, my impression is that Microsoft is doing one of the better jobs around tooling, and so is a better illustration of what, e.g., the deployment story can be, as opposed to what it was historically. For completeness, albeit slightly more theoretically: some RDBMSes, such as VoltDB, are specifically geared towards stored procedures due to how they handle replication and logging. In that case, not using stored procedures would be misusing VoltDB.
1. The one OLTP database that many applications interact with is a much rarer thing nowadays. Most relational databases are associated with a single application unless they're being used for analytics.
2. I've found Sql Server's table-valued parameters very useful here; see the sketch after these footnotes.
3. Nowadays, I'm more of the mind that application code should actually not be talking to a database at all but should instead be using a log like Kafka. That log will probably be consumed by something that writes to a database, but that's not the application code's concern.
4. In fact, I found testing stored procedures significantly nicer than testing code, as stored procedures are usually effectively pure functions, and having the full power of SQL to check results is nice.
5. The double standard being that we're comparing bad, or at least "typical", stored procedures against (atypically) "clean" code where, in practice, there are often a lot of difficulties mocking database access, or the mocking happens at a level that would apply just as well or even better if stored procedures were being used. That said, the testability story on the code side has also gotten noticeably better over time.
6. This example is probably so simple as not to be problematic, as the indexes will likely provide size information readily.
7. In this case, Sql Server should be able to handle this through its limited materialized-view (indexed view) features, so it could be maintained automatically.
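The table-valued parameter sketch promised in footnote 2 (the `dbo.IdList` table type and `dbo.BefriendMany` procedure are hypothetical): one round trip carries the whole set instead of one call per row.

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

class TvpDemo
{
    static void Main()
    {
        var connectionString = "Server=.;Database=App;Integrated Security=true;TrustServerCertificate=true";

        // Shape must match the SQL type: CREATE TYPE dbo.IdList AS TABLE (Id int NOT NULL);
        var ids = new DataTable();
        ids.Columns.Add("Id", typeof(int));
        foreach (var id in new[] { 2, 3, 5, 8 })
            ids.Rows.Add(id);

        using var conn = new SqlConnection(connectionString);
        conn.Open();

        using var cmd = new SqlCommand("dbo.BefriendMany", conn) { CommandType = CommandType.StoredProcedure };
        cmd.Parameters.AddWithValue("@UserId", 1);
        var tvp = cmd.Parameters.AddWithValue("@Others", ids);
        tvp.SqlDbType = SqlDbType.Structured;   // marks the parameter as a TVP
        tvp.TypeName = "dbo.IdList";
        cmd.ExecuteNonQuery();
    }
}
```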
0 comment threads