Derik has posted one of those ‘Sprocs are evil’ type rants (he did label it as a rant, so I have to give him props for that). Probably unrelatedly, Jeremy Miller posted something that is sort of related (though his point was really a larger one). Regardless, I have to respond. (Update: I mean responding in general, not line by line discussion of either Derik's or Jeremy's posts...more of a rant using them as a launching pad).
(Update #2: Derik wants to make it clear that he doesn't think sprocs are evil. I think it is a semantic point, but just to make it clear, he thinks business logic in sprocs is evil. This doesn't change anything, but he seems to feel strongly about making the distinction, so consider it made.)
disclaimer #1: I have no doubt that Derik is a better developer than I am. Same goes for Jeremy (doubly).
disclaimer #2: I’ve been a SQL developer (among other things) for about 10 years.
disclaimer #3: despite, or perhaps because of, #2, I caught the ORM bug about 5 years ago, with the Wilson ORMapper. I’ve used NHibernate (but nowhere near as much as I’d like to), and more recently, on a number of projects, I am using Linq to SQL (yeah, I know, it’s dead, and yeah, it may or may not be a full blown ORM tool…whatever). When writing my own code, I don’t want to handwrite SQL or use sprocs. At all, if I can get away with it.
Having said all that, I have found any number of times where using T-SQL in general, or sprocs in particular, was the right choice. In no particular order, here are some of the reasons why:
ETL: if you’ve ever found yourself having to move large amounts of data (for the hell of it, let’s say 1 GB or more) from one source to another, especially where the target is a SQL database, T-SQL in my experience is almost always more efficient than managed code. Even if you can get the data into memory in the first place to do any of the ‘T’ (Transform) of ETL that is required, you are better off letting the SQL engine do it, especially if it is set-based.
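As a sketch of what "letting the SQL engine do the ‘T’" looks like, here is a minimal set-based load. All table and column names here are hypothetical, invented for illustration; the point is that one statement transforms and loads the whole set, instead of looping over rows in managed code:

```sql
-- Hypothetical staging-to-target load; dbo.StagingCustomers and
-- dbo.Customers are made-up names. The trim/concatenation is the
-- 'T' of ETL, done set-based inside the engine.
INSERT INTO dbo.Customers (CustomerId, FullName, CreatedOn)
SELECT s.SourceId,
       LTRIM(RTRIM(s.FirstName)) + ' ' + LTRIM(RTRIM(s.LastName)),
       GETDATE()
FROM dbo.StagingCustomers s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Customers c
                  WHERE c.CustomerId = s.SourceId);
```

One statement like this will typically beat a managed-code loop that pulls a gigabyte over the wire, transforms it in memory, and pushes it back one row at a time.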
Performance: most of the ‘debate’ about sprocs vs. code when it comes to performance has centered on how SQL caches execution plans, or on network traffic. As has been beaten to death already, this is irrelevant these days. While a ‘raw’ query sent over the wire may be larger than a sproc call, this almost never has a noticeable impact. We aren’t connecting to our SQL data-stores over a modem. And when parameterized (as almost any decent ORM will do for a dynamic query), a ‘raw’ query will result in an execution plan that is cached just as efficiently as a sproc. So, none of that matters these days.
What does matter are indexes (and also statistics, but I’ll ignore those for now). Any developer worth anything working against a data store will know what a join is (if they don’t, get them far, far, away from your system). But they often don’t know the importance of indexes, how they are tuned, and how to write code that properly uses them. And when you are writing code that hits a table that has a couple hundred thousand rows (to say nothing of millions), knowing how to do this is critical to performance.
Unless otherwise restricted, an ORM will happily allow a developer to write code that produces where clauses that don’t properly use indexes. Even against a small dataset, badly performant queries are easily produced, but on a much larger dataset, this can wreak havoc on a production system. This is why so many SQL developers and DBA-types hate ORM technology.
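A classic example of a where clause that defeats an index (the schema here is hypothetical, assuming an Orders table with an index on OrderDate): wrapping the indexed column in a function makes the predicate non-sargable, so the engine scans every row instead of seeking on the index.

```sql
-- Index on OrderDate is ignored: the function hides the column
-- from the optimizer, forcing a full scan.
SELECT OrderId
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2009;

-- Rewritten as a range on the bare column, the same index
-- can be used for a seek.
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate >= '20090101' AND OrderDate < '20100101';
```

Both queries are logically identical, and plenty of ORM-generated (or hand-written) code ends up shaped like the first one.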
Even a good SQL developer can get this wrong because, as I like to joke at times, SQL seems to exhibit quantum-like behavior (it doesn’t actually, of course). Understanding execution plans takes some knowledge. With SQL Server, for instance, a Merge Join can sometimes perform better than an Apply, and sometimes perform worse. I bet that NHibernate can help ‘hint’ which to use (or at least, given my ignorance of how you would get it to do this, I wouldn’t bet against it), but almost no developer is going to know which to produce even if they knew how to get their ORM of choice to produce it. Have you ever had to answer a test question about whether Exists or In is faster in a query? The textbook answer is obvious, but any SQL developer knows that either one can be faster than the other, depending. Joining on a sub-select is generally the right way to go either way. Does your ORM generate SQL code that knows to use Exists vs. In vs. a sub-select?
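To make the Exists vs. In vs. sub-select point concrete, here are three logically equivalent ways to find customers who have orders (Customers and Orders are hypothetical tables). Which one performs best depends on the data distribution and what the optimizer does with it; the execution plan, not the textbook, is the arbiter:

```sql
-- Option 1: EXISTS with a correlated sub-query.
SELECT c.CustomerId
FROM dbo.Customers c
WHERE EXISTS (SELECT 1 FROM dbo.Orders o
              WHERE o.CustomerId = c.CustomerId);

-- Option 2: IN with a sub-query.
SELECT c.CustomerId
FROM dbo.Customers c
WHERE c.CustomerId IN (SELECT o.CustomerId FROM dbo.Orders o);

-- Option 3: join on a de-duplicated sub-select.
SELECT c.CustomerId
FROM dbo.Customers c
JOIN (SELECT DISTINCT CustomerId FROM dbo.Orders) o
    ON o.CustomerId = c.CustomerId;
```

An ORM will generally pick one of these shapes for you; a sproc lets someone pick the shape the plan actually favors.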
Very recently, I had to write code that updated a couple of values in each row of a table that had a little less than one hundred thousand rows. Thinking of it purely logically (in the sense of, given A, B, and C, I want D and E to happen), I wrote some T-SQL that did it, quite efficiently, thank you. I’ve worked with brilliant SQL developers enough to know I’m not one of them, but I’m pretty good, thank you again. But the update statement hung the server. Looking at what the code did, there was no obvious reason why. In order to get it to work properly, I had to change the T-SQL to perform the exact same update in batches of 100 (I think I increased the batch size to 1000 eventually) and it executed successfully in a couple of minutes.
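The batching fix described above looks roughly like this (a minimal sketch; the table, columns, and the WHERE predicate that marks unprocessed rows are all invented for illustration). Chunking the update keeps the transaction log growth and lock footprint per statement small:

```sql
-- Hypothetical batched rewrite of a full-table update.
DECLARE @BatchSize INT = 1000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) dbo.BigTable
    SET ColD = 'd',
        ColE = 'e'
    WHERE ColD IS NULL;          -- predicate identifying unprocessed rows

    IF @@ROWCOUNT = 0 BREAK;     -- nothing left to update, we're done
END
```

The logic is the exact same update; only the unit of work changes, which is why it finished in minutes where the single statement hung.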
Forcing people to use sprocs can force them to use code that has been analyzed and tweaked to use all available indexes (or has prompted the DBA-types to create the relevant indexes to support the operations required). Is it *possible* that this can always be done from managed code? Since I used the word ‘always’, the answer is no. Can it *often* be done from managed code? I doubt it.
Do you have a large dataset you are going to be hitting in your production environment? Take a code-base that allows one to use an IRepository that has a Find method or a Query method that can take in an ad-hoc query. Then, go grab the pager and be willing to do L3 support when the tickets roll in about why the list page doesn’t return (yes, even a list page).
Testability: Derik mentioned that one of the problems with having business logic in sprocs is that it inhibits testability, especially when it comes to TDD. If you have to call a sproc from managed code, then you have a problem in setting up the quickly repeatable unit tests that are the hallmark of writing code test first (stated this way so that BDD counts as much as TDD, etc.). This is, as far as I can tell, entirely accurate. Most ORMs that are worth anything make it easy enough to write managed code that calls a sproc (so we don’t have to go back to the ‘dark’ days of .NET data access and create a SqlCommand, set the type to procedure, map the data to each input parameter, and all that painful stuff), but when doing test first development, it is a lot harder to mock or stub this out. I don’t think there is much to debate about this.
But, I’m sorry, if you are going to be hitting large datasets in your production environment, fully legitimate tests *have* to take performance into account. This isn’t premature optimization, it is a known variable. Surely (don’t call me ‘Shirley’), you want to test the functionality of your code outside of other concerns, and that is a legitimate desire to have. But if it is *known* that you have a data store in the tens of GBs, never mind tens of TBs, code that fails to perform adequately is failed code, regardless of whether or not it meets the design specifications.
Also very recently, I decided, in one of my own projects (so I wasn’t hampered by client requirements, since I had control over that) to write a stored procedure, the only stored procedure in the current code base. Why? What I needed to do was to create a temp table, and then, driven off the input variable, run queries against 4 separate tables that each updated the temp table, and then at the end, join that temp table back to a key table and return a resultset. Could I have done that purely in managed code? Probably a bad question. Could Ayende have done it in 3 lines of Boo? Probably. But Ayende isn’t working on my project, and he isn’t working on your project either. In this case, I needed a final resultset that was best produced by letting the database engine produce it. The point remains that T-SQL is very good at certain things, and should be allowed to do them.
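The shape of that sproc, sketched with invented names (GetAggregatedResults, KeyTable, TableOne, and the rest are all hypothetical; the real code queried four tables, abbreviated here):

```sql
-- Sketch of the temp-table accumulation pattern described above.
CREATE PROCEDURE dbo.GetAggregatedResults
    @InputId INT
AS
BEGIN
    CREATE TABLE #Work (KeyId INT PRIMARY KEY, Total INT NOT NULL DEFAULT 0);

    -- Seed the temp table from the key table.
    INSERT INTO #Work (KeyId)
    SELECT KeyId FROM dbo.KeyTable;

    -- Each source table updates the accumulator in turn.
    UPDATE w SET w.Total = w.Total + t.Amount
    FROM #Work w
    JOIN dbo.TableOne t ON t.KeyId = w.KeyId
    WHERE t.InputId = @InputId;

    -- ...repeat the same UPDATE for the remaining three tables...

    -- Join back to the key table for the final resultset.
    SELECT k.KeyId, k.Name, w.Total
    FROM #Work w
    JOIN dbo.KeyTable k ON k.KeyId = w.KeyId;
END
```

Every step stays inside the engine, next to the data, which is exactly what this kind of multi-table accumulation wants.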
If one is writing a ‘trivial’ (in the sense of technical requirements) application like Dimecasts or a basic eCommerce site (as one of my current projects entails), should you write a bunch of stored procedures to perform your business logic? Hell no. Linq to SQL code like from p in db.Products where p.ID == id select p will get you what you need, 99% of the time. Even with more complicated business logic, from a developer productivity perspective, and from a performance perspective, you aren’t going to need it, where ‘it’ references sprocs. But don’t think for a minute that from these basic cases you can make the case that sprocs are evil.
In a perfectly ideal world, all of your managed code developers will understand relational databases as well as your SQL developers, and will know how to make the adjustments to their code accordingly. In this same perfectly ideal world, the ORM of choice of the managed code developers will allow them to create managed code that utilizes the full power of the data store they are using. If you are not lucky enough to inhabit this perfectly ideal world, don’t reject sprocs because of some generic arguments.
(Some smartass will recommend sharding the database when the datasets get to be too big, or to use an OODBMS instead of an RDBMS…if you are in a position to do either, god bless (and I think sharding will require some database layer anyway), and I’m available for hire).
If you still think sprocs are evil, go ahead and take ownership of that pager, and I’ll talk to you at 3AM.