Posts
690
Comments
563
Trackbacks
1
It’s OK to do Reporting off of a RDBMS

Well, that’s a little misleading.  It’s OK to do Reporting off of a RDBMS as long as you do it right, and you should consider other options before committing to it. 

note: I’m using “Reporting” here in the traditional sense, not in the cqrs sense where pretty much anything that doesn’t involve a command is called “Reporting.”  Also, since I mostly know SQL Server, that’s what I’m going to be discussing here.  Also, yes, I know I’m glossing over a hell of a lot of stuff here.

The ‘Problem’

Suppose you have your traditional transactional system (it could be an eCommerce store, trading system, whatever), designed and optimized to handle inserts into it.  Indexes are aimed at preventing locking, data files (especially the transaction log) are located in different places to minimize hot spots and maximize I/O (SAN technology is pretty amazing these days), code is written correctly so that minimal numbers of query plans are generated which are then maximally used, yada yada yada.

Then along comes Sally Business User who wants to write a report that gets back whatever data she’s all hot to get information on, and happens to construct the query in such a way that joins on ten tables, all of which get locked, and which unfortunately returns the Cartesian Product of whatever table you have with the most rows.  Needless to say, the DB locks up and becomes unavailable, requiring a reboot, much gnashing of teeth, yada yada yada.

Of course, if your users are idiots, bad things can happen

But, of course, this is a straw man presentation.  Anybody can come up with stupid scenarios that don’t really address the pros and cons of using an RDBMS for reporting (or using one at all).  Instead, let’s take a closer look at why reporting off of an RDBMS can be problematic and how these problems can be addressed.

Joins can be costly

As a general rule, relational theory says that normalization is good (though this can be taken to an extreme…I once worked on a system where it took something like 8 joins to get a person’s cell phone number, but I digress).  This tends to lead to a proliferation of tables.  This means that when you want to read back related data, you have to join between a larger number of tables than in a denormalized system, and this can be a bad thing for a number of reasons.

A surprisingly large number of developers don’t really understand as much about transaction isolation levels as they really should, and so often times don’t even know how to write their queries with “(nolock)” properly implemented, which can lead to quite a lot of table locking.

The overhead of join conditions themselves (which, BTW, can also use “(nolock)”) isn’t that much (assuming you have proper indexes), additional conditions in the where clause increase, and can lead to very inefficient and costly execution plans if there aren’t proper indexes, they are in the wrong order, or aren’t sargable in the first place.

Aggregation can be costly

One of the greatest managers I ever worked with didn’t really like relational theory or SQL (which was somewhat ironic since data services was his department), and would at times dismissively wave his hand and say “blah blah blah, group by, order by, whatever.”  He knew it was important but didn’t really care much about the details (that’s what I was being paid for).

Well, all that ‘group by, order by, whatever’ can also greatly impact your execution plans for obvious reasons.  Simply selecting a group of rows is much different from selecting a group of rows while also finding your sums, maxes, mins that typically show up in a report.

Functions in where clauses are bad

Taking a piece of code and putting it in an UDF seems like a good idea.  The problem is that putting functions in where clauses makes the where clause non-sargable in most cases.  Even worse, if the function is doing any sort of complicated logic itself, it gets called for *every* row in the result set, not just once.

All this is getting in the way of the business anyway

Back in the dot.com heyday when working on eCommerce systems, the general principle was that our DB should ideally only be used when a customer was trying to give us their credit card number.  Obviously, this was an impossible ideal, but it was still a guiding principle (just as “eliminating crime” is an impossible ideal, but still a guiding principle). 

Well, obviously, reporting goes completely against that principle, which means that your scalability is limited by the amount of resources that are used for it, and as we’ve seen, that amount can be disproportionate to what you really want to be doing.

So, what to do?

Stop using an RDBMS for anything

One tactic to take is to “stop the madness” and not use an RDBMS at all.  Learning set theory, query optimization, etc. seems to require a lot of work, and can be never-ending.  An index that is good today might be bad tomorrow.  Statistics get out-dated.  And Joe the Developer is going to forget some table hint and lock the order table anyway.

And the experience of the Internet has shown that there seem to be hard limits in just how much data can be stored/processed/managed in an RDBMS.  That’s why Google and Amazon (the obvious examples) don’t center their businesses around them (the white papers on their internal architectures are fascinating to read). 

However, throwing the baby out with the bathwater isn’t generally a good option.  Most of us aren’t going to be building systems that scale to the size of Google or Amazon.  It’s good to dream, but deciding on how to architect a system should be based on realistic expectations as to the needs of the business it is supporting, and for almost all businesses (except maybe at the highest of high-ends and lowest of low-ends), an RDBMS hits the sweet spot.

Moreover, while you certainly don’t have to be a Google or an Amazon to use something other than an RDBMS, making that choice requires learning different ways of doing things, with less supporting literature.  SQL has been around for a long time and in many different environments.  While ‘NoSQL’ style databases have been around for ages, there simply isn’t the high level of ‘Google it’ knowledge that you can rely on to solve any particular practical issue you might be facing.  If you don’t know much about SQL, but know that your system is experiencing blocking, you can pretty quickly learn how to identify the causes of it and devise a solution.

Don’t use an RDBMS for reporting, that’s why God invented OLAP

If you want to limit the resources that are hitting your database for non-‘credit card submission’ purposes, then obviously, you should move those ‘non-critical’ resources somewhere else, and an obvious solution is another database.  In fact, if you really want to do it ‘correctly’, put in place an OLAP system.  Take your highly normalized transactions and ship them off to another system that denormalizes it, aggregates it, precisely for data mining and reporting purposes.  The output of such a system will be familiar to anyone who, for instance, uses the Pivot function in Excel.  That’s right, your business users, the ones who want the reports in the first place.

With SQL Server, it comes with the product (well, not the Express version, I guess) as Analysis Services, so if you can already afford the cost of the license, you get it with no additional cost.  So, since it was designed precisely to do the reporting that needs to be done, and it comes with the product, why wouldn’t you use it?  You do have to afford additional hardware (you could use it on the same instance, but that kind of defeats the purpose), but that’s going to be true just about no matter what.  Seems like the obvious answer.

The problem is cost.  Not cost of the product, but opportunity cost.  Learning OLAP concepts and technologies appears to be something like an order of magnitude more difficult than learning OLTP concepts.  Normalization is easier to understand than star schema.  T-SQL is an easier language to work with than MDX.  Because of this, it is harder to find people to staff a business if it uses it.  Significantly harder.

Is this a definitive argument against OLAP?  Of course not.  In fact, if you are the sort of person who likes data, and who would like a fairly secure, fairly well-paid profession so that they could support and raise a family (in other words, the truly important stuff in life other than this geek crap), I would encourage you to learn this stuff.  Action pack subscriptions or MSDN subscriptions aren’t free, but they can give you licenses of SQL Server versions that include SSAS.

But, the main reason why OLAP hasn’t taken off as much as one might think (though I guess that is changing over time) is that there is another, cheaper option.

God also invented Replication

Part of every version of SQL Server (though with some limitations), replication allows you essentially to take your database and copy its data, in near real-time and in as close to the form as it exists in as you want, to another database, more or less automatically.

You still need the additional hardware (since replicating to the same server kind of defeats the purpose), but you can keep your better understood schema, write your reports in T-SQL, and offload almost all of those resources from your main database.  There is a slight overhead in setting up and running replication but it isn’t much, and improves with each version (so, with SQL 7, setting up replication caused table locks, so you had to do it at 3AM, where in 2000, it only caused row locks, etc. etc. etc.).  If a bad query hits your replica and locks it up, then it won’t affect the main DB (as long as was setup that way). 

Is it perfect?  Nope.  You still have to do things right, you still are querying against a highly normalized database (leading to the common “why is my report timing out” problem), there is overhead involved both in terms of actual CPU as well as human overhead in terms of additional monitoring and support, etc. etc. etc.

But many businesses find that it is a perfectly acceptable solution in many situations.

Relational Databases aren’t going anywhere, neither is reporting off of them

It really is okay to do reporting off an RDBMS.  Despite what people think, they are going to be around for a long, long time, and that’s okay too.  Will non-relational databases grow in the market?  That’s hard to predict, but because of the experience of running systems on the Web, I think it will.  And that’s okay too.  The idea that there’s only one way to architect a system is generally nutty anyway.

posted on Sunday, February 07, 2010 4:20 PM
Comments
No comments posted yet.

Post Comment

Title *
Name *
Email
Url
Comment *  
Please add 7 and 4 and type the answer here: