Monday, March 14, 2011

Fun with SQL - Part 578

by Enrique Aviles [twitter]
The first guest post in a while. I met Mr. Aviles via Twitter (yay for Twitter!) some time back. We got to meet in real life at our local SOUG meeting when Tanel Põder came to town. He's also graced the front page for a while wearing his oraclenerd t-shirt (which is now his avatar on Twitter).

Part of my job consists of tuning SQL queries. Opportunities abound and present themselves in many different scenarios. A few weeks ago I was assigned a trouble ticket. I was the lucky winner because, after some investigation by the developers, the culprit was identified as an SQL issue that was causing a stack overflow in Hibernate. For those unfamiliar with it, Hibernate is an Object Relational Mapping library for Java. The stack overflow occurred during string manipulation, probably during the code generation phase. I didn’t have to do much investigation of my own because the developers indicated the offending query was composed of 60 IN clauses, each clause containing 1000 elements. The query resembled this:
WHERE UNIQUE_ID IN (1, 2, 3, ..., 1000)
OR UNIQUE_ID IN (1001, 1002, 1003, ..., 2000)
OR UNIQUE_ID IN (2001, 2002, 2003, ..., 3000)
...
OR UNIQUE_ID IN (59001, 59002, 59003, ..., 60000)
I was happy to see the query generation failed before it reached the database. It is usually better to have the application fail with a glaring error during unit tests than to allow a potential failure to slip into production. Fortunately, we test against a database that’s a slightly older copy of the production database, so we execute all our queries using real data volumes, not just against tiny artificial development databases. Had we only tested on a development database, this issue would have made it to production with two possible outcomes:
- The application fails with the Hibernate stack overflow error that triggered the trouble ticket.
- Given a smaller set of unique ids, Hibernate successfully generates the query and sends it to the database.

Since the query never reached the database, it is hard to tell how well it would have performed.

For a handful of IN clauses, it is safe to assume the query would have run in a reasonable amount of time, but 60 IN clauses could have taken an unacceptably long time. Even if the database is able to process 60 IN clauses with no impact on the user, generating such statements as
UNIQUE_ID IN (1001, 1002, ..., 2000)
UNIQUE_ID IN (2001, 2002, ..., 3000)
...
UNIQUE_ID IN (59001, 59002, 59003, ..., 60000)
in a query is not a good idea for various reasons. In the absence of bind variables, a lot of CPU will be consumed hard parsing SQL statements, since potentially every generated query becomes a brand new query sent to the database even if the number of IN clauses remains static. Using bind variables lessens the load but doesn’t eliminate hard parsing as much as it should, because queries with a different number of IN clauses have different text and are therefore unshareable. Both cases fill the shared pool with unshareable SQL, although the absence of bind variables would fill it faster. This is a waste of CPU cycles and memory. We can significantly reduce hard parsing and eliminate the risk of generating a huge SQL statement by using a temporary table.
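
One way to see this kind of shared pool waste on a running system is to look for groups of cursors that differ only in their literal values. The following is a rough diagnostic sketch, not part of the original investigation. It assumes the session can query V$SQL, and the threshold of 10 is an arbitrary illustration:

```sql
-- Assumption: the current user has SELECT access to V$SQL.
-- Statements that differ only in literal values share a FORCE_MATCHING_SIGNATURE,
-- so many cursors under one signature points to literal-laden, unshareable SQL.
SELECT force_matching_signature,
       COUNT(*)          AS cursor_count,
       SUM(sharable_mem) AS shared_pool_bytes,
       MIN(sql_text)     AS sample_text
  FROM v$sql
 WHERE force_matching_signature <> 0
 GROUP BY force_matching_signature
HAVING COUNT(*) > 10          -- arbitrary threshold for illustration
 ORDER BY cursor_count DESC;
```

Statements with a different number of IN elements or IN clauses get different signatures, so each variation in list length would show up as its own group.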

Instead of selecting all the values that eventually ended up in a series of IN clauses and sending them to the application server, we could simply insert those values into a temporary table and modify the original query to join to the temporary table. The following is a sample temporary table:
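
A minimal sketch of such a table, assuming Oracle's global temporary table syntax. The names TMP_UNIQUE_IDS and UNIQUE_ID are hypothetical, and ON COMMIT PRESERVE ROWS is an assumption chosen to match rows that last until the session ends:

```sql
-- Hypothetical table; not the definition used in the original system.
-- ON COMMIT PRESERVE ROWS keeps each session's rows until that session ends;
-- ON COMMIT DELETE ROWS would instead clear them at every commit.
CREATE GLOBAL TEMPORARY TABLE tmp_unique_ids (
    unique_id NUMBER NOT NULL
) ON COMMIT PRESERVE ROWS;
```
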
The table was loaded with an INSERT built on the same SELECT that originally returned all the IDs used to build the IN clauses. Once the temporary table contained all desired IDs, the application generated the following query:
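
A sketch of the load step and the resulting query. It assumes a temporary table TMP_UNIQUE_IDS with a single UNIQUE_ID column, a table CANDIDATE_ITEMS that supplies the IDs, and a table TARGET_ITEMS being queried. All three names and the STATUS filter are hypothetical stand-ins, not the original objects:

```sql
-- Load step: the SELECT that used to feed the application now feeds the temporary table.
-- candidate_items and its filter are placeholders for the original ID query.
INSERT INTO tmp_unique_ids (unique_id)
SELECT c.unique_id
  FROM candidate_items c
 WHERE c.status = 'PENDING';

-- Lookup step: the statement text is the same no matter how many IDs were loaded.
SELECT t.*
  FROM target_items t
 WHERE EXISTS (SELECT 1
                 FROM tmp_unique_ids x
                WHERE x.unique_id = t.unique_id);
```
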
Of course, we could also have generated a query with an INNER JOIN or with a single IN clause, but EXISTS worked fine for our case. This is obviously a much shorter SQL text. It remains static regardless of how many IDs are selected since they are hidden in the temporary table. Additionally, there is no need to send all IDs to the application server, spend time building a long character string, and send them back as part of a SQL query. This translates into one hard parse and many executions, so the risk of wasting shared pool memory and CPU is eliminated.
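
For comparison, the two alternatives might look roughly like this. The sketch assumes a temporary table TMP_UNIQUE_IDS with a UNIQUE_ID column and a queried table TARGET_ITEMS; both names are hypothetical:

```sql
-- INNER JOIN form. If the temporary table can hold the same ID twice,
-- this returns duplicate rows, which EXISTS and IN do not.
SELECT t.*
  FROM target_items t
 INNER JOIN tmp_unique_ids x
    ON x.unique_id = t.unique_id;

-- Single IN clause over a subquery instead of a literal list.
SELECT t.*
  FROM target_items t
 WHERE t.unique_id IN (SELECT x.unique_id
                         FROM tmp_unique_ids x);
```
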

The “magic” of temporary tables makes this a safe solution since inserted data is visible only to the session that inserted it. User A cannot select data inserted by user B and vice versa. Temporary data disappears automatically after the session disconnects, so there is no need to explicitly DELETE or TRUNCATE the table. At the end of the day, this proved to be a suitable solution.
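
The session isolation can be checked with two separate connections. This sketch assumes a global temporary table TMP_UNIQUE_IDS (a hypothetical name) with a single NUMBER column UNIQUE_ID, created with ON COMMIT PRESERVE ROWS:

```sql
-- Session A
INSERT INTO tmp_unique_ids (unique_id) VALUES (42);
COMMIT;
SELECT COUNT(*) FROM tmp_unique_ids;   -- counts session A's own rows

-- Session B (a separate connection, even when logged in as the same database user)
SELECT COUNT(*) FROM tmp_unique_ids;   -- does not see session A's rows
```

One thing to keep in mind: if the application reuses pooled connections, a session may outlive a single request, so rows can linger until the connection is actually closed. Defining the table with ON COMMIT DELETE ROWS, or clearing it before each load, may be worth considering in that case.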

The fact that IN clauses only support 1000 elements was seen as a limitation. If the database imposes limits, chances are that defeating them with apparently clever solutions will result in slow queries and will increase the risk of wasting system resources. When faced with similar limitations, we should ask ourselves if there are other means of achieving the same result that don’t require circumventing what was considered a database shortcoming.


Timur Akhmadeev said...

Huh. Been there, seen this :-D
Another brilliant example on the same topic of passing an object IDs array to a query to IN clause that caught my eye recently: imagine you have a standard object type in the database that is defined as "table of number"; it is heavily used throughout the code; there is a simple API to bind array of this type from Java - with tons of the example usages all over the code and in the local list of "Best practices". Now, when a developer tried to fix "ORA-01795: maximum number of expressions in a list is 1000" he/she did the following: instead of passing an array from Java, the statement was modified so now the query looks like this:
... ID in (select * from table(type_array(1, 2, 3, ..., 1000, 1001, ...)))

Enrique Aviles said...

Yes, definitely a similar situation, although your example is a bit more sophisticated since it selects from an object type :)

Clever Idea Widgetry said...

"The fact that IN clauses only support 1000 elements was seen as a limitation."

@Enrique - your duty as a database professional was to slip poison into the coffee of the professional that stated this.