Fun with SQL: Analytics and Heirarchical
I've had this problem since yesterday and I believe I finally solved it.
Given this data:
ID DIFF_ID START_DAT END_DATE AMOUNT
--------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
2 4 01-FEB-08 29-FEB-08 0
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
5 4 01-MAY-08 31-MAY-08 0
6 1 01-JAN-08 31-JAN-08 10
7 1 01-FEB-08 29-FEB-08 0
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
For each consecutive time period (month) that there is an amount, count how many buckets, up to six.
First thought was definitely Analytics. I toiled away on what became a very unwieldy query (took more than one page anyway). A whole bunch of LAGs with the same number of ever increasing CASE statements.
My first obstacle overcome was to filter out those that had a 0 for amount. That left me with:
ID DIFF_ID START_DAT END_DATE AMOUNT
---------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
6 1 01-JAN-08 31-JAN-08 10
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
It took a good while to figure that out for some reason.
Once I had that figured, I needed to figure out which were consecutive.
Frank Zhou is always solving puzzles with SQL and I remembered I had responded to one of his about a year ago. If you get a chance, please take a look at his site...he solves some pretty cool puzzles with SQL using the MODEL clause and analytics.
Anyway, his post,
How to find the earliest start date and the latest end date for consecutive transactions in SQL was similar (and my response similar), so I found it to revisit my thinking at the time.
First, I use the LAG function to get the previous row's ID (unique) and call it PREV_ID. I use DIFF_ID in the PARTITION clause (window) and order by END_DATE; then add one to see if the months are consecutive. If that value matches the START_DATE of the current row, it's consecutive and I use LAG again to get the previous row's ID.
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
That produces the following output:
DIFF_ID ID START_DAT END_DATE PREV_ID AMOUNT
-------- ---------- --------- --------- ---------- ----------
1 6 01-JAN-08 31-JAN-08 10
1 8 01-MAR-08 31-MAR-08 10
1 9 01-APR-08 30-APR-08 8 10
1 10 01-MAY-08 31-MAY-08 9 10
4 1 01-JAN-08 31-JAN-08 40
4 3 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 3 10
As you can see, I have 3 records with the PREV_ID populated.
As I am building it, I realize I keep nesting the queries, so in comes the WITH clause (when I first learned of that it was terribly difficult to search for, I didn't know it was also called subquery factoring clause...).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
id,
start_date,
end_date,
TO_DATE( SUBSTR( SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 ), 'MMDDYYYY' ) min_start_date
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
Much better. Note the START WITH and CONNECT BY PRIOR, I created my own heirarchical table to determine another window to PARTITION on (MIN_START_DATE of the consecutive records).
DIFF_ID ID START_DAT END_DATE MIN_START
------- ---------- --------- --------- ---------
1 6 01-JAN-08 31-JAN-08 01-JAN-08
1 8 01-MAR-08 31-MAR-08 01-MAR-08
1 9 01-APR-08 30-APR-08 01-MAR-08
1 10 01-MAY-08 31-MAY-08 01-MAR-08
4 1 01-JAN-08 31-JAN-08 01-JAN-08
4 3 01-MAR-08 31-MAR-08 01-MAR-08
4 4 01-APR-08 30-APR-08 01-MAR-08
Now all I have to do is PIVOT the table (I chose not to use the new PIVOT feature) on DIFF_ID and add an analytic COUNT on my new window (MIN_START_DATE).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
COUNT( CASE WHEN d = 1 THEN 1 ELSE NULL END ) b1,
COUNT( CASE WHEN d = 2 THEN 1 ELSE NULL END ) b2,
COUNT( CASE WHEN d = 3 THEN 1 ELSE NULL END ) b3,
COUNT( CASE WHEN d = 4 THEN 1 ELSE NULL END ) b4,
COUNT( CASE WHEN d = 5 THEN 1 ELSE NULL END ) b5,
COUNT( CASE WHEN d = 6 THEN 1 ELSE NULL END ) b6
FROM
(
SELECT
diff_id,
COUNT( id ) OVER
( PARTITION BY diff_id, SUBSTR(
SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 )
ORDER BY end_date ) d
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
)
GROUP BY diff_id;
And voila!
DIFF_ID B1 B2 B3 B4 B5 B6
------- ---- ---- ---- ---- ---- ----
1 2 1 1 0 0 0
4 2 1 0 0 0 0
Problem solved!
Table creation and data can be found
here.
Labels: sql