ORACLENERD twitter/oraclenerd view chet justice's profile on LinkedIn feed oraclenerd t-shirts Unemployment Clock:

  Fun with SQL: Analytics and Heirarchical
I've had this problem since yesterday and I believe I finally solved it.

Given this data:
       ID    DIFF_ID START_DAT END_DATE      AMOUNT
--------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
2 4 01-FEB-08 29-FEB-08 0
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
5 4 01-MAY-08 31-MAY-08 0
6 1 01-JAN-08 31-JAN-08 10
7 1 01-FEB-08 29-FEB-08 0
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
For each consecutive time period (month) that there is an amount, count how many buckets, up to six.

First thought was definitely Analytics. I toiled away on what became a very unwieldy query (took more than one page anyway). A whole bunch of LAGs with the same number of ever increasing CASE statements.

My first obstacle overcome was to filter out those that had a 0 for amount. That left me with:
        ID    DIFF_ID START_DAT END_DATE      AMOUNT
---------- ---------- --------- --------- ----------
1 4 01-JAN-08 31-JAN-08 40
3 4 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 10
6 1 01-JAN-08 31-JAN-08 10
8 1 01-MAR-08 31-MAR-08 10
9 1 01-APR-08 30-APR-08 10
10 1 01-MAY-08 31-MAY-08 10
It took a good while to figure that out for some reason.

Once I had that figured, I needed to figure out which were consecutive. Frank Zhou is always solving puzzles with SQL and I remembered I had responded to one of his about a year ago. If you get a chance, please take a look at his site...he solves some pretty cool puzzles with SQL using the MODEL clause and analytics.

Anyway, his post, How to find the earliest start date and the latest end date for consecutive transactions in SQL was similar (and my response similar), so I found it to revisit my thinking at the time.

First, I use the LAG function to get the previous row's ID (unique) and call it PREV_ID. I use DIFF_ID in the PARTITION clause (window) and order by END_DATE; then add one to see if the months are consecutive. If that value matches the START_DATE of the current row, it's consecutive and I use LAG again to get the previous row's ID.
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
That produces the following output:
 DIFF_ID         ID START_DAT END_DATE     PREV_ID     AMOUNT
-------- ---------- --------- --------- ---------- ----------
1 6 01-JAN-08 31-JAN-08 10
1 8 01-MAR-08 31-MAR-08 10
1 9 01-APR-08 30-APR-08 8 10
1 10 01-MAY-08 31-MAY-08 9 10
4 1 01-JAN-08 31-JAN-08 40
4 3 01-MAR-08 31-MAR-08 10
4 4 01-APR-08 30-APR-08 3 10
As you can see, I have 3 records with the PREV_ID populated.

As I am building it, I realize I keep nesting the queries, so in comes the WITH clause (when I first learned of that it was terribly difficult to search for, I didn't know it was also called subquery factoring clause...).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
id,
start_date,
end_date,
TO_DATE( SUBSTR( SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 ), 'MMDDYYYY' ) min_start_date
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
Much better. Note the START WITH and CONNECT BY PRIOR, I created my own heirarchical table to determine another window to PARTITION on (MIN_START_DATE of the consecutive records).
DIFF_ID         ID START_DAT END_DATE  MIN_START
------- ---------- --------- --------- ---------
1 6 01-JAN-08 31-JAN-08 01-JAN-08
1 8 01-MAR-08 31-MAR-08 01-MAR-08
1 9 01-APR-08 30-APR-08 01-MAR-08
1 10 01-MAY-08 31-MAY-08 01-MAR-08
4 1 01-JAN-08 31-JAN-08 01-JAN-08
4 3 01-MAR-08 31-MAR-08 01-MAR-08
4 4 01-APR-08 30-APR-08 01-MAR-08
Now all I have to do is PIVOT the table (I chose not to use the new PIVOT feature) on DIFF_ID and add an analytic COUNT on my new window (MIN_START_DATE).
WITH sub
AS
(
SELECT
diff_id,
id,
start_date,
end_date,
( CASE
WHEN LAG( end_date ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC ) + 1 = start_date
THEN
LAG( id ) OVER
( PARTITION BY diff_id
ORDER BY end_date ASC )
ELSE NULL
END ) prev_id,
amount
FROM col_test
WHERE amount > 0
)
SELECT
diff_id,
COUNT( CASE WHEN d = 1 THEN 1 ELSE NULL END ) b1,
COUNT( CASE WHEN d = 2 THEN 1 ELSE NULL END ) b2,
COUNT( CASE WHEN d = 3 THEN 1 ELSE NULL END ) b3,
COUNT( CASE WHEN d = 4 THEN 1 ELSE NULL END ) b4,
COUNT( CASE WHEN d = 5 THEN 1 ELSE NULL END ) b5,
COUNT( CASE WHEN d = 6 THEN 1 ELSE NULL END ) b6
FROM
(
SELECT
diff_id,
COUNT( id ) OVER
( PARTITION BY diff_id, SUBSTR(
SYS_CONNECT_BY_PATH(
TO_CHAR( start_date, 'MMDDYYYY' ), '-' ), 2, 8 )
ORDER BY end_date ) d
FROM sub
START WITH prev_id IS NULL
CONNECT BY PRIOR id = prev_id
)
GROUP BY diff_id;
And voila!
DIFF_ID   B1   B2   B3   B4   B5   B6
------- ---- ---- ---- ---- ---- ----
1 2 1 1 0 0 0
4 2 1 0 0 0 0
Problem solved!

Table creation and data can be found here.

Labels: , , ,

 
  Google Analytics
I've been using Google Analytics on this website since the beginning.

It's an absolutely amazing tool. It allows you to view statistics on site usage from so many different angles. The only thing required is inserting a few lines of javascript on your template (for blogger anyway), and stats are collected.

The dashboard (the main page) allows you to take multiple views of your data and drill down if necessary. It includes a drop-down to select the date range (flash I believe) and also to compare up to 2 metrics at once.



I spend an inordinate amount of time on the site. Any new visitors? How many so far today? What pages are being viewed?



Of course it's not perfect. I'm not sure, other than requiring all users to login, how you can determine the number of unique visitors.

Far and away my favorite view is the one that shows you the distribution by continent, sub continent region, country/territory, or city. Recently they added the ability to drill down as well.



It's definitely cool to see the different people from all over the world who have viewed the site.

Wait, how could I forget? The cost? Free!

Doesn't get any better than that.

Labels: ,

 


Guest Authors

How To

Popular

Previous Posts

Code Projects

Archives
August 2007 / September 2007 / October 2007 / November 2007 / December 2007 / January 2008 / February 2008 / March 2008 / April 2008 / May 2008 / June 2008 / July 2008 / August 2008 / September 2008 / October 2008 / November 2008 / December 2008 / January 2009 / February 2009 / March 2009 / April 2009 / May 2009 / June 2009 / July 2009 / August 2009 / September 2009 / October 2009 / November 2009 / December 2009 / January 2010 / February 2010 / March 2010 /


Aggregated by OraNA