Friday, November 4, 2011

OLAP Again

The following question

"How do I select the first row in each partition of a table?"
was discussed in these two articles
Unrequited OLAP
OLAP Counseling
and the conclusion seemed to be
"The new FIRST_VALUE() aggregate function together with the OLAP WINDOW clause might seem to be ideal, but it's not. The RANK() function can be made to work with the WINDOW clause, but the Old-School solution using MIN() and GROUP BY is actually simpler... and easier to understand for some (ok, me)."
You can see a side-by-side comparison here. Scroll down to the bottom of that article and you will see this parting shot
"...just because the Old School solution worked it doesn't mean we can give up on learning about OLAP and retreat to the familiar."

Onwards and upwards...


Here's a newer, slightly harder question:
"How do I select the top 10 rows in each partition of a table?"
Let's start with a sample table containing 300 rows in three partitions:
CREATE TABLE t ( 
   partition_id     INTEGER NOT NULL,
   entry_id         INTEGER NOT NULL DEFAULT AUTOINCREMENT,
   data             VARCHAR ( 10 ) NOT NULL,
   PRIMARY KEY ( entry_id ) );

BEGIN
   DECLARE @loop_counter INTEGER;
   SET @loop_counter = 1;
   WHILE @loop_counter <= 100 LOOP
      INSERT t ( partition_id, data ) VALUES ( 10, 'aaa' );
      INSERT t ( partition_id, data ) VALUES ( 20, 'bbb' );
      INSERT t ( partition_id, data ) VALUES ( 30, 'ccc' );
      SET @loop_counter = @loop_counter + 1;
   END LOOP;
   COMMIT;
END;

SELECT * FROM t ORDER BY entry_id;

partition_id,entry_id,data
10,1,'aaa'
20,2,'bbb'
30,3,'ccc'
10,4,'aaa'
20,5,'bbb'
30,6,'ccc'
10,7,'aaa'
20,8,'bbb'
30,9,'ccc'
10,10,'aaa'
20,11,'bbb'
30,12,'ccc'
10,13,'aaa'
20,14,'bbb'
30,15,'ccc'
10,16,'aaa'
20,17,'bbb'
30,18,'ccc'
10,19,'aaa'
20,20,'bbb'

...

20,290,'bbb'
30,291,'ccc'
10,292,'aaa'
20,293,'bbb'
30,294,'ccc'
10,295,'aaa'
20,296,'bbb'
30,297,'ccc'
10,298,'aaa'
20,299,'bbb'
30,300,'ccc'
Here's what the output is supposed to look like:
SELECT [the top 10 rows in each partition of t];

partition_id,entry_id,data
10,1,'aaa'
10,4,'aaa'
10,7,'aaa'
10,10,'aaa'
10,13,'aaa'
10,16,'aaa'
10,19,'aaa'
10,22,'aaa'
10,25,'aaa'
10,28,'aaa'
20,2,'bbb'
20,5,'bbb'
20,8,'bbb'
20,11,'bbb'
20,14,'bbb'
20,17,'bbb'
20,20,'bbb'
20,23,'bbb'
20,26,'bbb'
20,29,'bbb'
30,3,'ccc'
30,6,'ccc'
30,9,'ccc'
30,12,'ccc'
30,15,'ccc'
30,18,'ccc'
30,21,'ccc'
30,24,'ccc'
30,27,'ccc'
30,30,'ccc'
Presumably, the Old-School solution would use TOP 10, but how? It turns out that the OLAP RANK() solution used previously to solve the "first" problem works just as well on the "top 10":
SELECT ranked.partition_id,
       ranked.entry_id,
       ranked.data
  FROM ( SELECT *,
                RANK() OVER partition_window AS entry_rank
           FROM t
         WINDOW partition_window AS (
                   PARTITION BY t.partition_id
                   ORDER BY t.entry_id ) 
       ) AS ranked
 WHERE ranked.entry_rank <= 10
 ORDER BY ranked.partition_id,
       ranked.entry_id;
Logically speaking, the WINDOW clause on lines 7 through 9 subdivides all the rows of t into three partitions (partition_id = 10, 20, 30) then sorts those rows on entry_id within each partition: entry_id = 1, 4, 7... for partition_id = 10, entry_id = 2, 5, 8... for partition_id = 20, and so on.

So far, so good... but we still can't use TOP 10 because that doesn't work on windows or partitions, just the whole result set, and the result set returned by the SELECT on lines 4 through 10 contains all 300 rows...

...and besides, it doesn't have an ORDER BY, and you have to have an ORDER BY to use TOP 10. Sure, the WINDOW clause has an ORDER BY, but TOP doesn't do windows.

The solution lies with the RANK() function call on line 5: it adds a new column called "entry_rank" to the result set, and that column contains the value 1, 2, 3... for the rows in each partition; unlike TOP, RANK() does work with windows and partitions.

The WHERE clause on line 11 does what TOP couldn't do: selects only the top 10 rows in each partition, ordered by entry_id.

The ORDER BY clause on lines 12 and 13 re-sorts the whole thing in the final order for presentation... you pretty much always need an outer ORDER BY because there is no guarantee that the ordering imposed by inner ORDER BY clauses will be preserved, especially not an ORDER BY buried two levels down in a WINDOW clause.


For a real-world example, have a look at this Foxhound FAQ: How do I run adhoc queries on the Foxhound database?

1 comment:

Anonymous said...

alphas! alphas! alphas!
enter for bettas is prohibitted!