Partition Function COUNT() OVER possible using DISTINCT

SqlSql ServerTsqlSql Server-2008-R2Sql Server-2014

Sql Problem Overview


I'm trying to write the following in order to get a running total of distinct NumUsers, like so:

NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])

Management studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it won't be a distinct count.

DISTINCT does not appear to be possible within the partition functions. How do I go about finding the distinct count? Do I use a more traditional method such as a correlated subquery?

Looking into this a bit further, maybe these OVER functions work differently to Oracle in the way that they cannot be used in SQL-Server to calculate running totals.

I've added a live example here on SQLfiddle where I attempt to use a partition function to calculate a running total.

Sql Solutions


Solution 1 - Sql

There is a very simple solution using dense_rank()

dense_rank() over (partition by [Mth] order by [UserAccountKey]) 
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc) 
- 1

This will give you exactly what you were asking for: The number of distinct UserAccountKeys within each month.

Solution 2 - Sql

Necromancing:

It's relativiely simple to emulate a COUNT DISTINCT over PARTITION BY with MAX via DENSE_RANK:

;WITH baseTable AS
(
    SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr 
    FROM baseTable
)
SELECT
     RM
    ,ADR
    
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1 
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2 
    -- Not supported
    --,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
    ,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu 
FROM CTE

Note:
This assumes the fields in question are NON-nullable fields.
If there is one or more NULL-entries in the fields, you need to subtract 1.

Solution 3 - Sql

I think the only way of doing this in SQL-Server 2008R2 is to use a correlated subquery, or an outer apply:

SELECT	datekey,
		COALESCE(RunningTotal, 0) AS RunningTotal,
		COALESCE(RunningCount, 0) AS RunningCount,
		COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM	document
		OUTER APPLY
		(	SELECT	SUM(Amount) AS RunningTotal,
					COUNT(1) AS RunningCount,
					COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
			FROM	Document d2
			WHERE	d2.DateKey <= document.DateKey
		) rt;

This can be done in http://rextester.com/UZD47460">SQL-Server 2012 using the syntax you have suggested:

SELECT	datekey,
		SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM	document

However, use of DISTINCT is still not allowed, so if DISTINCT is required and/or if upgrading isn't an option then I think OUTER APPLY is your best option

Solution 4 - Sql

I use a solution that is similar to that of [David][1] above, but with an additional twist if some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.

-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
) 
+ dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1

[An SQL Fiddle with an extended example can be found here.][3]

[1]: https://stackoverflow.com/users/3409975/david "David" [2]: https://stackoverflow.com/users/3038209/bf2020 "bf2020" [3]: http://sqlfiddle.com/#!6/6e5a8/4 "SQL Fiddle"

Solution 5 - Sql

There is a solution in simple SQL:

SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users

=>

SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
    SELECT user, MIN(time) AS time
    FROM users
    GROUP BY user
) t

Solution 6 - Sql

I wandered in here with essentially the same question as whytheq and found David’s solution, but then had to review my old self-tutorial notes regarding DENSE_RANK because I use it so rarely: why DENSE_RANK instead of RANK or ROW_NUMBER, and how does it actually work? In the process, I updated that tutorial to include my version of David’s solution for this particular problem, and then thought it might be helpful for SQL newbies (or others like me who forget stuff).

The whole tutorial text can be copy/pasted into a query editor and then each example query can be (separately) uncommented and run, to see their respective results. (By default, the solution to this problem is uncommented at the bottom.) Or, each example can be copied separately into their own query-edit instance but the TBLx CTE must be included with each.

--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES 
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     8  ))

WITH /* SQL-Server version */
TBLx    (Col_A, Col_B) AS
  (SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     8)

/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/

  --SELECT Col_A, Col_B,
  --  ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
  --  RANK() OVER(PARTITION BY Col_A ORDER BY Col_B)       AS RANK_,
  --  DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
  --FROM TBLx

  /* RESULTS:
    Col_A  Col_B  ROW_NUMBER_  RANK_  DENSE_RANK_
      7      7        1          1        1
      7      7        2          1        1
      7      7        3          1        1
      7      8        4          4        2

     ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
                 That is, it’s an order-value (based on "sort" order) but makes no other distinction.
                 
           RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
                 which is *unique* with regard to the others.
                 That is, each identical row is ranked by the rank-order of the first row-instance of that
                 (identical) value-set.
                 
     DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
                 assigned a value of 2.
                 That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
  */

/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/

--  SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
--                  -- the calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT,
--                  -- its value for each original-data row-type would just be replicated for each of those rows.
--                  
--    Col_A, Col_B,                
--    DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
--  FROM TBLx

  /* RESULTS:
    Col_A  Col_B  DISTINCT_ROWTYPE_COUNT_
      7      7            1
      7      8            2
  */

/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/

--  SELECT
--    Col_A, Col_B,
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--    
--    -- Adding the above cases together and subtracting one gives the same total count for on each resulting row:
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
--       +
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
--      - 1   /* (Because DENSE_RANK values are one-based) */
--      AS ROW_TYPES_COUNT_
--  FROM TBLx

  /* RESULTS:
    COL_A  COL_B  ROW_TYPES_COUNT_DESC_  ROW_TYPES_COUNT_ASC_  ROW_TYPES_COUNT_
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      8            1                     2                    2
      
  */

/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/

  SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
                  -- calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT, that
                  -- value would just be replicated for each original-data row.
                  
--    Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
                    -- must be excluded because their respective differing row-values will defeat the purpose of the
                    -- DISTINCT operator, above.
                   
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
       +
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
      - 1   /* (Because DENSE_RANK values are one-based) */
      AS ROW_TYPES_COUNT_
  FROM TBLx
  
  /* RESULTS:

    ROW_TYPES_COUNT_
          2
  */

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionwhytheqView Question on Stackoverflow
Solution 1 - SqlDavidView Answer on Stackoverflow
Solution 2 - SqlStefan SteigerView Answer on Stackoverflow
Solution 3 - SqlGarethDView Answer on Stackoverflow
Solution 4 - SqlLars RönnbäckView Answer on Stackoverflow
Solution 5 - Sqlk06aView Answer on Stackoverflow
Solution 6 - SqlpstratonView Answer on Stackoverflow