Do all columns in a SELECT list have to appear in a GROUP BY clause

Sql Problem Overview

My lecturer stated:

All column names in SELECT list must appear in GROUP BY clause unless name is used only in an aggregate function

I'm just wanting some confirmation of this as I cannot think of a logical explanation as to why it should be true...

Sql Solutions

Solution 1 - Sql

Imagine the following:

    A    B    C
   Cat   10   False
   Dog   25   True
   Dog   20   False
   Cat   5    False

If you select A, B and Group By Only A - what would your output be? You'd only have two rows (or tuples) because you have two values for A - but how does it display B?

If you group by A, B, you'd get four rows, no problems there. If you group by A and perform a function on B - like SUM(B) then you get two rows again:

    Cat    15
    Dog    45

But if you select A, B and only group by A - it doesn't know what to do. Truthfully, I believe there are some databases out there that will select a random value for B in that case and I believe there are some that will give you an error message.

Solution 2 - Sql

That's historically true. Omitting unaggregated columns leads to indeterminate behavior. SQL aims at fully determinate behavior.

But SQL standards have recently changed to let you omit from the GROUP BY clause columns that are functionally dependent on columns that are in the GROUP BY. PostgreSQL follows the more recent SQL standards. (It's not the only one.) Behavior is still fully determinate.

create table a (
  a_id integer primary key,
  xfr_date date not null
);

create table b (
  a_id integer not null references a (a_id),
  recd_date date not null,
  units_recd integer not null 
    check (units_recd >= 0),
  primary key (a_id, recd_date)
);

select a.a_id, a.xfr_date, sum(b.units_recd)
from a
inner join b on a.a_id = b.a_id
group by a.a_id; -- The column a.xfr_date is functionally dependent 
                 -- on a.a_id; it doesn't have to appear in the 
                 -- GROUP BY clause.

The notable deviation from from SQL standards is MySQL. It lets you omit just about everything from the GROUP BY. But that design choice makes its behavior indeterminate when you omit columns that are in the SELECT list.

Solution 3 - Sql

Actually, in MySQL you don't have to group by all columns. You can just group by whatever columns you want. The problem is, it will just pull a random value (from the set of available rows in the group) for the fields which aren't in the group by. If you know that you are grouping by something that is a unique key, there's no point in grouping by the rest of the fields, as they will already all have the same value anyway. It can actually speed it up to not have to group by every field when it is completely unnecessary.

Solution 4 - Sql

If you are grouping on something you cannot see the individual values of non-grouped columns because there may be more than one value within each group. All you can do is report on aggregate functions (sum, count, min & etc) -- these are able to combine the multiple values into a single cell in the result.

Solution 5 - Sql

So the simple answer is: It depends. Mysql allows it, vertica doesn't.

There is actually a valid use case for omitting and that is when you are already selecting say with MIN().

Here is an actual example for event tracking. Imaging you have credit and purchase events.

For simplicity we say a=credit, b,c,d are some kind of purchase event, and time is tracked with a running number. Now you want to find the date of the first purchase after each credit. We also happen to have only one customer 0:

create table events (user_id int ,created_at int, event varchar(255));
insert into events values (0,0, 'a');
insert into events values (0,1, 'b');
insert into events values (0,2, 'c');
insert into events values (0,3, 'a');
insert into events values (0,4, 'c');
insert into events values (0,5, 'b');
insert into events values (0,6, 'a');
insert into events values (0,7, 'a');
insert into events values (0,8, 'd');

mysql> SELECT user_id, MAX(purchased) AS purchased, spent, event FROM (SELECT e1.User_ID AS user_id, e1.created_at AS purchased, MIN(e2.created_at) AS spent, e2.event AS event FROM events e1, events e2 WHERE e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND e1.Event = 'a' AND e2.Event != 'a' GROUP BY e1.user_id, e1.created_at) e3 GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
|       0 |         0 |     1 | b     |
|       0 |         3 |     4 | c     |
|       0 |         7 |     8 | d     |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)

looks good in mysql, does not work in vertica:

ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function

if I omit the event column, it works in both, but I do actually want to know what specific value event had for the row that min selected.

So my answer ends with a request for comment :) Any ideas?

Solution 6 - Sql

There are exceptions as noted by Sam Saffron but generally what your lecturer said is true.

If I select 3 columns and group by 2 what should the RDBMS do with the 3rd column?

The developers of the RDBMS may make a decision of how to handle the extra colum (as it appears MySQL's developers have) but is it the decision I would have made or the one I want when writing the select? Will the decision always be valid? I certainly prefer the Oracle-like approach of forcing me to explicitly state what should happen.

If I select 3 columns and group by 2 should the RDBS group by all 3, pick a random value from the 3rd, the biggest or littlest, the most common?

Solution 7 - Sql

This is an answer for Michael Will's example/question.

SELECT 
    e3.user_id,
    MAX(e3.purchased) AS purchased, 
    e3.spent, 
    e.event
FROM 
    events e
INNER JOIN
(SELECT 
    e1.user_id AS user_id, 
    MIN(e1.created_at) as spent,
    e2.created_at as purchased
 FROM
    events e1
 INNER JOIN
    (SELECT e.user_id, e.created_at from events e WHERE e.event = 'a') e2   
 ON e1.user_id = e2.user_id 
 AND e1.created_at >= e2.created_at 
 AND e1.event != 'a'
 GROUP BY e1.User_ID, e2.created_at
) e3 
ON e.user_id = e3.user_id AND e.created_at = e3.spent
GROUP BY e3.user_id, e3.spent, e.event;

Content Type	Original Author	Original Content on Stackoverflow
Question	user559142	View Question on Stackoverflow
Solution 1 - Sql	Rob P.	View Answer on Stackoverflow
Solution 2 - Sql	Mike Sherrill 'Cat Recall'	View Answer on Stackoverflow
Solution 3 - Sql	Kibbee	View Answer on Stackoverflow
Solution 4 - Sql	hsmiths	View Answer on Stackoverflow
Solution 5 - Sql	Michael Will	View Answer on Stackoverflow
Solution 6 - Sql	Karl	View Answer on Stackoverflow
Solution 7 - Sql	user3133172	View Answer on Stackoverflow