Get last distinct set of records

MysqlDistinct

Mysql Problem Overview


I have a database table containing the following columns:

id   code   value   datetime   timestamp

In this table the only unique values reside in id i.e. primary key.

I want to retrieve the last distinct set of records in this table based on the datetime value. For example, let's say below is my table

id   code   value   datetime               timestamp
1    1023   23.56   2011-04-05 14:54:52    1234223421
2    1024   23.56   2011-04-05 14:55:52    1234223423
3    1025   23.56   2011-04-05 14:56:52    1234223424
4    1023   23.56   2011-04-05 14:57:52    1234223425
5    1025   23.56   2011-04-05 14:58:52    1234223426
6    1025   23.56   2011-04-05 14:59:52    1234223427
7    1024   23.56   2011-04-05 15:00:12    1234223428
8    1026   23.56   2011-04-05 15:01:14    1234223429
9    1025   23.56   2011-04-05 15:02:22    1234223430

I want to retrieve the records with IDs 4, 7, 8, and 9 i.e. the last set of records with distinct codes (based on datetime value). What I have highlighted is simply an example of what I'm trying to achieve, as this table is going to eventually contain millions of records, and hundreds of individual code values.

What SQL statement can I use to achieve this? I can't seem to get it done with a single SQL statement. My database is MySQL 5.

Mysql Solutions


Solution 1 - Mysql

This should work for you.

 SELECT * 
 FROM [tableName] 
 WHERE id IN (SELECT MAX(id) FROM [tableName] GROUP BY code)

If id is AUTO_INCREMENT, there's no need to worry about the datetime which is far more expensive to compute, as the most recent datetime will also have the highest id.

Update: From a performance standpoint, make sure the id and code columns are indexed when dealing with a large number of records. If id is the primary key, this is built in, but you may need to add a non-clustered index covering code and id.

Solution 2 - Mysql

Try this:

SELECT * 
  FROM <YOUR_TABLE>
 WHERE (code, datetime, timestamp) IN
 (
   SELECT code, MAX(datetime), MAX(timestamp)
	 FROM <YOUR_TABLE>
	GROUP BY code
 )

Solution 3 - Mysql

It's and old post, but testing @smdrager answer with large tables was very slow. My fix to this was using "inner join" instead of "where in".

SELECT * 
 FROM [tableName] as t1
 INNER JOIN (SELECT MAX(id) as id FROM [tableName] GROUP BY code) as t2
 ON t1.id = t2.id

This worked really fast.

Solution 4 - Mysql

I'll try something like this :

select * from table
where id in (
    select id
    from table
    group by code
    having datetime = max(datetime)
)

(disclaimer: this is not tested)

If the row with the bigger datetime also have the bigger id, the solution proposed by smdrager is quicker.

Solution 5 - Mysql

Looks like all existing answers suggest to do GROUP BY code on the whole table. When it's logically correct, in reality this query will go through the whole(!) table (use EXPLAIN to make sure). In my case, I have less than 500k of rows in the table and executing ...GROUP BY codetakes 0.3 seconds which is absolutely not acceptable.

However I can use knowledge of my data here (read as "show last comments for posts"):

  • I need to select just top-20 records
  • Amount of records with same code across last X records is relatively small (~uniform distribution of comments across posts, there are no "viral" post which got all the recent comments)
  • Total amount of records >> amount of available code's >> amount of "top" records you want to get

By experimenting with numbers I found out that I can always find 20 different code if I select just last 50 records. And in this case following query works (keeping in mind @smdrager comment about high probability to use id instead of datetime)

SELECT id, code
FROM tablename
ORDER BY id DESC 
LIMIT 50

Selecting just last 50 entries is super quick, because it doesn't need to check the whole table. And the rest is to select top-20 with distinct code out of those 50 entries.

Obviously, queries on the set of 50 (100, 500) elements are significantly faster than on the whole table with hundreds of thousands entries.

Raw SQL "Postprocessing"

SELECT MAX(id) as id, code FROM 
    (SELECT id, code
     FROM tablename
     ORDER BY id DESC 
     LIMIT 50) AS nested 
GROUP BY code
ORDER BY id DESC 
LIMIT 20

This will give you list of id's really quick and if you want to perform additional JOINs, put this query as yet another nested query and perform all joins on it.

Backend-side "Postprocessing"

And after that you need to process the data in your programming language to include to the final set only the records with distinct code.

Some kind of Python pseudocode:

records = select_simple_top_records(50)
added_codes = set()
top_records = []
for record in records:
    # If record for this code was already found before
    # Note: this is not optimal, better to use structure allowing O(1) search and insert
    if record['code'] in added_codes:
        continue
    # Save record
    top_records.append(record)
    added_codes.add(record['code'])
    # If we found all top-20 required, finish
    if len(top_records) >= 20:
        break

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionObiHillView Question on Stackoverflow
Solution 1 - MysqlsmdragerView Answer on Stackoverflow
Solution 2 - MysqlChanduView Answer on Stackoverflow
Solution 3 - MysqleducoloView Answer on Stackoverflow
Solution 4 - MysqlkrtekView Answer on Stackoverflow
Solution 5 - MysqlThe GodfatherView Answer on Stackoverflow