SQL Left Join first match only

SqlSql ServerTsqlJoinGreatest N-per-Group

Sql Problem Overview


I have a query against a large number of big tables (rows and columns) with a number of joins, however one of tables has some duplicate rows of data causing issues for my query. Since this is a read only realtime feed from another department I can't fix that data, however I am trying to prevent issues in my query from it.

Given that, I need to add this crap data as a left join to my good query. The data set looks like:

IDNo	FirstName	LastName	...
-------------------------------------------
uqx		bob		smith
abc		john		willis
ABC		john		willis
aBc		john		willis
WTF		jeff		bridges
sss		bill		doe
ere		sally		abby
wtf		jeff		bridges
...

(about 2 dozen columns, and 100K rows)

My first instinct was to perform a distinct gave me about 80K rows:

SELECT DISTINCT P.IDNo
FROM people P

But when I try the following, I get all the rows back:

SELECT DISTINCT P.*
FROM people P

OR

SELECT 
    DISTINCT(P.IDNo) AS IDNoUnq 
    ,P.FirstName
    ,P.LastName
    ...etc.    
FROM people P

I then thought I would do a FIRST() aggregate function on all the columns, however that feels wrong too. Syntactically am I doing something wrong here?

Update: Just wanted to note: These records are duplicates based on a non-key / non-indexed field of ID listed above. The ID is a text field which although has the same value, it is a different case than the other data causing the issue.

Sql Solutions


Solution 1 - Sql

distinct is not a function. It always operates on all columns of the select list.

Your problem is a typical "greatest N per group" problem which can easily be solved using a window function:

select ...
from (
  select IDNo,
         FirstName,
         LastName,
         ....,
         row_number() over (partition by lower(idno) order by firstname) as rn 
  from people 
) t
where rn = 1;

Using the order by clause you can select which of the duplicates you want to pick.

The above can be used in a left join, see below:

select ...
from x
  left join (
    select IDNo,
           FirstName,
           LastName,
           ....,
           row_number() over (partition by lower(idno) order by firstname) as rn 
    from people 
  ) p on p.idno = x.idno and p.rn = 1
where ...

Solution 2 - Sql

Add an identity column (PeopleID) and then use a correlated subquery to return the first value for each value.

SELECT *
FROM People p
WHERE PeopleID = (
    SELECT MIN(PeopleID) 
    FROM People 
    WHERE IDNo = p.IDNo
)

Solution 3 - Sql

After careful consideration this dillema has a few different solutions:

Aggregate Everything Use an aggregate on each column to get the biggest or smallest field value. This is what I am doing since it takes 2 partially filled out records and "merges" the data.

http://sqlfiddle.com/#!3/59cde/1

SELECT
  UPPER(IDNo) AS user_id
, MAX(FirstName) AS name_first
, MAX(LastName) AS name_last
, MAX(entry) AS row_num
FROM people P
GROUP BY 
  IDNo

Get First (or Last record)

http://sqlfiddle.com/#!3/59cde/23

-- ------------------------------------------------------
-- Notes
-- entry: Auto-Number primary key some sort of unique PK is required for this method
-- IDNo:  Should be primary key in feed, but is not, we are making an upper case version
-- This gets the first entry to get last entry, change MIN() to MAX()
-- ------------------------------------------------------

SELECT 
   PC.user_id
  ,PData.FirstName
  ,PData.LastName
  ,PData.entry
FROM (
  SELECT 
      P2.user_id
     ,MIN(P2.entry) AS rownum
  FROM (
    SELECT
        UPPER(P.IDNo) AS user_id 
      , P.entry 
    FROM people P
  ) AS P2
  GROUP BY 
    P2.user_id
) AS PC
LEFT JOIN people PData
ON PData.entry = PC.rownum
ORDER BY 
   PData.entry

Solution 4 - Sql

Turns out I was doing it wrong, I needed to perform a nested select first of just the important columns, and do a distinct select off that to prevent trash columns of 'unique' data from corrupting my good data. The following appears to have resolved the issue... but I will try on the full dataset later.

SELECT DISTINCT P2.*
FROM (
  SELECT
      IDNo
    , FirstName
    , LastName
  FROM people P
) P2

Here is some play data as requested: http://sqlfiddle.com/#!3/050e0d/3

CREATE TABLE people
(
       [entry] int
     , [IDNo] varchar(3)
     , [FirstName] varchar(5)
     , [LastName] varchar(7)
);

INSERT INTO people
	(entry,[IDNo], [FirstName], [LastName])
VALUES
	(1,'uqx', 'bob', 'smith'),
	(2,'abc', 'john', 'willis'),
	(3,'ABC', 'john', 'willis'),
	(4,'aBc', 'john', 'willis'),
	(5,'WTF', 'jeff', 'bridges'),
	(6,'Sss', 'bill', 'doe'),
	(7,'sSs', 'bill', 'doe'),
	(8,'ssS', 'bill', 'doe'),
	(9,'ere', 'sally', 'abby'),
	(10,'wtf', 'jeff', 'bridges')
;

Solution 5 - Sql

Try this

 SELECT *
 FROM people P 
 where P.IDNo in (SELECT DISTINCT IDNo
              FROM people)

Solution 6 - Sql

Depending on the nature of the duplicate rows, it looks like all you want is to have case-sensitivity on those columns. Setting the collation on these columns should be what you're after:

SELECT DISTINCT p.IDNO COLLATE SQL_Latin1_General_CP1_CI_AS, p.FirstName COLLATE SQL_Latin1_General_CP1_CI_AS, p.LastName COLLATE SQL_Latin1_General_CP1_CI_AS
FROM people P

http://msdn.microsoft.com/en-us/library/ms184391.aspx

Solution 7 - Sql

Use Cross Apply or Outer Apply, this way you can limit the amount of data to be joined from the table with the duplicates to the first hit.

Select 
    x.*,
    c.*
from 
    x
Cross Apply 
    (
        Select 
            Top (1)
            IDNo,
            FirstName,
            LastName,
            ...., 
        from 
            people As p
        where 
            p.idno = x.idno
        Order By 
            p.idno //unnecessary if you don't need a specific match based on order
    ) As c

Cross Apply behaves like an inner join, Outer Apply like a left join

SQL Server CROSS APPLY and OUTER APPLY

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDaveView Question on Stackoverflow
Solution 1 - Sqla_horse_with_no_nameView Answer on Stackoverflow
Solution 2 - SqlT8RBView Answer on Stackoverflow
Solution 3 - SqlDaveView Answer on Stackoverflow
Solution 4 - SqlDaveView Answer on Stackoverflow
Solution 5 - SqlRamppy DumppyView Answer on Stackoverflow
Solution 6 - SqlFiddlesView Answer on Stackoverflow
Solution 7 - SqlFrank RopenView Answer on Stackoverflow