Tuesday, May 19, 2020

How to Find Duplicate Records in Table | Interview Question

When generally asked for duplicate, every time for most of person start thing about the group by, count clause. Which is correct, but these will not work every where. Count(*), Group by , having are the for beginner level. Student get learnt from basics.

On other hand, when we are working in IT industry then window function are usable in real time examples.

What i think Count(*), group by, having are usable for OLAP database, and window function are usable at OLTP level.

Lets see result with some example:

 XYZ   
 IDFirstName LastName MiddleName 
 1 Gurjeet Singh
 2 Harry Potter 
 3 Gurjeet  Singh  
 4 Prem Singh L
 5 Harry Potter 


When i use the query 

SELECT count(*), FirstName, LastName, MiddleName
FROM XYZ
GROUP BY  FirstName, LastName, MiddleName

So result will be 
CountFirstName LastName MiddleName 
 2 Gurjeet Singh
 2 Harry Potter 
 1 Prem Singh L

I am agree the query return the correct result, but it is providing the Count and Name of Duplicate person. Means Gurjeet is the person exists by 2 times in table.

But for some cases we also need ID along with the duplicate person name, which is not possible with the help of GROUP by. So here we have to use window function.

select * from (
SELECT 
    ID, 
    FirstName, 
    LastName, 
    MiddleName,
    row_number() over(partition by Firstname, LastName, MiddleName order by ID) rowNumber
FROM XYZ
) t where rowNumber > 1

So how query will work, below table represent how the row_number will be assign to records

 XYZ    
 IDFirstName LastName MiddleName  rn
 1 Gurjeet Singh
 1
 2 Harry Potter  1
 3 Gurjeet  Singh   2
 4 Prem Singh L 1
 5 Harry Potter 
 2

When we use the outer query, and apply the filter "where rowNumber > 1" below result set will be appear. Here main focus is column ID.
IDFirstName LastName MiddleName 
 3 Gurjeet Singh
 4 Prem Singh L
 5 Harry Potter 


Now in some cases we also need ID along with the all duplicate person name (only Duplicate), which is possible with the help of using Count(*) using Partition by clause.

select * from (
SELECT 
    ID, 
    FirstName, 
    LastName, 
    MiddleName,
    Count(*) over(partition by Firstname, LastName, MiddleName order by ID) TotalCount
FROM XYZ
) t where rowNumber > 1

XYZ    
 IDFirstName LastName MiddleName  TotalCount
 1 Gurjeet Singh
 2
 2 Harry Potter  2
 3 Gurjeet  Singh   2
 5 Harry Potter  2

No comments:

Post a Comment

web stats