Knowledge is Here...: How to Find Duplicate Records in Table

Tuesday, May 19, 2020

How to Find Duplicate Records in Table | Interview Question

When generally asked for duplicate, every time for most of person start thing about the group by, count clause. Which is correct, but these will not work every where. Count(*), Group by , having are the for beginner level. Student get learnt from basics.

On other hand, when we are working in IT industry then window function are usable in real time examples.

What i think Count(*), group by, having are usable for OLAP database, and window function are usable at OLTP level.

Lets see result with some example:

XYZ
ID	FirstName	LastName	MiddleName
1	Gurjeet	Singh
2	Harry	Potter
3	Gurjeet	Singh
4	Prem	Singh	L
5	Harry	Potter

When i use the query

SELECT count(*), FirstName, LastName, MiddleName
FROM XYZ
GROUP BY  FirstName, LastName, MiddleName

So result will be

Count	FirstName	LastName	MiddleName
2	Gurjeet	Singh
2	Harry	Potter
1	Prem	Singh	L

I am agree the query return the correct result, but it is providing the Count and Name of Duplicate person. Means Gurjeet is the person exists by 2 times in table.

But for some cases we also need ID along with the duplicate person name, which is not possible with the help of GROUP by. So here we have to use window function.

select * from (
SELECT 
    ID, 
    FirstName, 
    LastName, 
    MiddleName,
    row_number() over(partition by Firstname, LastName, MiddleName order by ID) rowNumber
FROM XYZ
) t where rowNumber > 1

So how query will work, below table represent how the row_number will be assign to records

XYZ
ID	FirstName	LastName	MiddleName	rn
1	Gurjeet	Singh		1
2	Harry	Potter		1
3	Gurjeet	Singh		2
4	Prem	Singh	L	1
5	Harry	Potter		2

When we use the outer query, and apply the filter "where rowNumber > 1" below result set will be appear. Here main focus is column ID.

ID	FirstName	LastName	MiddleName
3	Gurjeet	Singh
4	Prem	Singh	L
5	Harry	Potter

Now in some cases we also need ID along with the all duplicate person name (only Duplicate), which is possible with the help of using Count(*) using Partition by clause.

select * from (
SELECT 
    ID, 
    FirstName, 
    LastName, 
    MiddleName,
    Count(*) over(partition by Firstname, LastName, MiddleName order by ID) TotalCount
FROM XYZ
) t where rowNumber > 1

XYZ
ID	FirstName	LastName	MiddleName	TotalCount
1	Gurjeet	Singh		2
2	Harry	Potter		2
3	Gurjeet	Singh		2
5	Harry	Potter		2

Knowledge is Here...

Pages

Labels

Tuesday, May 19, 2020

How to Find Duplicate Records in Table | Interview Question

No comments:

Post a Comment