In some situations, we need to get an overview of the data present in a table. Or perhaps we want to check if there are any duplicate values in the table that we do not want. How can we retrieve a list of such duplicate records?
How to Identify Duplicate Rows in an SQL Table?
I will demonstrate the task with a simple example. Let’s assume we have a table of customers named DimCustomer (see the image below). Our customer database has around 20,000 records, and we want to verify if there are customers in the database who appear more than once because we want to record each customer only once.
First, we need to decide which values we will use to detect duplicates. We cannot do it, for example, based on the ‘Name’ column because many people can have the same name.
Testing for duplicates based on email addresses is also not suitable because the same person can be in the database twice with different emails, and we would not detect the duplication. The most reliable way would be to test for duplicates using a unique identifier for a person, such as a social security number.
However, in our table, we do not have that. So, we need to come up with another approach. Let’s say, for illustrative purposes in our case, it should be reasonably reliable to say that unique combinations of FirstName, LastName, Gender, and BirthDate should exist.
SELECT
[FirstName]
,[LastName]
,[BirthDate]
,[Gender]
,COUNT(*)
FROM [AdventureworksDW2016CTP3].[dbo].[DimCustomer]
GROUP BY
[FirstName]
,[LastName]
,[BirthDate]
,[Gender]
HAVING COUNT(*) >1
The script is quite simple. We retrieve all records where the count of identical combinations of FirstName, LastName, BirthDate, and Gender is greater than 1. The result shows 5 records, and in all cases, these individuals appear twice in the DimCustomer table (duplicate count = 2).
“