SQL Duplicates in a Table with an Example - How to Retrieve a List of Duplicate Values?

28. 1. 2022
Ing. Jan Zedníček - Data Engineer & Controlling

Sometimes we need an overview of the data in a table, or we want to check whether it contains unwanted duplicate values. How can we retrieve a list of such duplicate records?

How to Identify Duplicate Rows in an SQL Table?

I will demonstrate the task with a simple example. Let’s assume we have a table of customers named DimCustomer (see the image below). Our customer database has around 20,000 records, and because each customer should be recorded only once, we want to verify whether any customer appears more than once.

First, we need to decide which values we will use to detect duplicates. We cannot rely on the name alone (FirstName and LastName), for example, because many people share the same name.

Testing for duplicates based on email addresses is also not suitable because the same person can be in the database twice with different emails, and we would not detect the duplication. The most reliable way would be to test for duplicates using a unique identifier for a person, such as a social security number.

However, our table does not contain such an identifier, so we need another approach. For illustration, let’s assume it is reasonably reliable to require that each combination of FirstName, LastName, Gender, and BirthDate is unique.

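-- Combinations of FirstName, LastName, BirthDate and Gender that occur more than once
SELECT
     [FirstName]
    ,[LastName]
    ,[BirthDate]
    ,[Gender]
    ,COUNT(*) AS [DuplicateCount]  -- number of rows sharing this combination
FROM [AdventureworksDW2016CTP3].[dbo].[DimCustomer]
GROUP BY
     [FirstName]
    ,[LastName]
    ,[BirthDate]
    ,[Gender]
HAVING COUNT(*) > 1;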

The script is quite simple. It returns every combination of FirstName, LastName, BirthDate, and Gender that occurs more than once. The result shows 5 rows, and in all cases, these individuals appear twice in the DimCustomer table (duplicate count = 2).
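
The grouped query tells us which combinations are duplicated, but not which individual rows are involved. One possible way to list the rows themselves is a window function, sketched below. It is only an illustration: the CustomerKey and EmailAddress columns are assumed to exist as in the standard AdventureWorksDW DimCustomer table, and the CTE name CustomerCounts and the DuplicateCount alias are made up for this example.

```sql
-- Sketch: list every row that belongs to a duplicate group.
-- Assumption: CustomerKey and EmailAddress exist as in the standard
-- AdventureWorksDW DimCustomer table; CustomerCounts is a hypothetical name.
WITH CustomerCounts AS (
    SELECT
         [CustomerKey]
        ,[FirstName]
        ,[LastName]
        ,[BirthDate]
        ,[Gender]
        ,[EmailAddress]
        -- Count the rows sharing the same combination, without collapsing them
        ,COUNT(*) OVER (
            PARTITION BY [FirstName], [LastName], [BirthDate], [Gender]
         ) AS [DuplicateCount]
    FROM [AdventureworksDW2016CTP3].[dbo].[DimCustomer]
)
SELECT *
FROM CustomerCounts
WHERE [DuplicateCount] > 1
-- Keep rows of the same duplicate group next to each other
ORDER BY [LastName], [FirstName], [BirthDate], [CustomerKey];
```

Unlike GROUP BY, the OVER (PARTITION BY ...) clause keeps one output row per customer record, so the key and email of each suspected duplicate can be compared side by side before deciding which record to keep.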


