
How To Find Duplicate Values In A Table Using SQL – 2 Best Ways


Before jumping directly into the queries, you need to define the criteria for finding duplicate records in a table. There can be situations where certain values in a single column are duplicated, or where the entire record, i.e. the values in all the columns of a particular row, is duplicated in the table.

You’ll explore both possibilities and the ways to deal with such duplicated records in this quick read.

The simplest way to identify duplicated records is to count how many times each record appears in the table. Any record that appears more than once is duplicated.

The GROUP BY clause is widely used in SQL for data aggregation. It lets you group records based on the values in one or more columns and get aggregated values, such as the count or sum of other columns.

Keeping this in mind, let’s explore how you can find duplicated values in a single column.

Find Duplicate Values In A Single Column

There can be situations where duplicate values are present in only a single column. The reason for such duplicate records can be as simple as human error while entering data or updating the database.

Let’s take an example from the orders table and find out which OrderID values are duplicated. Since you need to count how many times each OrderID appears in the table, you should group the records by OrderID as shown below.

SELECT OrderID
, COUNT(*) as occurrences
FROM orders
GROUP BY OrderID
Records that appeared more than once in the table | Image by Author

The highlighted records (OrderIDs) occurred in the dataset more than once, i.e. they are duplicated.

However, you don’t need to create a separate column as seen in the above picture. You can directly get the duplicated OrderIDs using the HAVING clause after GROUP BY, as shown below.

SELECT OrderID
FROM orders
GROUP BY OrderID
HAVING COUNT(*) > 1;
Duplicated records | Image by Author

So, you get only the duplicated OrderIDs, which are the same as the highlighted ones in the above table.
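To make the two queries above concrete, here is a minimal sketch that runs them against a small in-memory SQLite database using Python’s built-in sqlite3 module. The sample rows and the column types are made up for illustration; only the table and column names come from the example above.

```python
import sqlite3

# Hypothetical sample data: OrderID 101 appears three times, 103 twice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (OrderID INTEGER, Amount INTEGER, Product_Category TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (101, 2, "Office"),
        (101, 2, "Office"),
        (101, 5, "Toys"),
        (102, 1, "Books"),
        (103, 4, "Garden"),
        (103, 4, "Garden"),
    ],
)

# Step 1: count how many times each OrderID appears.
counts = conn.execute(
    """
    SELECT OrderID, COUNT(*) AS occurrences
    FROM orders
    GROUP BY OrderID
    ORDER BY OrderID
    """
).fetchall()

# Step 2: keep only the OrderIDs that appear more than once.
duplicated_ids = conn.execute(
    """
    SELECT OrderID
    FROM orders
    GROUP BY OrderID
    HAVING COUNT(*) > 1
    ORDER BY OrderID
    """
).fetchall()

print(counts)          # [(101, 3), (102, 1), (103, 2)]
print(duplicated_ids)  # [(101,), (103,)]
```

The ORDER BY is added here only to make the printed output deterministic; it does not change which OrderIDs are reported.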

Similarly, there can be situations where the values in multiple columns of a row are duplicated in the table.

Find Duplicate Values In Multiple Columns

Even when the entire row is duplicated in the table, the logic remains the same; only the columns you mention in the GROUP BY clause change.

Instead of grouping the records by a single column, here you need to group the records by multiple columns.

Let me show you how.

Suppose you want to see the records where a combination of OrderID, Amount, and Product_Category appeared multiple times in the table.

SELECT OrderID
, Amount
, Product_Category
, COUNT(*) as occurrences
FROM orders
GROUP BY OrderID
, Amount
, Product_Category
Find duplicated records in multiple columns | Image by Author

In this way, you can see that the highlighted combinations of values in the columns OrderID, Amount, and Product_Category occurred in the table more than once.

Again, you simply need to add HAVING COUNT(*) > 1 at the end of the query to retrieve these duplicated records.
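The multi-column query with the HAVING clause appended might look like the sketch below. It uses Python’s built-in sqlite3 module with an in-memory database; the sample rows and column types are hypothetical and serve only to make the result reproducible.

```python
import sqlite3

# Hypothetical sample data: two combinations appear twice each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (OrderID INTEGER, Amount INTEGER, Product_Category TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (101, 2, "Office"),
        (101, 2, "Office"),
        (101, 5, "Toys"),    # same OrderID, but a different combination
        (102, 1, "Books"),
        (103, 4, "Garden"),
        (103, 4, "Garden"),
    ],
)

# Group by all three columns and keep combinations seen more than once.
duplicated_combinations = conn.execute(
    """
    SELECT OrderID
         , Amount
         , Product_Category
         , COUNT(*) AS occurrences
    FROM orders
    GROUP BY OrderID
           , Amount
           , Product_Category
    HAVING COUNT(*) > 1
    ORDER BY OrderID
    """
).fetchall()

print(duplicated_combinations)
# [(101, 2, 'Office', 2), (103, 4, 'Garden', 2)]
```

Note that the row (101, 5, 'Toys') is not reported: its OrderID repeats, but the full combination of the three columns does not.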

As the process of finding duplicates depends on counting the number of times a record appears in the table, you can use the window function ROW_NUMBER as well.

The window function ROW_NUMBER() assigns a unique sequential number to each record in the window defined using the PARTITION BY clause.

So, you can define the window using the same columns in which you expect to have duplicated values. The first occurrence of a record gets row number 1, and if a record appears multiple times, each additional copy is assigned a row number greater than 1.

Let’s continue with the same example.

To get the records where a combination of OrderID, Amount, and Product_Category appeared multiple times in the table, you need to define a window using these columns in the PARTITION BY clause as shown below.

SELECT OrderID
, Amount
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Amount, Product_Category ORDER BY OrderID) AS row_num
FROM orders
Find duplicates in the table using ROW_NUMBER() in SQL | Image by Author

This is how you’ll get all the records and the corresponding row numbers partitioned by a given set of columns. So, all the highlighted records where the row number is 2 are duplicated records.
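To see the numbering in action, the sketch below runs the ROW_NUMBER() query on a few hypothetical rows with Python’s built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were introduced; the sample data is invented for illustration.

```python
import sqlite3

# Hypothetical sample data: one combination appears three times.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (OrderID INTEGER, Amount INTEGER, Product_Category TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (101, 2, "Office"),
        (101, 2, "Office"),
        (101, 2, "Office"),
        (102, 1, "Books"),
    ],
)

# Number the rows inside each (OrderID, Amount, Product_Category) partition.
numbered = conn.execute(
    """
    SELECT OrderID
         , Amount
         , Product_Category
         , ROW_NUMBER() OVER (
               PARTITION BY OrderID, Amount, Product_Category
               ORDER BY OrderID
           ) AS row_num
    FROM orders
    ORDER BY OrderID, row_num
    """
).fetchall()

# Every row numbered above 1 is an extra copy of an earlier row.
extra_copies = [row for row in numbered if row[3] > 1]

for row in numbered:
    print(row)
# (101, 2, 'Office', 1)
# (101, 2, 'Office', 2)
# (101, 2, 'Office', 3)
# (102, 1, 'Books', 1)
```

A record that appears three times gets row numbers 1, 2, and 3, so filtering on row numbers greater than 1 returns one row per extra copy, whereas the GROUP BY and HAVING approach returns one row per duplicated combination.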

You can pass the entire query above as a sub-query to the outer SELECT statement below to get only the duplicated records.

SELECT OrderID
, Amount
, Product_Category
FROM (
SELECT OrderID
, Amount
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Amount, Product_Category ORDER BY OrderID) AS row_num
FROM orders
) AS subquery
WHERE row_num > 1;
Get duplicated records in SQL | Image by Author

Alternatively, if you don’t want to use the sub-query, you can create a CTE and get the data from that CTE using another query as shown below.

WITH temp_orders AS
(
SELECT OrderID
, Amount
, Product_Category
, ROW_NUMBER() OVER (PARTITION BY OrderID, Amount, Product_Category ORDER BY OrderID) AS row_num
FROM orders
)

SELECT OrderID
, Amount
, Product_Category
FROM temp_orders
WHERE row_num > 1;

This query will also return exactly the same output. So the choice is yours.
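As a quick check of that claim, the following sketch runs the sub-query version and the CTE version side by side on hypothetical data and compares the results. It uses Python’s built-in sqlite3 module and assumes SQLite 3.25 or newer for window function support; an ORDER BY is added to both queries only so the comparison does not depend on row order.

```python
import sqlite3

# Hypothetical sample data: two combinations appear twice each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (OrderID INTEGER, Amount INTEGER, Product_Category TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (101, 2, "Office"),
        (101, 2, "Office"),
        (102, 1, "Books"),
        (103, 4, "Garden"),
        (103, 4, "Garden"),
    ],
)

subquery_sql = """
    SELECT OrderID, Amount, Product_Category
    FROM (
        SELECT OrderID, Amount, Product_Category
             , ROW_NUMBER() OVER (
                   PARTITION BY OrderID, Amount, Product_Category
                   ORDER BY OrderID
               ) AS row_num
        FROM orders
    ) AS subquery
    WHERE row_num > 1
    ORDER BY OrderID
"""

cte_sql = """
    WITH temp_orders AS (
        SELECT OrderID, Amount, Product_Category
             , ROW_NUMBER() OVER (
                   PARTITION BY OrderID, Amount, Product_Category
                   ORDER BY OrderID
               ) AS row_num
        FROM orders
    )
    SELECT OrderID, Amount, Product_Category
    FROM temp_orders
    WHERE row_num > 1
    ORDER BY OrderID
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()

print(subquery_rows)              # [(101, 2, 'Office'), (103, 4, 'Garden')]
print(subquery_rows == cte_rows)  # True
```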

To learn more about ROW_NUMBER(), CTEs, and GROUP BY, don’t forget to check out the interesting resources at the end of this read!

