In this article, I will discuss 2 ways to get rid of duplicate rows from your MySQL database table. The first one will work for any MySQL version, but the second one will work for MySQL 8 or a later version.
So, let’s jump into it.
Sample Data
Table Name: bin_quantity
Table Data
id | bin_id | date | quantity |
---|---|---|---|
1 | 269 | 2021-10-11 | 5 |
2 | 269 | 2021-10-11 | 5 |
3 | 269 | 2021-10-15 | 50 |
4 | 269 | 2021-10-15 | 23 |
10 | 100 | 2021-10-15 | 10 |
11 | 100 | 2022-04-15 | 10 |
23 | 98 | 2022-02-28 | 100 |
24 | 277 | 2021-10-28 | 0 |
25 | 277 | 2021-11-28 | 100 |
26 | 278 | 2021-09-28 | 0 |
27 | 278 | 2021-10-28 | 100 |
54 | 279 | 2021-11-05 | 50 |
72 | 276 | 2021-10-29 | 50 |
73 | 276 | 2021-10-29 | 50 |
74 | 276 | 2021-10-29 | 50 |
84 | 276 | 2021-10-29 | 50 |
85 | 276 | 2021-10-29 | 50 |
Table & Data Query
If you want to use this sample data and want to try the processes, then use the following query.
CREATE TABLE `bin_quantity` (
`id` BIGINT(20) DEFAULT NULL,
`bin_id` BIGINT(20) DEFAULT NULL,
`date` DATE DEFAULT NULL,
`quantity` DECIMAL(12,0) DEFAULT NULL
);
INSERT INTO `bin_quantity`(`id`,`bin_id`,`date`,`quantity`) VALUES
(23,98,'2022-02-28',100),
(11,100,'2022-04-15',10),
(10,100,'2021-10-15',10),
(3,269,'2021-10-15',50),
(4,269,'2021-10-15',23),
(2,269,'2021-10-11',5),
(1,269,'2021-10-11',5),
(85,276,'2021-10-29',50),
(84,276,'2021-10-29',50),
(74,276,'2021-10-29',50),
(73,276,'2021-10-29',50),
(72,276,'2021-10-29',50),
(25,277,'2021-11-28',100),
(24,277,'2021-10-28',0),
(27,278,'2021-10-28',100),
(26,278,'2021-09-28',0),
(54,279,'2021-11-05',50);
Problem With This Data
For each bin_id for a specific date, there should be only one row in the table. But there are multiple records for each bin_id and date in the record.
Find Duplicate Rows
Let’s check if there exist any duplicate rows.
As we expect that for each ‘bin_id‘ and ‘date‘ combination only one row should exist. So use the following query to check which rows are duplicates, and also see the count of duplicate rows.
SELECT bin_id, `date`, COUNT(*) AS num_of_rows
FROM bin_quantity
GROUP BY bin_id, `date`
HAVING num_of_rows > 1
From this query, you will get results like the below:
bin_id | date | num_of_rows |
---|---|---|
269 | 2021-10-15 | 2 |
269 | 2021-10-11 | 2 |
276 | 2021-10-29 | 5 |
That means we have duplicate (unexpected data there).
Let’s clean that data.
Delete Duplicate Rows
Use the following methods to detect and delete duplicate rows. Here we have discussed 2 methods for that, let’s take a look one by one:
Method #1: Using Join
This is a general process and will work on any MySQL version.
- Precondition: Make sure that there is at least one column that has a distinct value, it can be a Primary Key or Unique Key. This will be used to differentiate between the rows while comparing.
- Find Duplicate Data: To find the duplicate data – join the same table based on the common criteria column. (in our case it’s the combination of bin_id and date). If you use the following select query then you will get the duplicate rows and for which rows those are duplicates for.
SELECT bq1.*, bq2.*
FROM bin_quantity bq1
INNER JOIN bin_quantity bq2
ON bq1.bin_id = bq2.bin_id
AND bq1.date = bq2.date
WHERE bq1.id < bq2.id
ORDER BY bq1.bin_id ASC,
bq1.date ASC,
bq1.id ASC,
bq2.id ASC;
This Gives the following data:
id | bin_id | date | quantity | id | bin_id | date | quantity |
---|---|---|---|---|---|---|---|
1 | 269 | 2021-10-11 | 5 | 2 | 269 | 2021-10-11 | 5 |
3 | 269 | 2021-10-15 | 50 | 4 | 269 | 2021-10-15 | 23 |
72 | 276 | 2021-10-29 | 50 | 73 | 276 | 2021-10-29 | 50 |
72 | 276 | 2021-10-29 | 50 | 74 | 276 | 2021-10-29 | 50 |
72 | 276 | 2021-10-29 | 50 | 84 | 276 | 2021-10-29 | 50 |
72 | 276 | 2021-10-29 | 50 | 85 | 276 | 2021-10-29 | 50 |
73 | 276 | 2021-10-29 | 50 | 74 | 276 | 2021-10-29 | 50 |
73 | 276 | 2021-10-29 | 50 | 84 | 276 | 2021-10-29 | 50 |
73 | 276 | 2021-10-29 | 50 | 85 | 276 | 2021-10-29 | 50 |
74 | 276 | 2021-10-29 | 50 | 84 | 276 | 2021-10-29 | 50 |
74 | 276 | 2021-10-29 | 50 | 85 | 276 | 2021-10-29 | 50 |
84 | 276 | 2021-10-29 | 50 | 85 | 276 | 2021-10-29 | 50 |
- Delete Duplicate Data: Delete the duplicate rows comparing the unique criteria. (in our case it’s the ‘id’ column).
If you want to keep the rows that are created earlier (have a smaller ‘id’ value) then use this query.
DELETE bq2
FROM bin_quantity bq1
INNER JOIN bin_quantity bq2
ON bq1.bin_id = bq2.bin_id
AND bq1.date = bq2.date
AND bq1.id < bq2.id;
If you want to keep the later rows, use “>” for comparing the id part, like ‘bq1.id > bq2.id`.
After deleting, you will get clean data like this.
id | bin_id | date | quantity |
---|---|---|---|
23 | 98 | 2022-02-28 | 100 |
10 | 100 | 2021-10-15 | 10 |
11 | 100 | 2022-04-15 | 10 |
1 | 269 | 2021-10-11 | 5 |
3 | 269 | 2021-10-15 | 50 |
72 | 276 | 2021-10-29 | 50 |
24 | 277 | 2021-10-28 | 0 |
25 | 277 | 2021-11-28 | 100 |
26 | 278 | 2021-09-28 | 0 |
27 | 278 | 2021-10-28 | 100 |
54 | 279 | 2021-11-05 | 50 |
Method #2: Using “ROW_NUMBER()” Window Function
This process will work on MySQL 8 or later.
Reason: We are going to use a MySQL window function named “ROW_NUMBER()”, which is not available in the version earlier than 8.
Check the documentation of MySQL ROW_NUMBER() window function here: https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_row-number
- Find Duplicate Data: Using the ROW_NUMBER() function of MySQL we can generate a unique row number, based on certain criteria/columns. Like in this case we want to assign a unique id to the rows which have the same
bin_id
anddate
.
Use the following query:
SELECT
ROW_NUMBER() OVER(
PARTITION BY bin_id, `date`
ORDER BY id) AS row_num,
bq.*
FROM bin_quantity bq
ORDER BY bin_id ASC,
`date` ASC,
row_num ASC;
After running this query you will be data like this.
row_num | id | bin_id | date | quantity |
---|---|---|---|---|
1 | 23 | 98 | 2022-02-28 | 100 |
1 | 10 | 100 | 2021-10-15 | 10 |
1 | 11 | 100 | 2022-04-15 | 10 |
1 | 1 | 269 | 2021-10-11 | 5 |
2 | 2 | 269 | 2021-10-11 | 5 |
1 | 3 | 269 | 2021-10-15 | 50 |
2 | 4 | 269 | 2021-10-15 | 23 |
1 | 72 | 276 | 2021-10-29 | 50 |
2 | 73 | 276 | 2021-10-29 | 50 |
3 | 74 | 276 | 2021-10-29 | 50 |
4 | 84 | 276 | 2021-10-29 | 50 |
5 | 85 | 276 | 2021-10-29 | 50 |
1 | 24 | 277 | 2021-10-28 | 0 |
1 | 25 | 277 | 2021-11-28 | 100 |
1 | 26 | 278 | 2021-09-28 | 0 |
1 | 27 | 278 | 2021-10-28 | 100 |
1 | 54 | 279 | 2021-11-05 | 50 |
Here we have a unique ‘row_num’ in a group of rows that have the same ‘bin_id’ and ‘date’.
- Get Ids of Duplicate Data: Now use the following query to get the `id’(s) of those rows with ‘row_num’ greater than one.
SELECT id
FROM (
SELECT
ROW_NUMBER() OVER(
PARTITION BY bin_id, `date`
ORDER BY id) AS row_num,
bq.id
FROM bin_quantity bq
) AS bqr
WHERE row_num > 1;
It will give the ids like below.
id |
---|
2 |
4 |
73 |
74 |
84 |
85 |
id |
- Delete duplicate data: Now you have the ‘id’(s’), so just use a DELETE query to delete the rows which have those ids.
DELETE
FROM bin_quantity
WHERE id IN (
SELECT id
FROM (
SELECT
ROW_NUMBER() OVER(
PARTITION BY bin_id, `date`
ORDER BY id) AS row_num,
bq.id
FROM bin_quantity bq
) AS bqr
WHERE row_num > 1
);
After running this query, you will have clean data like the below:
id | bin_id | date | quantity |
---|---|---|---|
23 | 98 | 2022-02-28 | 100 |
10 | 100 | 2021-10-15 | 10 |
11 | 100 | 2022-04-15 | 10 |
1 | 269 | 2021-10-11 | 5 |
3 | 269 | 2021-10-15 | 50 |
72 | 276 | 2021-10-29 | 50 |
24 | 277 | 2021-10-28 | 0 |
25 | 277 | 2021-11-28 | 100 |
26 | 278 | 2021-09-28 | 0 |
27 | 278 | 2021-10-28 | 100 |
54 | 279 | 2021-11-05 | 50 |