Removing duplicate entries from a large table in MySQL

A few days ago I faced an issue in my Ruby on Rails application. The issue was caused by some duplicate rows in a very large the MySQL table (more than 1,5 million rows). The obvious solution of course is to remove duplicate rows from the table and add a unique constraint on some columns to make sure no duplicate rows will ever appear again.

The question is how to remove duplicate rows? The table schema is like this:

describe table1;

| id           | int(11)       
| category     | varchar(255)  
| image_set_id | int(11)       
| position     | decimal(16,6) 

In my case, it was duplicate on the combination (category, image_set_id). I want to make sure that combined value will always be unique.

Searching on Google and I quickly found the solution. The query is something like this:

DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name

Mapping to our table, we have the query:

DELETE t1 FROM table1 t1, table t2 WHERE t1.id > t2.id AND t1.category = t2.category AND t1.image_set_id = t2.image_set_id

What is the issue with this method? It is a self-join and it is extremely slow. Remember that we have more than 1 million rows so don't expect to see the above query complete soon.

ALTER IGNORE

Luckily, I found another solution which seems to fit my requirement. We can use ALTER IGNORE to add UNIQUE INDEX to the table and remove the duplicate rows at the same time. The query is:

ALTER IGNORE TABLE table1 ADD UNIQUE INDEX index_on_category_image_set_id(category, image_set_id)

But I soon realized it did not work. Why? ALTER IGNORE was removed from MySQL 5.7 while it is exactly the version I am using.

INSERT IGNORE

Finally I found the "best" solution. Instead of trying to remove the current table, we can utilize the INSERT IGNORE query. The steps:

  • Create a temporary table similar with similar schema to the existing table
  • add UNIQUE constraint to the columns we want
  • run INSERT IGNORE to copy data from the original table to the temporary table, any duplicate rows will not be inserted into the temporary table because they violates the UNIQUE constraint (and we simply ignore the error)
  • rename the original table to something else and rename temporary table to the original table. Drop the redundant table

The final SQL statements are:


-- Create temporary table
CREATE TABLE temp_table LIKE table1;

-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(category, image_set_id);

-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM table1;

-- Rename and drop
RENAME TABLE table1 TO old_table1, temp_table TO table1;
DROP TABLE old_table1;

That's it. The script took about 20 seconds to copy data from the original table into the temporary table, but everything else is super fast.