Monday, April 4, 2011

Everything is a query these days - Part 1

So you have an IT problem and you are desperate to solve it. You get some sort of error description and trying to figure out what the error is all about. Or you get a phonecall from an angry customer complaining that your application has malfunctioned. What do you do? The answer is so easy. You google it!

Today I had an issue of duplicate records in a database. My team is in beta testing right now and the application owners found out that in a particular User Interface fetchedduplicate records. In the words of IT that either means some undefined code break somewhere in the application (bad scenario) or duplicate entries in the database (good scenario). After some searching I figured out that it was the latter, i.e. duplicate row entries in a reference tables. A mysql table.

So how did I solve this trivial problem? I simply typed "duplicate row entries in mysql". And the answer was magically there:

1. Duplicate Record Detection

From a vast variety of queries, these were the simpler for a mysql database:

SELECT name
FROM emp_t
GROUP BY name
HAVING COUNT(*) > 1

Assuming here that you do not allow duplicate entries of the name column in your database.

So I performed the searches and found out the duplicate records in question. Of course the next step is to delete these records. How do I do that?

2. Duplicate Record Deletion

The funny thing here is that you do not even have to figure out the mechanics behind the fancy deletion query. All you need is a tmp table just to perform a test prior to deleting the actual duplicate records from your "production db". Easy:

CREATE TABLE emp_tmp
AS SELECT * FROM emp;

The actual delete query may be the following:

DELETE
FROM (
        SELECT t1.name, t1.id
        FROM (
                SELECT name
                FROM emp_tmp
                GROUP BY name
                HAVING COUNT(name)>1
        ) AS t0 
              INNER JOIN emp_tmp t1 
                  ON t0.name = t1.name
) AS t2 
        INNER JOIN
            emp_tmp t3 
                ON t3.name = t2.name
WHERE t2.id < t3.id;

For an sql primer this looks like a tough job. But remember its all about googling, someone else has done that for you!All you need to do is check whether the above query meets your needs, i.e. deletes duplicate records from your target table. After you make sure it works with the temporary table you created, go ahead and try it on the actual table!

Another point here is why and how duplicate records got inside in the first place. This is lack of design for sure. Duplicate records mean that someone should have put a unique integrity constraint in the fields now considered duplicate. But didn't. So get your designs right before the coding begins!

No comments:

Post a Comment