If a table must not contain duplicates, you can make the non-duplicated column a primary key or create a unique index on it. However, this approach is not suitable for applications that do not require globally detached indexes, because it forces fragmentation on the non-duplicated column when other fragmentation schemes might perform better.
In addition, if such constraints exist, large amounts of new data must be loaded in deluxe mode, which is slower because it does not allow light appends to the table and must access the index each time it inserts a row.
It is faster to check for duplicates and eliminate them after you load the data in express mode. Locating and deleting duplicate rows is easier if the table definition contains a unique value, such as a serial key or a time stamp. Then you can write SQL statements that find rows with duplicate values in key columns but different time stamps and eliminate the rows with earlier time stamps.
First, for each duplicated key value, capture in table TT the key and the timestamp of the row that should survive (the latest one):
INSERT INTO TT SELECT k, MAX(t) FROM T GROUP BY k HAVING COUNT(*) > 1
Then delete duplicate rows as follows:
DELETE FROM T WHERE EXISTS (SELECT 1 FROM TT WHERE T.k = TT.k AND T.t <> TT.t)
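The two statements above can be sketched end to end in a small script. This is a minimal illustration using Python's sqlite3 module as a stand-in database; the table and column names T, TT, k, and t follow the example above, and the sample rows are invented for demonstration.

```python
import sqlite3

# In-memory database standing in for the real one. T(k, t) holds the
# loaded data: k is the key column that may contain duplicates, and t
# is the insert timestamp. TT will hold the surviving row per key.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T (k INTEGER, t INTEGER)")
cur.execute("CREATE TABLE TT (k INTEGER, t INTEGER)")

# Simulated express-mode load: key 1 was inserted twice, key 2 once.
cur.executemany("INSERT INTO T VALUES (?, ?)",
                [(1, 100), (1, 200), (2, 150)])

# Step 1: for each duplicated key, record the latest timestamp in TT.
cur.execute("""INSERT INTO TT
               SELECT k, MAX(t) FROM T
               GROUP BY k HAVING COUNT(*) > 1""")

# Step 2: delete the rows whose timestamp differs from the one kept
# in TT, i.e. the earlier duplicates.
cur.execute("""DELETE FROM T
               WHERE EXISTS (SELECT 1 FROM TT
                             WHERE T.k = TT.k AND T.t <> TT.t)""")

remaining = sorted(cur.execute("SELECT k, t FROM T").fetchall())
print(remaining)  # [(1, 200), (2, 150)]
```

Only the earlier duplicate of key 1 is removed; the unduplicated key 2 is untouched because it never appears in TT.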