Handling Duplicate Data
You print the month-end report and see two records for account #102.
You've merged databases and need to check for duplicates. You have
both John Smith and John C. Smith at the same address. Are they the
same person? You have a record for John Smith with a 5-digit zipcode
and a record for John Smith with a 10-digit zipcode. All of these
situations involve what may be duplicate records. What can you do to
find the records involved and delete the duplicates?
Browsing the data is one way, but not if you have 10,000 records, and
some situations require human interaction. For example, the computer
cannot make the decision that John Smith and John C. Smith are the
same person. The computer can, however, help you find the records
involved and present them so that the data is easy to view and the
necessary decisions can be made.
The R:BASE DELETE DUPLICATES command deletes identical records from a
table. The command searches and compares records character by
character, and if two records are identical, will delete the second
record. Because the command compares character by character, the
command is not fast. And often, records are not exactly identical.
Two spaces between words instead of one means records are not
identical. An M instead of an N means records are not identical. Most
situations involve data where human interaction is needed to say
"Yes, these records are the same" or where the comparison needs to be
made on one or two or three columns, not the entire row. In R:BASE
4.5 Plus! and higher, you can specify a column list with the DELETE
DUPLICATES command to check for duplication on the specified columns
only. For other situations, you need to find, and actually look at,
the duplicated rows.
Preventing Duplicates
Use R:BASE constraints and data entry rules to prevent duplicate data
from being entered. A primary key constraint requires that the
specified column or columns be unique. You can also create a rule that
requires data entered in a column to be unique. The primary key
constraint or the rule is easily created through the menus. The menus
prompt for the table and column name that must be unique. Another
option is to define a unique index for a column. Each of these
techniques ensures that duplicate data is not entered into a column.
To require uniqueness across a group of columns, you define a
multi-column primary key, a multi-column uniqueness rule, or a
multi-column unique index. The multi-column primary key can be defined
through the menu system in either R:BASE 4.5++ or R:BASE 5.1. The
multi-column unique index can be defined through the menu system in
R:BASE 5.1, but must be defined from the R> prompt in 4.5++. A
multi-column uniqueness rule is defined through the custom rule
option.
The easiest way to define a multi-column uniqueness rule is to start
with the default single column uniqueness rule. Create a single column
uniqueness rule picking the appropriate table and one of the columns
that is part of the unique group. Then select to modify the rule just
created. Add to the WHERE clause comparison conditions for the other
columns that make up the unique group. Follow the same structure as
for the default rule.
For example, here is the WHERE clause from a default rule requiring
the employee last name to be unique:
Employee.emplname IS NOT NULL AND Employee.emplname +
NOT IN (SELECT emplname FROM Employee #T1 +
WHERE #T1.emplname = Employee.emplname)
Since last name is not unique be itself, the rule is modified to add
the first name and address columns to the condition list. The three
columns together, last name, first name, and address, make a unique
set.
Employee.emplname IS NOT NULL AND Employee.emplname +
NOT IN (SELECT emplname FROM Employee #T1 +
WHERE #T1.emplname = Employee.emplname +
AND #T1.empfname = Employee.empfname AND +
#T1.empaddress = Employee.empaddress)
Finding Duplicates
Ever wonder why a company sends you two or three pieces of the same
mailing? All with slightly different names? In this age of computers
why can't they figure out that you're all the same person? Well, the
computer can't make the decision that Smith and Smith are the same
person, only a human can do that. What the computer can do is help
identify possible duplicate records.
R:BASE's constraints and data entry rules can keep duplicate data from
being entered, but constraints and rules don't prevent misspellings or
miskeying of data. An SQL SELECT command is used to search data for
possible duplicates. Notice that the name columns are not used in the
SELECT command to check for duplicates. We want to search for possibly
misspelled names so we don't use name as one of our criteria.
SELECT empid, COUNT(*)....the column, empid, uniquely identifies a row
FROM employee...... this is the table, Employee, to check for
duplicates
WHERE city IN............one of the columns used in the duplicate
comparison
(SELECT city...........a Sub-SELECT is used to create a comparison
list
FROM employee T1
GROUP BY state, city, address...group the data together by the
columns used to check for duplicates
HAVING COUNT(*) > 1)....... ...count the rows in the group, only
return a result if the same state, city, and address occur more than
once GROUP BY empid...........display the id and count only once for
each id
This SELECT command displays the employee ID and the number of times
it's found with the same state, city and street address. If any
records are found, there may be duplicate data where the name or the
zipcode was misspelled or miskeyed. Now those records can be displayed
for editing and someone can review them and decide if they are indeed
duplicates and which one to keep.
Keep the same WHERE clause, but replace the SELECT command with a
command such as EDIT that allows viewing and deleting of the data:
EDIT * FROM employee +
WHERE city IN +
(SELECT city +
FROM employee T1 +
GROUP BY state, city, address +
HAVING COUNT(*) > 1) +
ORDER BY state, city, address, lastname
Add the ORDER BY clause so when the data is displayed, the possible
duplicate records are next to each other. Depending on the size of the
database, this can be a large task. But if you're serious about
finding and removing duplicates, it needs to be done.
Deleting duplicates
Once the potential duplicates are found, a person can review the data
and then delete the duplicate data. This way there is a choice over
which record to keep. The computer can be programmed to do this task,
but it will always keep the first record found.
The easiest way to have the computer delete duplicates, keeping the
first record, is to use the DELETE DUPLICATES command with the
optional USING clause. The USING clause specifies the columns to check
for uniqueness so the computer does not need to check the entire row.
For example, the check for duplicate employee records you might use
this command:
DELETE DUPLICATES FROM employee +
USING ELastName, EFirstName, EAddress
R:BASE checks only the specified columns, ELastName, EFirstName,
EAddress, for duplicate data making for much faster processing time
than if R:BASE had to check the entire row. With the optional USING
clause, the DELETE DUPLICATES command is the fastest way to delete
duplicate rows of data from a table. This option is available in
R:BASE version 4.5 Plus! and higher.
An alternative method involves creating a temporary table and using a
multi-column unique index or a multi-column uniqueness rule. This
method works well for versions of R:BASE prior to R:BASE 4.5 Plus!.
1. Create a temporary table with the same structure as the table that
has the duplicate records.
PROJECT temp FROM employee USING ALL WHERE LIMIT=0
2. Create a multi-column uniqueness rule referencing the temporary
table, Temp. Create an index on one of the columns to be compared, for
example, lastname. This is important for performance. Or, create a
multi-column unique index on the temporary table, Temp. When using a
multi-column unique index, the columns used in the index must be
specifically defined as NOT NULL.
3. Use the INSERT command to move the data from the original table to
the temporary table.
INSERT INTO temp SELECT * FROM employee
As a row is added, the rule or index is checked. If a row already has
been added, the rule or index check fails and the duplicate row is not
loaded. There is no way to count how many rows fail or to see what
those rows are. The first row is loaded and subsequent rows are
automatically thrown out. The table Temp then contains only one row
with a particular last name, first name, street address combination,
for example.
Not really duplicates
Often duplicate data printed in a report is not really duplicated.
The report may be based on a view and the view joins the tables in
such a way that the duplicate data is created in the view but does not
exist in the underlying tables. Duplicate data can appear in a view
any time tables that have a many-to-many relationship are joined.
Some of the columns are always duplicated when tables with a one-to-
many relationship are joined.
Look at an example from the CONCOMP sample database. The Salesbonus
table has many rows for each employee. The column empid is used to
link with other tables, but does not uniquely identify a row.
empid transdate netamount bonuspct bonus
------ --------- ----------- -------- ---------------
102 01/12/95 $176000.00 0.003 $528.00
102 02/27/95 $87500.00 0.002 $175.00
102 02/28/95 $22500.00 0. $0.00
102 03/01/95 $40500.00 0.001 $40.50
129 01/10/95 $76800.00 0.002 $153.60
129 02/23/95 $36625.00 0.001 $36.63
129 03/07/95 $56250.00 0.002 $112.50
133 01/03/95 $27000.00 0.001 $27.00
160 01/09/95 $9500.00 0. $0.00
160 02/23/95 $210625.00 0.003 $631.88
The Employee table has one row for each employee. The empid column
does uniquely identify a row.
empid emptitle empfname emplname empaddr
----- -------------- -------- ---------- ------------------
102 Manager June Wilson 3278 Summit Drive
129 Manager Ernest Hernandez 12390 Windermere Dr.
133 Representative Peter Coffin 4105 29th Ave N.E.
160 Representative Mary Simpson 101 West Mercer
165 Representative Darnell Williams 8806 88th Street
166 Sales Clerk John Chou 5001 Main Street
167 Representative Sandi Watson 1002 S. Front Ave.
When the two tables are joined
(Employee.empid=Salesbonus.empid), the result displays as many rows
for each employee (empid) as there are in the Salesbonus table. The
data selected from the Employee table is repeated, but the data from
the Salesbonus table is unique.
data from data from
Employee Salesbonus
/\ / \
empid emptitle empid transdate netamount
----- -------------- ----- --------- -----------
102 Manager 102 01/12/95 $176000.00
102 Manager 102 02/27/95 $87500.00
102 Manager 102 02/28/95 $22500.00
102 Manager 129 01/10/95 $76800.00
129 Manager 129 02/23/95 $36625.00
129 Manager 129 03/07/95 $56250.00
133 Representative 133 01/03/95 $27000.00
160 Representative 160 01/09/95 $9500.00
160 Representative 160 02/23/95 $210625.00
| |
| |
+--------------+------+
|
the data is linked together by the empid column
One row is created in the view for each row in the many table
(Salesbonus) that matches a row in the one table (Employee).
What happens if this view is now linked with the Transmaster table?
The Transmaster table also includes the column empid, but like the
Salesbonus table, can have many rows for each employee.
transid custid empid transdate netamount freight
------- ------- ------ --------- ----------- ---------
4760 100 133 01/03/95 $27000.00 $270.00
4780 105 160 09/09/95 $9500.00 $95.00
4790 104 129 01/10/95 $63000.00 $630.00
4795 101 102 01/12/95 $176000.00 $1760.00
4800 105 160 09/23/95 $167250.00 $1672.50
4865 102 129 02/23/95 $29125.00 $291.25
4975 101 102 02/27/95 $87500.00 $875.00
4980 101 102 09/28/95 $22500.00 $225.00
5000 101 102 03/01/95 $29000.00 $290.00
5050 104 129 09/07/95 $56250.00 $562.50
5060 101 102 03/08/95 $30000.00 $300.00
5065 106 160 09/14/95 $140300.00 $1403.00
5070 104 129 03/15/95 $95500.00 $955.00
5075 102 129 09/16/95 $155500.00 $1555.00
5080 100 133 03/20/95 $80000.00 $800.00
When the Employee, Salesbonus and Transmaster tables are joined by
linking the empid column, the view contains many extra rows of data.
Look at the data for employee #102 (empid=102), for example. There are
four rows for employee #102 in the Employee, Salesbonus view. Each of
these four rows joins with each of the five rows in the Transmaster
table for employee #102, making 20 rows in the resulting three-table
view.
data from data from data from
Employee Salesbonus Transmaster
/ \ / \ /\
empid emptitle transdate netamount transid custid empid transdate
----- -------- --------- ---------- ------- ------ ----- ---------
102 Manager 01/12/95 $176000.00 4795 101 102 01/12/95
102 Manager 02/27/95 $87500.00 4795 101 102 01/12/95
102 Manager 02/28/95 $22500.00 4795 101 102 01/12/95
102 Manager 03/01/95 $40500.00 4795 101 102 01/12/95
102 Manager 01/12/95 $176000.00 4975 101 102 02/27/95
102 Manager 02/27/95 $87500.00 4975 101 102 02/27/95
102 Manager 02/28/95 $22500.00 4975 101 102 02/27/95
102 Manager 03/01/95 $40500.00 4975 101 102 02/27/95
102 Manager 01/12/95 $176000.00 4980 101 102 09/28/95
102 Manager 02/27/95 $87500.00 4980 101 102 09/28/95
102 Manager 02/28/95 $22500.00 4980 101 102 09/28/95
102 Manager 03/01/95 $40500.00 4980 101 102 09/28/95
This makes a lot of duplicated data in our final view. Only a few of
the columns are unique to a row. If we don't select all the columns,
we may think we have duplicate rows. By using another column,
transdate, to link the Transmaster table to the Salesbonus table, the
view changes to a one-many-one relationship instead of a one-many-
many.
If you suspect there are extra rows in your view due to a many-to-
many link, check it by selecting the data from the view and then from
the underlying tables. With the above example, use the following
commands to count the rows for a particular empid. The counts show
that there are indeed extra rows in our view.
R>SELECT COUNT(*) FROM employee WHERE empid=102
COUNT (*)
----------
1
R>SELECT COUNT(*) FROM salesbonus WHERE empid=102
COUNT (*)
----------
4
R>SELECT COUNT(*) FROM transmaster WHERE empid=102
COUNT (*)
----------
5
R>SELECT COUNT(*) FROM bonus_view WHERE empid=102
COUNT (*)
----------
20
It's much easier to detect this type of problem by picking a specific
data value to test. How do you solve this? Look at your table
structure. There may be another column that can used to link the
tables to create a one relationship rather than a many relationship.
Identify the columns that uniquely identify a row in each table. In
the example above, the empid column does not uniquely identify a row
in the Salesbonus table, rather both the empid and transdate columns
must be used to uniquely identify a row. The transdate column can be
used with the empid column to link the Salesbonus table with the
Transmaster table removing the many-to-many relationship from the
three table view. As with all SQL commands, when creating views it's
best to work with a small sample of data so you can easily verify the
results.