Handling Duplicate Data
     You print the month-end report and see two records for account #102. 
     You've merged databases and need to check for duplicates. You have 
     both John Smith and John C. Smith at the same address. Are they the 
     same person? You have a record for John Smith with a 5-digit zipcode 
     and a record for John Smith with a 10-digit zipcode. All of these 
     situations involve what may be duplicate records. What can you do to 
     find the records involved and delete the duplicates?
 
     Browsing the data is one way, but not if you have 10,000 records, and 
     some situations require human interaction. For example, the computer 
     cannot make the decision that John Smith and John C. Smith are the 
     same person. The computer can, however, help you find the records 
     involved and present them so that the data is easy to view and the 
     necessary decisions can be made.
 
     The R:BASE DELETE DUPLICATES command deletes identical records from a 
     table. The command searches and compares records character by 
     character, and if two records are identical, will delete the second 
     record. Because the command compares character by character, the 
     command is not fast. And often, records are not exactly identical. 
     Two spaces between words instead of one means records are not 
     identical. An M instead of an N means records are not identical. Most 
     situations involve data where human interaction is needed to say 
     "Yes, these records are the same" or where the comparison needs to be 
     made on one or two or three columns, not the entire row. In R:BASE 
     4.5 Plus! and higher, you can specify a column list with the DELETE 
     DUPLICATES command to check for duplication on the specified columns 
     only. For other situations, you need to find, and actually look at, 
     the duplicated rows.
 
     Preventing Duplicates
     Use R:BASE constraints and data entry rules to prevent duplicate data 
     from being entered. A primary key constraint requires that the 
     specified column or columns be unique. You can also create a rule that 
     requires data entered in a column to be unique. The primary key 
     constraint or the rule is easily created through the menus. The menus 
     prompt for the table and column name that must be unique. Another 
     option is to define a unique index for a column. Each of these 
     techniques ensures that duplicate data is not entered into a column.
 
     To require uniqueness across a group of columns, you define a 
     multi-column primary key, a multi-column uniqueness rule, or a 
     multi-column unique index. The multi-column primary key can be defined 
     through the menu system in either R:BASE 4.5++ or R:BASE 5.1. The 
     multi-column unique index can be defined through the menu system in 
     R:BASE 5.1, but must be defined from the R> prompt in 4.5++. A 
     multi-column uniqueness rule is defined through the custom rule 
     option.
 
     The easiest way to define a multi-column uniqueness rule is to start 
     with the default single column uniqueness rule. Create a single column 
     uniqueness rule picking the appropriate table and one of the columns 
     that is part of the unique group. Then select to modify the rule just 
     created. Add to the WHERE clause comparison conditions for the other 
     columns that make up the unique group. Follow the same structure as 
     for the default rule.
 
     For example, here is the WHERE clause from a default rule requiring 
     the employee last name to be unique: 
 
     Employee.emplname IS NOT NULL AND Employee.emplname +
       NOT IN (SELECT emplname FROM Employee #T1 +
               WHERE #T1.emplname = Employee.emplname)
 
     Since last name is not unique be itself, the rule is modified to add 
     the first name and address columns to the condition list. The three 
     columns together, last name, first name, and address, make a unique 
     set.
 
     Employee.emplname IS NOT NULL AND Employee.emplname +
       NOT IN (SELECT emplname FROM Employee #T1 +
               WHERE #T1.emplname = Employee.emplname +
               AND #T1.empfname = Employee.empfname AND +
              #T1.empaddress = Employee.empaddress)
 
     Finding Duplicates
     Ever wonder why a company sends you two or three pieces of the same 
     mailing? All with slightly different names? In this age of computers 
     why can't they figure out that you're all the same person? Well, the 
     computer can't make the decision that Smith and Smith are the same 
     person, only a human can do that. What the computer can do is help 
     identify possible duplicate records.
 
     R:BASE's constraints and data entry rules can keep duplicate data from 
     being entered, but constraints and rules don't prevent misspellings or 
     miskeying of data. An SQL SELECT command is used to search data for 
     possible duplicates. Notice that the name columns are not used in the 
     SELECT command to check for duplicates. We want to search for possibly 
     misspelled names so we don't use name as one of our criteria.
 
     SELECT empid, COUNT(*)....the column, empid, uniquely identifies a row
     FROM employee......      this is the table, Employee, to check for 
     duplicates 
     WHERE city IN............one of the columns used in the duplicate 
     comparison
     (SELECT city...........a Sub-SELECT is used to create a comparison 
     list
       FROM employee T1
       GROUP BY state, city, address...group the data together by the 
       columns used to check for duplicates
       HAVING COUNT(*) > 1).......  ...count the rows in the group, only 
       return a result if the same state, city, and address occur more than 
       once GROUP BY empid...........display the id and count only once for 
       each id 
 
     This SELECT command displays the employee ID and the number of times 
     it's found with the same state, city and street address. If any 
     records are found, there may be duplicate data where the name or the 
     zipcode was misspelled or miskeyed. Now those records can be displayed 
     for editing and someone can review them and decide if they are indeed 
     duplicates and which one to keep. 
 
     Keep the same WHERE clause, but replace the SELECT command with a 
     command such as EDIT that allows viewing and deleting of the data:
 
      EDIT * FROM employee +
      WHERE city IN +
          (SELECT city +
           FROM employee T1 +
           GROUP BY state, city, address +
           HAVING COUNT(*) > 1) +
      ORDER BY state, city, address, lastname
 
     Add the ORDER BY clause so when the data is displayed, the possible 
     duplicate records are next to each other. Depending on the size of the 
     database, this can be a large task. But if you're serious about 
     finding and removing duplicates, it needs to be done.
 
     Deleting duplicates
     Once the potential duplicates are found, a person can review the data 
     and then delete the duplicate data. This way there is a choice over 
     which record to keep. The computer can be programmed to do this task, 
     but it will always keep the first record found.
 
     The easiest way to have the computer delete duplicates, keeping the 
     first record, is to use the DELETE DUPLICATES command with the 
     optional USING clause. The USING clause specifies the columns to check 
     for uniqueness so the computer does not need to check the entire row. 
     For example, the check for duplicate employee records you might use 
     this command:
 
     DELETE DUPLICATES FROM employee +
       USING ELastName, EFirstName, EAddress
 
     R:BASE checks only the specified columns, ELastName, EFirstName, 
     EAddress, for duplicate data making for much faster processing time 
     than if R:BASE had to check the entire row. With the optional USING 
     clause, the DELETE DUPLICATES command is the fastest way to delete 
     duplicate rows of data from a table. This option is available in 
     R:BASE version 4.5 Plus! and higher.
 
     An alternative method involves creating a temporary table and using a 
     multi-column unique index or a multi-column uniqueness rule. This 
     method works well for versions of R:BASE prior to R:BASE 4.5 Plus!.
 
     1. Create a temporary table with the same structure as the table that 
     has the duplicate records.
 
     PROJECT temp FROM employee USING ALL WHERE LIMIT=0
 
     2. Create a multi-column uniqueness rule referencing the temporary 
     table, Temp. Create an index on one of the columns to be compared, for 
     example, lastname. This is important for performance. Or, create a 
     multi-column unique index on the temporary table, Temp. When using a 
     multi-column unique index, the columns used in the index must be 
     specifically defined as NOT NULL.
 
     3. Use the INSERT command to move the data from the original table to 
     the temporary table.
 
     INSERT INTO temp SELECT * FROM employee
 
     As a row is added, the rule or index is checked. If a row already has 
     been added, the rule or index check fails and the duplicate row is not 
     loaded. There is no way to count how many rows fail or to see what 
     those rows are. The first row is loaded and subsequent rows are 
     automatically thrown out. The table Temp then contains only one row 
     with a particular last name, first name, street address combination, 
     for example.
 
     Not really duplicates
     Often duplicate data printed in a report is not really duplicated. 
     The report may be based on a view and the view joins the tables in 
     such a way that the duplicate data is created in the view but does not 
     exist in the underlying tables. Duplicate data can appear in a view 
     any time tables that have a many-to-many relationship are joined. 
     Some of the columns are always duplicated when tables with a one-to-
     many relationship are joined.
 
     Look at an example from the CONCOMP sample database. The Salesbonus 
     table has many rows for each employee. The column empid is used to 
     link with other tables, but does not uniquely identify a row.
 
     empid  transdate netamount  bonuspct bonus
     ------ --------- ----------- -------- ---------------
     102    01/12/95  $176000.00  0.003    $528.00
     102    02/27/95   $87500.00  0.002    $175.00
     102    02/28/95   $22500.00  0.         $0.00
     102    03/01/95   $40500.00  0.001     $40.50
     129    01/10/95   $76800.00  0.002    $153.60
     129    02/23/95   $36625.00  0.001     $36.63
     129    03/07/95   $56250.00  0.002    $112.50
     133    01/03/95   $27000.00  0.001     $27.00
     160    01/09/95    $9500.00  0.         $0.00
     160    02/23/95  $210625.00  0.003    $631.88
 
     The Employee table has one row for each employee. The empid column 
     does uniquely identify a row. 
 
      empid emptitle       empfname emplname    empaddr
      ----- -------------- -------- ----------  ------------------
      102  Manager         June     Wilson      3278 Summit Drive
      129  Manager         Ernest   Hernandez   12390 Windermere Dr.
      133  Representative  Peter    Coffin      4105 29th Ave N.E.
      160  Representative  Mary     Simpson     101 West Mercer 
      165  Representative  Darnell  Williams    8806 88th Street
      166  Sales Clerk     John     Chou        5001 Main Street 
      167  Representative  Sandi    Watson      1002 S. Front Ave. 
      
      When the two tables are joined 
 
     (Employee.empid=Salesbonus.empid), the result displays as many rows 
     for each employee (empid) as there are in the Salesbonus table. The 
     data selected from the Employee table is repeated, but the data from 
     the Salesbonus table is unique.
 
            data from             data from
            Employee              Salesbonus
                /\                    / \
      empid emptitle       empid transdate netamount
      ----- -------------- ----- --------- -----------
      102   Manager        102  01/12/95  $176000.00
      102   Manager        102  02/27/95  $87500.00
      102   Manager        102  02/28/95  $22500.00
      102   Manager        129  01/10/95  $76800.00
      129   Manager        129  02/23/95  $36625.00
      129   Manager        129  03/07/95  $56250.00
      133   Representative 133  01/03/95  $27000.00
      160   Representative 160  01/09/95  $9500.00
      160   Representative 160  02/23/95  $210625.00
      |                     |
      |                     |
      +--------------+------+
      |
      the data is linked together by the empid column 
 
     One row is created in the view for each row in the many table 
     (Salesbonus) that matches a row in the one table (Employee).
 
     What happens if this view is now linked with the Transmaster table? 
     The Transmaster table also includes the column empid, but like the 
     Salesbonus table, can have many rows for each employee.
 
      transid custid  empid  transdate netamount  freight
      ------- ------- ------ --------- ----------- ---------
      4760    100     133    01/03/95   $27000.00   $270.00
      4780    105     160    09/09/95    $9500.00    $95.00
      4790    104     129    01/10/95   $63000.00   $630.00
      4795    101     102    01/12/95  $176000.00  $1760.00
      4800    105     160    09/23/95  $167250.00  $1672.50
      4865    102     129    02/23/95   $29125.00   $291.25
      4975    101     102    02/27/95   $87500.00   $875.00
      4980    101     102    09/28/95   $22500.00   $225.00
      5000    101     102    03/01/95   $29000.00   $290.00
      5050    104     129    09/07/95   $56250.00   $562.50
      5060    101     102    03/08/95   $30000.00   $300.00
      5065    106     160    09/14/95  $140300.00  $1403.00
      5070    104     129    03/15/95   $95500.00   $955.00
      5075    102     129    09/16/95  $155500.00  $1555.00
      5080    100     133    03/20/95   $80000.00   $800.00
 
     When the Employee, Salesbonus and Transmaster tables are joined by 
     linking the empid column, the view contains many extra rows of data. 
     Look at the data for employee #102 (empid=102), for example. There are 
     four rows for employee #102 in the Employee, Salesbonus view. Each of 
     these four rows joins with each of the five rows in the Transmaster 
     table for employee #102, making 20 rows in the resulting three-table 
     view.
 
       data from          data from               data from
       Employee           Salesbonus             Transmaster
          /  \               /     \              /\
      empid emptitle  transdate netamount  transid custid empid transdate
      ----- --------  --------- ---------- ------- ------ ----- ---------
      102   Manager   01/12/95 $176000.00   4795   101    102   01/12/95 
      102   Manager   02/27/95  $87500.00   4795   101    102   01/12/95
      102   Manager   02/28/95  $22500.00   4795   101    102   01/12/95
      102   Manager   03/01/95  $40500.00   4795   101    102   01/12/95 
      102   Manager   01/12/95 $176000.00   4975   101    102   02/27/95 
      102   Manager   02/27/95  $87500.00   4975   101    102   02/27/95
      102   Manager   02/28/95  $22500.00   4975   101    102   02/27/95
      102   Manager   03/01/95  $40500.00   4975   101    102   02/27/95
      102   Manager   01/12/95 $176000.00   4980   101    102   09/28/95 
      102   Manager   02/27/95  $87500.00   4980   101    102   09/28/95
      102   Manager   02/28/95  $22500.00   4980   101    102   09/28/95
      102   Manager   03/01/95  $40500.00   4980   101    102   09/28/95
      
     This makes a lot of duplicated data in our final view. Only a few of 
     the columns are unique to a row. If we don't select all the columns, 
     we may think we have duplicate rows. By using another column, 
     transdate, to link the Transmaster table to the Salesbonus table, the 
     view changes to a one-many-one relationship instead of a one-many-
     many.
 
     If you suspect there are extra rows in your view due to a many-to-
     many link, check it by selecting the data from the view and then from 
     the underlying tables. With the above example, use the following 
     commands to count the rows for a particular empid. The counts show 
     that there are indeed extra rows in our view.
 
     R>SELECT COUNT(*) FROM employee WHERE empid=102
 
      COUNT (*)
      ----------
               1
 
     R>SELECT COUNT(*) FROM salesbonus WHERE empid=102
      COUNT (*)
      ----------
               4
 
     R>SELECT COUNT(*) FROM transmaster WHERE empid=102
      COUNT (*)
      ----------
               5
 
     R>SELECT COUNT(*) FROM bonus_view WHERE empid=102
      COUNT (*)
      ----------
              20
 
     It's much easier to detect this type of problem by picking a specific 
     data value to test. How do you solve this? Look at your table 
     structure. There may be another column that can used to link the 
     tables to create a one relationship rather than a many relationship. 
     Identify the columns that uniquely identify a row in each table. In 
     the example above, the empid column does not uniquely identify a row 
     in the Salesbonus table, rather both the empid and transdate columns 
     must be used to uniquely identify a row. The transdate column can be 
     used with the empid column to link the Salesbonus table with the 
     Transmaster table removing the many-to-many relationship from the 
     three table view. As with  all SQL commands, when creating views it's 
     best to work with a small sample of data so you can easily verify the 
     results.