""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DELETING DUPLICATE ROWS BASED ON A SUBSET OF COLUMNS
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DATE : 9/87 NUMBER : 87-9-1
PRODUCT : R:BASE SYSTEM V VERSIONS : ALL
CATEGORY : DELETE SUBCATEGORY : DUPES/MAINTENANCE
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
NEED:
""""
I need to delete duplicate rows in my customer table (CUST) based on a
combination of only three columns (IDNO, LASTNAME, and ZIP) rather
than all 20 columns in the table. The DELETE DUPLICATES command looks
at all the columns. This does not work for me because some of the
other columns may not be duplicated.
SOLUTION:
""""""""
Here are three methods you can use to delete duplicates based on a
subset of columns. Choose the one that best fits your needs.
The first method (Print Report Then Load with FileGateway) is the
fastest. It also allows you to set up criteria, such as "keep the
most recent row," to tell R:BASE which of the duplicates to keep.
This method prints all the data into a file with an R:BASE report, so
you must be able to fit each of your rows onto a 255-character, fixed-
format line.
The second method (Unload Then Load with Rules) also allows you to set
up some criteria to determine which row to keep. This method takes
longer than the first method because it requires that you add a
computed column to the table with the EXPAND command, and it checks
rules during the load process. However, there is no limit to the
length of your row. If your rows will not fit into a 255-character,
fixed-format line, and you need to keep only the latest row, use this
second method.
The third method (APPEND with RULES) does not require the creation of
a file. Instead, it uses RULES to prevent duplicates from getting
into the data during an APPEND. Use this third method only if it does
not matter which duplicate is kept.
Methods two and three use the EXPAND and REMOVE COLUMN commands; make
sure you have enough disk space.
METHOD ONE: PRINT REPORT THEN LOAD WITH FILEGATEWAY
This is the fastest method. Use this method if your rows can be
printed in 255 characters or less, you have 10 or fewer columns in the
subset, and you have enough room on your disk to hold a file that
contains all the nonduplicate rows in the table. This method uses a
report which breaks on all the columns in the subset as a filter so
that only one row from each set of duplicates is printed into a file.
Then, the original table is emptied and FileGateway is used to load
the formatted duplicate-free file into the now-empty original table.
A report can contain up to 10 breakpoints, so you can use this method
only if your subset contains 10 or fewer columns.
1. Back up the database. You need to have a good backup on hand
because this method has you delete all the rows in your table. If
anything goes wrong, you want to be able to recover.
2. Define a one-line report based on your table. Choose Configure
and set lines per page to zero. Enter the first column in your
subset of columns as the first break column, the second column as
the second break column, the third as the third break, and so on for
all columns in the subset. Next, press [ESC] to return to the main
Reports EXPRESS menu and choose Edit. Locate all the columns in the
original table in order (left to right) on the top line of the
report and mark the line as a break footer section for the last
break. For example, if there are three columns in the subset (and
three breaks), mark the single line as an F3 section; if there are
nine columns, mark the single line as an F9 section. By breaking on
all the columns in the subset, and only printing the footer, you
guarantee the printing of only one of the duplicates.
3. Use the following commands to print all the non-duplicate rows
from the table into a file and then delete all the original rows.
SET NULL -0-
OUTPUT nondupe.dat
PRINT rptname
OUTPUT SCREEN
*( Now delete all the rows )
DELETE ROWS FROM orig WHERE LIMIT = 999999999
*( Now, regain the space by using the RELOAD or PACK commands )
*( Use FileGateway to load the fixed-format file into ORIG )
If you want to ensure that the row that is printed is the latest
row, add a SORTED BY clause to the PRINT command. For example, if
you are storing the date in a DATE column named INDATE and you want
to make sure that the row printed into the file is the latest row,
sort the data so that the last row in every duplicate group is the
latest row. The last row in the duplicate group is printed in the
break footer. If you sort by INDATE, you must also include all the
break columns in the SORTED BY clause, listed ahead of INDATE. A
SORTED BY clause can contain a maximum of 10 sort columns, so if you
sort by a DATE column, the subset can contain no more than nine
columns. For
example, if you have five breaks, revise the PRINT command to read
as follows:
PRINT rptname SORTED BY break1 break2 break3 break4 break5 indate
where BREAK1, BREAK2, BREAK3, etc. are the names of the break
columns.
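In the CUST example from the NEED section, the subset columns IDNO,
LASTNAME, and ZIP are the break columns. Assuming the entry date is
stored in a DATE column named INDATE, the command might read:
PRINT rptname SORTED BY idno lastname zip indate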
4. Use FileGateway to load the fixed-format file (NONDUPE.DAT) into
the now-empty original table. Set rules off during the load process;
they are not needed.
METHOD TWO: UNLOAD THEN LOAD WITH RULES
The first four steps for the second and third methods are the same:
create a good backup of your database, add a computed column to the
original table that is the concatenation of all the columns in the
subset, make an empty table (EMPTY) that has all the columns from the
original table (ORIG) by using the PROJECT command with the WHERE
LIMIT = 0 clause, and set up a rule to prevent duplicate entries into
the EMPTY table. In both methods, the EMPTY table ends up loaded with
only the nonduplicate rows. The original table is then removed and
the no-longer-empty table is renamed to the original name.
Use this method if your rows are longer than 255 characters and it
matters which duplicate row is kept. This method unloads the data
into a file making sure that the row you want to keep is the first row
in every duplicate group. Then, it uses a rule that allows only the
first row back in. For example, the method outlined below keeps the
most recent row, based on a DATE column (INDATE), by unloading the
data sorted in descending order and then loading it back in with a
rule.
1. Back up the database. During this process you will be modifying
the database structure and removing an entire table from the
database. If anything goes wrong, you want to be able to recover.
2. Use the EXPAND command to add a computed TEXT column (CHECKCOL) to
the original table (ORIG) that is the result of concatenating all
the columns in the subset together. This computed column will be
used to prevent duplicates from being loaded to the empty table you
will create in step three. Be sure that you have enough available
disk space for R:BASE to make a copy of your original table during
the EXPAND process. After the EXPAND process is complete, you may
want to use the RELOAD or PACK command to regain disk space. If
some of the columns in the subset are not TEXT, use the CTXT function
within the computed column expression to convert the non-TEXT items to
TEXT for the concatenation; concatenation works only with TEXT values.
For example, if the columns in your subset are IDNO
(INTEGER), LASTNAME (TEXT 20), and ZIP (TEXT 10) and the longest
IDNO in your table is six characters, use the following EXPAND
command to add the computed column (CHECKCOL):
EXPAND orig WITH checkcol=((CTXT(idno))+lastname+zip) TEXT 36
If IDNO had been TEXT 6 to begin with, the EXPAND command would have
been:
EXPAND orig WITH checkcol=(idno+lastname+zip) TEXT 36
The number on the end is obtained by adding up all the lengths of
the subset columns (6 + 20 + 10).
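If more than one column in the subset is non-TEXT, wrap each one in
CTXT. As a sketch, assuming (hypothetically) that ZIP were stored as an
INTEGER of up to 10 digits instead of TEXT 10, the EXPAND command would
convert both non-TEXT columns:
EXPAND orig WITH checkcol=((CTXT(idno))+lastname+(CTXT(zip))) TEXT 36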
3. Create an empty table (EMPTY) that has the same structure as the
original table (ORIG), change the definition of CHECKCOL in EMPTY
from a computed TEXT column to a normal TEXT column, and key it:
PROJECT empty FROM orig USING ALL WHERE LIMIT = 0
REDEFINE checkcol TO TEXT IN empty
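The step above mentions keying CHECKCOL, but the two commands do not
create a key. Assuming the standard BUILD KEY syntax, you can key
CHECKCOL in EMPTY yourself so that the rule check in step four runs
faster:
BUILD KEY FOR checkcol IN empty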
4. Define a rule to prevent duplicate CHECKCOL values from being
loaded to EMPTY with these commands:
DEFINE
RULES
" " checkcol IN empty NEA checkcol IN empty
END
5. Use the commands listed below to unload the data from ORIG (the
original table) into a file in descending order by a DATE column
(INDATE), load EMPTY with the file, remove ORIG, rename EMPTY to
ORIG, remove the extra column (CHECKCOL), and pack the database.
Note that this method unloads all the data to a disk file and uses
the REMOVE COLUMN command. Both of these processes use disk space.
Make sure you have enough room on your disk.
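*( Unload sorted so the latest row comes first in each duplicate group )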
OUTPUT temp.dat
UNLOAD DATA FOR orig AS ASCII SORTED BY indate=D
OUTPUT SCREEN
SET MESSAGES OFF
SET ERROR MESSAGES OFF
SET BELL OFF
SET RULES ON
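*( Rules reject any row whose CHECKCOL value already exists in EMPTY )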
LOAD empty FROM temp.dat AS ASCII
SET MESSAGES ON
SET ERROR MESSAGES ON
SET BELL ON
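*( Replace the original table with the duplicate-free copy )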
REMOVE TABLE orig
RENAME TABLE empty TO orig
REMOVE COLUMN checkcol FROM orig
ERASE temp.dat
PACK
While loading the data, R:BASE checks the rules and adds only the
first row of each set of duplicates to the EMPTY table.
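If you want to confirm the result before ORIG is removed, you can stop
after the LOAD command and list each CHECKCOL value in EMPTY with its
count (this assumes the TALLY command is available in your version of
R:BASE); every value should appear exactly once:
TALLY checkcol FROM empty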
METHOD THREE: APPEND WITH RULES
Use this method if it does not matter which of the duplicate rows is
kept. First, repeat steps one through four from method two, then
continue with this step five:
5. Use the following commands to append ORIG (the original table) to
EMPTY, remove ORIG, rename EMPTY to ORIG, remove the extra column
(CHECKCOL), and pack the database:
SET MESSAGES OFF
SET ERROR MESSAGES OFF
SET BELL OFF
SET RULES ON
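*( Rules allow only the first row of each duplicate group into EMPTY )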
APPEND orig TO empty
SET MESSAGES ON
SET ERROR MESSAGES ON
SET BELL ON
REMOVE TABLE orig
RENAME TABLE empty TO orig
REMOVE COLUMN checkcol FROM orig
PACK
During the append process, R:BASE checks the rules and adds only the
first row in each set of duplicates to the EMPTY table.