""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    DELETING DUPLICATE ROWS BASED ON A SUBSET OF COLUMNS
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    DATE      :  9/87                    NUMBER       :  87-9-1
    PRODUCT   :  R:BASE SYSTEM V         VERSIONS     :  ALL
    CATEGORY  :  DELETE                  SUBCATEGORY  :  DUPES/MAINTENANCE
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    NEED:
    """"
    I need to delete duplicate rows in my customer table (CUST) based on a
    combination of only three columns (IDNO, LASTNAME, and ZIP) rather
    than all 20 columns in the table.  The DELETE DUPLICATE command looks
    at all the columns.   This does not work for me because some of the
    other columns may not be duplicated.
 
 
    SOLUTION:
    """"""""
    Here are three methods you can use to delete duplicates based on a
    subset of columns.  Choose the one that best fits your needs.
 
    The first method (Print Report Then Load with FileGateway) is the
    fastest.  It also allows you to set up criteria, such as "keep the
    most recent row," to tell R:BASE which of the duplicates to keep.
    This method prints all the data into a file with an R:BASE report, so
    you must be able to fit each of your rows onto a 255-character, fixed-
    format line.
 
    The second method (Unload Then Load with Rules) also allows you to set
    up some criteria to determine which row to keep.  This method takes
    longer than the first method because it requires that you add a
    computed column to the table with the EXPAND command, and it checks
    rules during the load process.  However, there is no limit to the
    length of your row.  If your rows will not fit into a 255-character,
    fixed-format line, and you need to keep only the latest row, use this
    second method.
 
    The third method (APPEND with RULES) does not require the creation of
    a file.  Instead, it uses RULES to prevent duplicates from getting
    into the data during an APPEND.  Use this third method only if it does
    not matter which duplicate is kept.
 
    Methods two and three use the EXPAND and REMOVE COLUMN commands; make
    sure you have enough disk space.
 
 
    METHOD ONE: PRINT REPORT THEN LOAD WITH FILEGATEWAY
 
    This is the fastest method.  Use this method if your rows can be
    printed in 255 characters or less, you have 10 or less columns in the
    subset, and you have enough room on your disk to hold a file that
    contains all the nonduplicate rows in the table.  This method uses a
 
 
 
    MICRORIM ONLINE October 1987 --------------------------- Page 1
 
 
 
 
 
    report which breaks on all the columns in the subset as a filter so
    that only one row from each set of duplicates is printed into a file.
    Then, the original table is emptied and FileGateway is used to load
    the formatted duplicate-free file into the now-empty original table.
    A report can contain up to 10 breakpoints so you can use this method
    if your subset of columns is 10 or less.
 
    1.  Backup the database.  You need to have a good backup on hand
      because this method has you delete all the rows in your table.  If
      anything goes wrong, you want to be able to recover.
 
    2.  Define a one-line report based on your table.  Choose Configure
      and set lines per page to zero.  Enter the first column in your
      subset of columns as the first break column, the second column as
      the second break column, the third as the third break, and so on for
      all columns in the subset.  Next, press [ESC] to return to the main
      Reports EXPRESS menu and choose Edit.  Locate all the columns in the
      original table in order (left to right) on the top line of the
      report and mark the line as a break footer section for the last
      break.  For example, if there are three columns in the subset (and
      three breaks), mark the single line as an F3 section; if there are
      nine columns, mark the single line as an F9 section.  By breaking on
      all the columns in the subset, and only printing the footer, you
      guarantee the printing of only one of the duplicates.
 
    3.  Use the following commands to print all the non-duplicate rows
      from the table into a file and then delete all the original rows.
 
         SET NULL -0-
         OUTPUT nondupe.dat
         PRINT rptname
         OUTPUT SCREEN
         *( Now delete all the rows )
         DELETE ROWS FROM orig WHERE LIMIT = 999999999
         *( Now, regain the space by using the RELOAD or PACK commands )
         *( Use FileGateway to load the fixed-format file into ORIG )
 
      If you want to ensure that the row that is printed is the latest
      row, add a SORTED BY clause to the PRINT command.  For example, if
      you are storing the date in a DATE column named INDATE and you want
      to make sure that the row printed into the file is the latest row,
      sort the data so that the last row in every duplicate group is the
      latest row.  The last row in the duplicate group is printed in the
      break footer.  If you sort by INDATE, you need to also include all
      the break columns in the SORTED BY clause.  You can only have a
      maximum of 10 sort columns so if you need to sort by a DATE column
      you can only fit a maximum of nine columns in the subset.  For
      example, if you have five breaks, revise the PRINT command to read
      as follows:
 
         PRINT rptname SORTED BY break1 break2 break3 break4 break5 indate
 
      where BREAK1, BREAK2, BREAK3, etc. are the names of the break
      columns.
 
    4.  Use FileGateway to load the fixed-format file (NONDUPE.DAT) into
      the now empty original table.  Set rule off during the load process,
 
 
 
    MICRORIM ONLINE October 1987 --------------------------- Page 2
 
 
 
 
 
      they are not needed.
 
 
    METHOD TWO: UNLOAD THEN LOAD WITH RULES
 
    The first four steps for the second and third methods are the same:
    create a good backup of your database, add a computed column to the
    original table that is the concatenation of all the columns in the
    subset, make an empty table (EMPTY) that has all the columns from the
    original table (ORIG) by using the PROJECT command with the WHERE
    LIMIT = 0 clause, and set up a rule to prevent duplicate entries into
    the EMPTY table.  In both methods the empty table ends up loaded with
    the only the nonduplicates.  The original table is then removed and
    the no-longer-empty table is renamed to the original name.
 
    Use this method if your rows are longer than 255 characters and it
    matters which duplicate row is kept.  This method unloads the data
    into a file making sure that the row you want to keep is the first row
    in every duplicate group.  Then, it uses a rule that allows only the
    first row back in.  For example, the method outlined below keeps the
    most recent row, based on a DATE column (INDATE), by unloading the
    data sorted in descending order and then loading it back in with a
    rule.
 
    1.  Backup the database.  During this process you will be modifying
      the database structure and removing an entire table from the
      database.  If anything goes wrong, you want to be able to recover.
 
    2.  Use the EXPAND command to add a computed TEXT column (CHECKCOL) to
      the original table (ORIG) that is the result of concatenating all
      the columns in the subset together.  This computed column will be
      used to prevent duplicates from being loaded to the empty table you
      will create in step three.  Be sure that you have enough available
      disk space for R:BASE to make a copy of your original table during
      the EXPAND process.  After the EXPAND process is complete, you may
      want to use the RELOAD or PACK command to regain disk space.  If
      some of the columns in the subset are not TEXT, use the CTXT
      function, within the computed column expression, to convert nonTEXT
      items to TEXT for the concatenation.  Concatenation only works with
      TEXT.  For example, if the columns in your subset are IDNO
      (INTEGER), LASTNAME (TEXT 20), and ZIP (TEXT 10) and the longest
      IDNO in your table is six characters, use the following EXPAND
      command to add the computed column (CHECKCOL):
 
         EXPAND orig WITH checkcol=((CTXT(idno))+lastname+zip) TEXT 36
 
      If IDNO had been TEXT 6 to begin with, the EXPAND command would have
      been:
 
         EXPAND orig WITH checkcol=(idno+lastname+zip) TEXT 36
 
      The number on the end is obtained by adding up all the lengths of
      the subset columns (6 + 20 + 10).
 
    3.  Create an empty table (EMPTY) that has the same structure as the
      original table (ORIG), change the definition of CHECKCOL in EMPTY
      from a computed TEXT column to a normal TEXT column, and key it:
 
 
 
    MICRORIM ONLINE October 1987 --------------------------- Page 3
 
 
 
 
 
 
         PROJECT empty FROM orig USING ALL WHERE LIMIT = 0
         REDEFINE checkcol TO TEXT IN empty
 
 
    4.  Define a rule to prevent duplicate CHECKCOL values from being
      loaded to EMPTY with these commands:
 
         DEFINE
         RULES
         " " checkcol IN empty NEA checkcol IN empty
         END
 
    5.  Use the commands listed below to unload the data from ORIG (the
      original table) into a file in descending order by a DATE column
      (INDATE), load EMPTY with the file, remove ORIG, rename EMPTY to
      ORIG, remove the extra column (CHECKCOL), and pack the database.
      Note that this method unloads all the data to a disk file and uses
      the REMOVE COLUMN command.  Both of these processes use disk space.
      Make sure you have enough room on your disk.
 
         OUTPUT temp.dat
         UNLOAD DATA FOR orig AS ASCII SORTED BY indate=D
         OUTPUT SCREEN
         SET MESSAGES OFF
         SET ERROR MESSAGES OFF
         SET BELL OFF
         SET RULES ON
         LOAD empty FROM temp.dat AS ASCII
         SET MESSAGES ON
         SET ERROR MESSAGES ON
         SET BELL ON
         REMOVE TABLE orig
         RENAME TABLE empty TO orig
         REMOVE COLUMN checkcol FROM orig
         ERASE temp.dat
         PACK
 
      While loading the data, R:BASE checks the rules and adds only the
      first row of each set of duplicates, to the EMPTY table.
 
 
    METHOD THREE:  APPEND WITH RULES
 
    Use this method if it does not matter which of the duplicate rows is
    kept.  First, repeat steps one through four from method two, then
    continue with this step five:
 
    5.  Use the following commands to append ORIG (the original table) to
      EMPTY, remove ORIG, rename EMPTY to ORIG, remove the extra column
      (CHECKCOL), and pack the database:
 
         SET MESSAGES OFF
         SET ERROR MESSAGES OFF
         SET BELL OFF
         SET RULES ON
         APPEND orig TO empty
 
 
 
    MICRORIM ONLINE October 1987 --------------------------- Page 4
 
 
 
 
 
         SET MESSAGES ON
         SET ERROR MESSAGES ON
         SET BELL ON
         REMOVE TABLE orig
         RENAME TABLE empty TO orig
         REMOVE COLUMN checkcol FROM orig
         PACK
 
      During the append process, R:BASE checks the rules and adds only the
      first row, in a set of duplicates, to the EMPTY table.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
    MICRORIM ONLINE October 1987 --------------------------- Page 5