=======================================================================
     SEARCHING TEXT COLUMNS
     =======================================================================
     PRODUCT:  R:BASE          VERSION :  3.1A Or Higher
     =======================================================================
     AREA   :  PROGRAMMING     CATEGORY:  MANIPULATION       DOCUMENT#: 665
     =======================================================================
 
 
     One function useful in almost every database is the ability to search
     on a text column, for example, lastname.  R:BASE provides a number of
     different methods to do this.  In this edition of the Exchange, an
     article by Royal Harris, "Putting IHASH To Work",  describes a method
     that worked well for him, but may not suit your database or
     application.  How do you decide what method is best for you?  There
     are a number of factors to consider, the two most important being ease 
     of use and performance.
 
     Ease of Use
     -----------
     Ease of use is very subjective.  What appears easy to one person may
     appear incredibly convoluted and complex to another.  This is an area
     best left to the individual developer's discretion and is not
     addressed in this article.  The article touches on the amount of
     custom programming that may be required to implement a particular
     method, and adding any needed columns to the database.
 
     Performance
     -----------
     Proper use of indexing is vital to performance.  You'll want to
     review information on indexing from the Command Dictionary entry on
     Indexes, the Supercharging R:BASE booklet, and articles in the
     May/June 1992 and July/August 1991 issues of the Exchange.
 
     The single most important factor in determining the effectiveness of
     an index is the uniqueness of index values.  When R:BASE builds an
     index, it builds an entry for each unique value.  As it goes through
     the data, if it finds two rows of data with the same index value it
     builds a "multiple occurrence table" (MOT) for that index value.  A
     unique index value is faster than a value with multiple index
     occurrences.  You can have multiple occurrences of index values if
     you have more than one row with the same data value in a column, or
     if you have a computed column (such as IHASH) whose values share the
     same result.
 
     Note that we are talking about unique index values, not data values.
     An index value may or may not be the same as a data value. If your
     column datatype is INTEGER or TEXT 4, for example, your index value
     and data value will be the same.  If your column is TEXT 16, then
     your data value must be converted (hashed) into a 4-byte index
     value.  Long text columns that are indexed, then, have a higher
     probability of unique data values creating non-unique index values.
 
     RSCOPE can be used to check indexes and see how many duplicate
     (multiple ocurrence) versus unique entries there are, and how many
     times R:BASE must read the index file to find the value.  An index
     loses its effectiveness as duplicate values increase and R:BASE must
     make more reads and comparisons.
 
     
     
     
     
     
     
     
     
     Indexed columns can also affect performance when adding or changing
     data in a table.  Indexed columns must be updated when a row is added
     or when the index column value is changed.  The number of indexed
     columns in a table affects the speed with which rows are added or
     changed in the table.  Take this into account when adding special
     columns for searching.  It is faster to load and change data on a
     table with one indexed column than on a table with seven indexed
     columns.
 
     A computed column must also be updated when a new row is added to the
     table or when the data on which it is based changes. This is
     important when deciding how many, if any, computed columns you want
     to add to your database for searching purposes.  Incorporating
     indexed computed columns for searching slows the process of adding or
     changing data in the table.
 
     Methods of searching
     --------------------
     A number of different searching methods are described below with
     discussion of the pros and cons of each method.  This should help you
     pick a text searching method or combination of methods that fits your
     needs.  All the examples use a table named EMPLOYEE and a column
     named LASTNAME.
 
     Using the text column only
     --------------------------
     This is the most straightforward method.  Simply index the LASTNAME
     column and use the where clause, WHERE LASTNAME = value, for
     searching.  The equals operator (=) will use indexes.  Using other
     operators on a TEXT column with a defined length greater than 4 will
     not use indexes.
 
     The advantage of this method is it's simplicity:  no columns to add
     to the database, no extra programming, and users can easily do ad hoc
     queries through the R:BASE main menu.
 
     On a LASTNAME column, you can have a fairly effective index even
     though the data values must be hashed to a 4-byte index value. You'll
     have some duplicate index values but names are fairly well
     distributed throughout the alphabet.  On a sample 42,000 row table,
     performance was comparable to all other methods when searching for
     a unique name or a common name (such as Johnson) that occurs many
     times.  You will, of course, need to test this on your data.
 
     What this method doesn't provide is options.  You need to enter the
     complete lastname and you need to enter it exactly; if you leave off
     a letter or misspell it, nothing is found.  If you don't find the
     person did you misspell the name or are they really not there?  Also,
     if you are searching on a text column other than LASTNAME (for
     example, PART_NUMBER) you may find that just indexing the column is
     not efficient and you'll want to use an alternative method.
 
     Computed column using IHASH
     ---------------------------
     R:BASE includes a function that can be used to create an integer
     value from a text value.  The function was designed to create
     effective integer keys from long text columns.  The function, IHASH,
     converts the entire text value, or just a specified number of
     characters.
 
     
     
     
     
     
     
     Using this method is more complex that just indexing the LASTNAME
     column. First you need to add a computed column to your table using
     the IHASH function on the LASTNAME column to convert its text to
     integer values.   You can modify your table through the Create/modify
     option from the Info pulldown on the R:BASE main menu or use the
     ALTER TABLE command:
 
      ALTER TABLE employee ADD Hash_Lname=(IHASH(lastname,0)) INTEGER
 
     The IHASH function converts the entire name to integer when used
     with the parameter 0.  A different parameter converts the
     specified number of characters from the name, starting at the first
     character. For example, the parameter 7 will convert the first 7
     characters of the lastname to an integer value.  Deciding on the
     number of characters to convert can be one of the hardest things
     about using this method.  Consider the relationships expressed in the
     following chart:
 
             |  Convert FEW characters   |  Convert MORE characters
      -------+---------------------------+--------------------------
       PROS  |    less input required    |    less duplicate values
      -------+---------------------------+--------------------------
       CONS  |  greater duplicate values |     more input required
      -------+---------------------------+--------------------------
 
     After adding the computed column to your table, you need to use some
     programming commands as shown below to query that column.  Using the
     IHASH function directly in a WHERE clause won't use indexes.  First
     set a variable equal to IHASH of the value you're searching for, then
     use the variable in the WHERE clause.  When using an IHASH column for
     searching, you won't be able to do ad hoc queries from the R:BASE main
     menu .
 
      SET VAR vname = (IHASH('Smith',0))
      SEL * FROM employee WHERE Hash_Lname = .vname
 
     This method does provide greater flexibility in that you can have
     users enter anywhere from 1 character to the entire name based on the
     number of characters you specify in the IHASH function.  For example,
     add a computed column to the table that will IHASH the first four
     characters of the name.  Then, in your program, check the length that
     the user enters and if it's greater than four characters use an extra
     condition on your WHERE clause.
 
     FILLIN vname USING 'Enter lastname (at least 4 characters): '
     SET VAR vlen1=(SLEN(.vname)),vname1=(SGET(.VNAME,4,1)),+
      vhash=(IHASH(.vname1,4))
     IF vlen1 > 4 THEN
       CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
      FROM employee WHERE Hash_Lname=.vhash AND +
      (SGET(lastname,.vlen1,1))=.vname
     ELSE
    CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
      FROM employee WHERE Hash_Lname=.vhash
     ENDIF
 
     A user can enter any number of letters for use with an IHASH
     computation, but must enter at least as many characters as specified
     in the IHASH column definition or enter the full name.  If the entry
     less than the specified number of characters and less than the full
     
     
     
     
     
     
     length of the name, the correct data is not found.  For example, with
     a column defined as (IHASH(lastname,7)), entering "WILL" will not
     find "WILLIAMS", it will only find "WILL".
 
     The advantages of using a computed column with the IHASH function are
     that you can turn an inefficient TEXT index into an efficient INTEGER
     index and you can provide flexibility in searching.  Users will have
     a larger selection of names to choose from and can select the
     appropriate person from the list.  For example, entering WILLIAM will
     find WILLIAMSON, WILLIAM, and WILLIAMS if you use IHASH(lastname,7).
 
     A disadvantage of this method is that you need to add columns to
     your database.  An extra computed column can slow down data entry. If
     you are tight on disk space this may not be an option.  To determine
     how much additional disk space you'll need for an IHASH column, take
     the number of rows in the table and multiply by 4.  The answer is the
     number of bytes of disk space you'll need for the additional column.
 
     Use the SOUNDSLIKE operator
     ---------------------------
     R:BASE has an undocumented WHERE clause operator,SOUNDSLIKE, that can
     by used to search text columns.  It uses the Soundex algorithm, the
     same one that is used by the U.S. Census Bureau to store census
     information.  Like its name implies, the SOUNDSLIKE operator uses the
     Soundex algorithm to find similar sounding names.  The Soundex
     algorithm keeps the first character of the name and then converts the
     next 3 consonants to numbers using the following chart.  Vowels are
     ignored, consonants W, Y and H are ignored, double consonants are
     treated as one, and side-by-side letters with the same number count
     as one.  Empty spaces at the end of the code are filled with zeros.
 
      the number    represents the letters
      ----------    ----------------------
      1          B P F V
      2          C S K G J Q X Z
      3          D T
      4          L
      5          M N
      6          R
 
     For example, the name GRAHAM converts to the code G650.  "G" for the
     first letter, 6 for the "R", the "AHA" is ignored, 5 for the "M" and
     then 0 to fill the code out to 4 characters.  WILLIAM will convert to
     the code W450, WILLIAMS to the code W452.
 
     You can use the SOUNDSLIKE operator directly in a WHERE clause:
 
    BROWSE * FROM employee WHERE lastname SOUNDSLIKE 'Graham'
 
     In addition to rows where the lastname equals Graham, this condition
     would also return rows for lastnames Grahm, Grim, Grimm, Green,
     Greene, Groome, Gorin, Gahring and many others.  But SOUNDSLIKE
     "William" will not return "Williams" because the code is different.
 
     The SOUNDSLIKE operator provides a different way of searching.  You
     don't need to spell the name exactly and it finds many similar
     sounding names.  But, as it doesn't use indexes, on a large table it
     can be slow.
 
 
     
     
     
     
     
     
     Computed column using SOUNDEX
     -----------------------------
     You don't need to use the SOUNDSLIKE operator to take advantage of
     the Soundex code.  With R:BASE 3.1C or higher you can use a UDF that
     calculates and returns the Soundex code.  Microrim includes a Soundex
     UDF with UDF pack #3, available from our Sales Department at
     800-628-6990.
 
     To use the UDF, create a computed column with the expression
     (UDF('SOUNDEX',lastname)) and datatype TEXT 4.  Index the column.  As
     with the IHASH function, the UDF function can't be used directly in a
     WHERE clause.  You need to first convert the value then use the
     converted value in the WHERE clause so it will use indexes.  For
     example,
 
      FILLIN vname1 USING 'Enter lastname: '
      SET VAR vname2=(UDF('SOUNDEX',.vname))
      CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
        FROM employee WHERE scode=.vname2
 
     As with the SOUNDSLIKE operator, exact spelling is not required, but
     you may get a large number of names to choose from.  You can restrict
     the number of names by also matching on part of the lastname entered.
     For example,
 
      CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
        FROM employee WHERE scode=.vname2 AND +
        (SGET(lastname,4,1))=(SGET(.vname1,4,1))
 
     This WHERE clause requires that the Soundex code match AND the
     first four characters of the lastname match.
 
     This searching method's drawback is having to add a computed column
     using the UDF function.  Slower processing results because R:BASE must
     ZIP out to execute the UDF program for each row.
 
     An advantage is that you don't need to worry about spelling names
     exactly.  Users can choose from a list of similar sounding names.
 
     Computed column using SGET
     --------------------------
     Another way to use a computed column to create an efficient index
     from a long text field is to create a TEXT 4 column that can be
     indexed and used for searching.  For example, it can be a computed
     column defined as (SGET(lastname,4,1)).  This takes the first four
     characters of the lastname and places them into another column. Index
     this column and use it for searching instead of the LASTNAME column.
     A TEXT 4 column does not need to be hashed when it is indexed so the
     index value equals the data value.  However, this is not necessarily a
     very unique index, particularly on names.
 
     What this method does provide is the ability to use the BETWEEN
     operator on a text search.  Both IHASH and SOUNDEX convert the name
     to an indexed value that can be used with the BETWEEN operator.
     Because they use a conversion process, however, names might convert
     to the same code but are not close together in the alphabet, for
     example, Graham and Greene.  The SGET takes the first four characters
     of the lastname column and does not convert it.  You can then use it
     with the BETWEEN operator to find a range of names.
 
     
     
     
     
     
     
     The command
 
      CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
        FROM employee WHERE first_four BETWEEN 'John' and 'John'
 
     will quickly find all names that begin with 'John'.  Wayne's Corner
     in this issue of the Exchange shows another example of using BETWEEN
     to search for a range of values.
 
     Computed column using an expression
     -----------------------------------
     You may want to define your own expression to create an identifier
     for people.  You could use the first 4 characters of the lastname
     plus the firstname initial, (IHASH(lastname,4)+IHASH(firstname,1)),
     or a lastname and zipcode combination, (lastname+zipcode) or
     (IHASH(lastname,0)+INT(zipcode)).  There are many ways to
     combine/convert names into codes to use for searching. Just make sure
     you are consistent.  If you use the zip code, always use just the
     5-digit zip code.  Don't mix and match 5-digit and 10-digit zip
     codes.
 
     You will want to look at the points raised with the other methods
     discussed here, and decide if it will be an effective indexed column
     for performance and will it be easy for your users.
 
 
     Compare the methods
     -------------------
     By setting up a sample database with different columns and using
     different searching methods you can compare the methods.  Be sure
     your sample database has enough data to provide relevant results. For
     example, a table with 100 or 1000 rows, will not produce comparable
     results if your production table has 100,000 rows of data.
 
     You will probably find that no one method is best for all names.  For
     example, the Soundex code works fast for Gillespie but not for
     Graham; it's over twice as slow because many more names are found.
     Your data and the way you use it will determine the method(s) you
     choose.  Perhaps you'll decide to provide users with two or more
     different searching methods.
 
     The following sample program allows users to search using the
     lastname (indexed LASTNAME column), the company name (indexed
     computed column, (IHASH(company,10)) ), or the Soundex code of the
     lastname (indexed computed column, (UDF('SOUNDEX",lastname)) ).  The
     program can be used as an Entry/Exit procedure in a form.  It
     requires a separate ASCII menu file for the search options.
 
     *(SEARCH.RMD)
     CLS
     CHOOSE vc FROM choice.mnu AT CENTER,CENTER CLEAR
     SWITCH (.vc)
     CASE 'Lastname'
       CLE VAR vans,vkey,vans1
       DIALOG 'Enter lastname to search by ' vans,vkey,1 AT 12
       IF vkey = '[Esc]' THEN
     BREAK
       ENDIF
       CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
     WHERE lastname=.vans ORDER BY lastname,firsname AT CENTER,CENTER +
     FOOTING
       
       
       
       
       
       BREAK
     CASE  'Company'
       CLE VAR vans,vkey,vans1
       DIALOG 'Enter company to search for ' vans,vkey,1 AT 12
       IF vkey = '[Esc]' THEN
     BREAK
       ENDIF
       SET VAR vans1 = (IHASH(.vans,7))
       CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
     WHERE company_hash = .vans1 ORDER BY lastname,firsname AT CENTER,+
     CENTER FOOTING
       BREAK
     CASE 'Soundex code'
       CLE VAR vans,vkey,vans1
       DIALOG 'Enter lastname to search by using Soundex' vans,vkey,1 AT 12
       IF vkey = '[Esc]' THEN
     BREAK
       ENDIF
       SET VAR vans1 = (UDF('SOUNDEX',.vans))
       CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
     WHERE scode=.vans1 ORDER BY lastname,firsname AT CENTER,CENTER +
     FOOTING
       BREAK
     CASE default
       CLE VAR vans,vkey,vans1
       BREAK
     ENDSW
     RETURN
 
     ASCII menu file:
 
     choice
     popup |Choose searching method|
     |Lastname|
     |Company   |
     |Soundex code|