=======================================================================
SEARCHING TEXT COLUMNS
=======================================================================
PRODUCT: R:BASE VERSION : 3.1A Or Higher
=======================================================================
AREA : PROGRAMMING CATEGORY: MANIPULATION DOCUMENT#: 665
=======================================================================
One function useful in almost every database is the ability to search
on a text column, for example, lastname. R:BASE provides a number of
different methods to do this. In this edition of the Exchange, an
article by Royal Harris, "Putting IHASH To Work", describes a method
that worked well for him, but may not suit your database or
application. How do you decide what method is best for you? There
are a number of factors to consider, the two most important being ease
of use and performance.
Ease of Use
-----------
Ease of use is very subjective. What appears easy to one person may
appear incredibly convoluted and complex to another. This is an area
best left to the individual developer's discretion and is not
addressed in this article. The article touches on the amount of
custom programming that may be required to implement a particular
method, and adding any needed columns to the database.
Performance
-----------
Proper use of indexing is vital to performance. You'll want to
review information on indexing from the Command Dictionary entry on
Indexes, the Supercharging R:BASE booklet, and articles in the
May/June 1992 and July/August 1991 issues of the Exchange.
The single most important factor in determining the effectiveness of
an index is the uniqueness of index values. When R:BASE builds an
index, it builds an entry for each unique value. As it goes through
the data, if it finds two rows of data with the same index value it
builds a "multiple occurrence table" (MOT) for that index value. A
unique index value is faster than a value with multiple index
occurrences. You can have multiple occurrences of index values if
you have more than one row with the same data value in a column, or
if you have a computed column (such as IHASH) whose values share the
same result.
Note that we are talking about unique index values, not data values.
An index value may or may not be the same as a data value. If your
column datatype is INTEGER or TEXT 4, for example, your index value
and data value will be the same. If your column is TEXT 16, then
your data value must be converted (hashed) into a 4-byte index
value. Long text columns that are indexed, then, have a higher
probability of unique data values creating non-unique index values.
RSCOPE can be used to check indexes and see how many duplicate
(multiple ocurrence) versus unique entries there are, and how many
times R:BASE must read the index file to find the value. An index
loses its effectiveness as duplicate values increase and R:BASE must
make more reads and comparisons.
Indexed columns can also affect performance when adding or changing
data in a table. Indexed columns must be updated when a row is added
or when the index column value is changed. The number of indexed
columns in a table affects the speed with which rows are added or
changed in the table. Take this into account when adding special
columns for searching. It is faster to load and change data on a
table with one indexed column than on a table with seven indexed
columns.
A computed column must also be updated when a new row is added to the
table or when the data on which it is based changes. This is
important when deciding how many, if any, computed columns you want
to add to your database for searching purposes. Incorporating
indexed computed columns for searching slows the process of adding or
changing data in the table.
Methods of searching
--------------------
A number of different searching methods are described below with
discussion of the pros and cons of each method. This should help you
pick a text searching method or combination of methods that fits your
needs. All the examples use a table named EMPLOYEE and a column
named LASTNAME.
Using the text column only
--------------------------
This is the most straightforward method. Simply index the LASTNAME
column and use the where clause, WHERE LASTNAME = value, for
searching. The equals operator (=) will use indexes. Using other
operators on a TEXT column with a defined length greater than 4 will
not use indexes.
The advantage of this method is it's simplicity: no columns to add
to the database, no extra programming, and users can easily do ad hoc
queries through the R:BASE main menu.
On a LASTNAME column, you can have a fairly effective index even
though the data values must be hashed to a 4-byte index value. You'll
have some duplicate index values but names are fairly well
distributed throughout the alphabet. On a sample 42,000 row table,
performance was comparable to all other methods when searching for
a unique name or a common name (such as Johnson) that occurs many
times. You will, of course, need to test this on your data.
What this method doesn't provide is options. You need to enter the
complete lastname and you need to enter it exactly; if you leave off
a letter or misspell it, nothing is found. If you don't find the
person did you misspell the name or are they really not there? Also,
if you are searching on a text column other than LASTNAME (for
example, PART_NUMBER) you may find that just indexing the column is
not efficient and you'll want to use an alternative method.
Computed column using IHASH
---------------------------
R:BASE includes a function that can be used to create an integer
value from a text value. The function was designed to create
effective integer keys from long text columns. The function, IHASH,
converts the entire text value, or just a specified number of
characters.
Using this method is more complex that just indexing the LASTNAME
column. First you need to add a computed column to your table using
the IHASH function on the LASTNAME column to convert its text to
integer values. You can modify your table through the Create/modify
option from the Info pulldown on the R:BASE main menu or use the
ALTER TABLE command:
ALTER TABLE employee ADD Hash_Lname=(IHASH(lastname,0)) INTEGER
The IHASH function converts the entire name to integer when used
with the parameter 0. A different parameter converts the
specified number of characters from the name, starting at the first
character. For example, the parameter 7 will convert the first 7
characters of the lastname to an integer value. Deciding on the
number of characters to convert can be one of the hardest things
about using this method. Consider the relationships expressed in the
following chart:
| Convert FEW characters | Convert MORE characters
-------+---------------------------+--------------------------
PROS | less input required | less duplicate values
-------+---------------------------+--------------------------
CONS | greater duplicate values | more input required
-------+---------------------------+--------------------------
After adding the computed column to your table, you need to use some
programming commands as shown below to query that column. Using the
IHASH function directly in a WHERE clause won't use indexes. First
set a variable equal to IHASH of the value you're searching for, then
use the variable in the WHERE clause. When using an IHASH column for
searching, you won't be able to do ad hoc queries from the R:BASE main
menu .
SET VAR vname = (IHASH('Smith',0))
SEL * FROM employee WHERE Hash_Lname = .vname
This method does provide greater flexibility in that you can have
users enter anywhere from 1 character to the entire name based on the
number of characters you specify in the IHASH function. For example,
add a computed column to the table that will IHASH the first four
characters of the name. Then, in your program, check the length that
the user enters and if it's greater than four characters use an extra
condition on your WHERE clause.
FILLIN vname USING 'Enter lastname (at least 4 characters): '
SET VAR vlen1=(SLEN(.vname)),vname1=(SGET(.VNAME,4,1)),+
vhash=(IHASH(.vname1,4))
IF vlen1 > 4 THEN
CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
FROM employee WHERE Hash_Lname=.vhash AND +
(SGET(lastname,.vlen1,1))=.vname
ELSE
CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
FROM employee WHERE Hash_Lname=.vhash
ENDIF
A user can enter any number of letters for use with an IHASH
computation, but must enter at least as many characters as specified
in the IHASH column definition or enter the full name. If the entry
less than the specified number of characters and less than the full
length of the name, the correct data is not found. For example, with
a column defined as (IHASH(lastname,7)), entering "WILL" will not
find "WILLIAMS", it will only find "WILL".
The advantages of using a computed column with the IHASH function are
that you can turn an inefficient TEXT index into an efficient INTEGER
index and you can provide flexibility in searching. Users will have
a larger selection of names to choose from and can select the
appropriate person from the list. For example, entering WILLIAM will
find WILLIAMSON, WILLIAM, and WILLIAMS if you use IHASH(lastname,7).
A disadvantage of this method is that you need to add columns to
your database. An extra computed column can slow down data entry. If
you are tight on disk space this may not be an option. To determine
how much additional disk space you'll need for an IHASH column, take
the number of rows in the table and multiply by 4. The answer is the
number of bytes of disk space you'll need for the additional column.
Use the SOUNDSLIKE operator
---------------------------
R:BASE has an undocumented WHERE clause operator,SOUNDSLIKE, that can
by used to search text columns. It uses the Soundex algorithm, the
same one that is used by the U.S. Census Bureau to store census
information. Like its name implies, the SOUNDSLIKE operator uses the
Soundex algorithm to find similar sounding names. The Soundex
algorithm keeps the first character of the name and then converts the
next 3 consonants to numbers using the following chart. Vowels are
ignored, consonants W, Y and H are ignored, double consonants are
treated as one, and side-by-side letters with the same number count
as one. Empty spaces at the end of the code are filled with zeros.
the number represents the letters
---------- ----------------------
1 B P F V
2 C S K G J Q X Z
3 D T
4 L
5 M N
6 R
For example, the name GRAHAM converts to the code G650. "G" for the
first letter, 6 for the "R", the "AHA" is ignored, 5 for the "M" and
then 0 to fill the code out to 4 characters. WILLIAM will convert to
the code W450, WILLIAMS to the code W452.
You can use the SOUNDSLIKE operator directly in a WHERE clause:
BROWSE * FROM employee WHERE lastname SOUNDSLIKE 'Graham'
In addition to rows where the lastname equals Graham, this condition
would also return rows for lastnames Grahm, Grim, Grimm, Green,
Greene, Groome, Gorin, Gahring and many others. But SOUNDSLIKE
"William" will not return "Williams" because the code is different.
The SOUNDSLIKE operator provides a different way of searching. You
don't need to spell the name exactly and it finds many similar
sounding names. But, as it doesn't use indexes, on a large table it
can be slow.
Computed column using SOUNDEX
-----------------------------
You don't need to use the SOUNDSLIKE operator to take advantage of
the Soundex code. With R:BASE 3.1C or higher you can use a UDF that
calculates and returns the Soundex code. Microrim includes a Soundex
UDF with UDF pack #3, available from our Sales Department at
800-628-6990.
To use the UDF, create a computed column with the expression
(UDF('SOUNDEX',lastname)) and datatype TEXT 4. Index the column. As
with the IHASH function, the UDF function can't be used directly in a
WHERE clause. You need to first convert the value then use the
converted value in the WHERE clause so it will use indexes. For
example,
FILLIN vname1 USING 'Enter lastname: '
SET VAR vname2=(UDF('SOUNDEX',.vname))
CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
FROM employee WHERE scode=.vname2
As with the SOUNDSLIKE operator, exact spelling is not required, but
you may get a large number of names to choose from. You can restrict
the number of names by also matching on part of the lastname entered.
For example,
CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
FROM employee WHERE scode=.vname2 AND +
(SGET(lastname,4,1))=(SGET(.vname1,4,1))
This WHERE clause requires that the Soundex code match AND the
first four characters of the lastname match.
This searching method's drawback is having to add a computed column
using the UDF function. Slower processing results because R:BASE must
ZIP out to execute the UDF program for each row.
An advantage is that you don't need to worry about spelling names
exactly. Users can choose from a list of similar sounding names.
Computed column using SGET
--------------------------
Another way to use a computed column to create an efficient index
from a long text field is to create a TEXT 4 column that can be
indexed and used for searching. For example, it can be a computed
column defined as (SGET(lastname,4,1)). This takes the first four
characters of the lastname and places them into another column. Index
this column and use it for searching instead of the LASTNAME column.
A TEXT 4 column does not need to be hashed when it is indexed so the
index value equals the data value. However, this is not necessarily a
very unique index, particularly on names.
What this method does provide is the ability to use the BETWEEN
operator on a text search. Both IHASH and SOUNDEX convert the name
to an indexed value that can be used with the BETWEEN operator.
Because they use a conversion process, however, names might convert
to the same code but are not close together in the alphabet, for
example, Graham and Greene. The SGET takes the first four characters
of the lastname column and does not convert it. You can then use it
with the BETWEEN operator to find a range of names.
The command
CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) +
FROM employee WHERE first_four BETWEEN 'John' and 'John'
will quickly find all names that begin with 'John'. Wayne's Corner
in this issue of the Exchange shows another example of using BETWEEN
to search for a range of values.
Computed column using an expression
-----------------------------------
You may want to define your own expression to create an identifier
for people. You could use the first 4 characters of the lastname
plus the firstname initial, (IHASH(lastname,4)+IHASH(firstname,1)),
or a lastname and zipcode combination, (lastname+zipcode) or
(IHASH(lastname,0)+INT(zipcode)). There are many ways to
combine/convert names into codes to use for searching. Just make sure
you are consistent. If you use the zip code, always use just the
5-digit zip code. Don't mix and match 5-digit and 10-digit zip
codes.
You will want to look at the points raised with the other methods
discussed here, and decide if it will be an effective indexed column
for performance and will it be easy for your users.
Compare the methods
-------------------
By setting up a sample database with different columns and using
different searching methods you can compare the methods. Be sure
your sample database has enough data to provide relevant results. For
example, a table with 100 or 1000 rows, will not produce comparable
results if your production table has 100,000 rows of data.
You will probably find that no one method is best for all names. For
example, the Soundex code works fast for Gillespie but not for
Graham; it's over twice as slow because many more names are found.
Your data and the way you use it will determine the method(s) you
choose. Perhaps you'll decide to provide users with two or more
different searching methods.
The following sample program allows users to search using the
lastname (indexed LASTNAME column), the company name (indexed
computed column, (IHASH(company,10)) ), or the Soundex code of the
lastname (indexed computed column, (UDF('SOUNDEX",lastname)) ). The
program can be used as an Entry/Exit procedure in a form. It
requires a separate ASCII menu file for the search options.
*(SEARCH.RMD)
CLS
CHOOSE vc FROM choice.mnu AT CENTER,CENTER CLEAR
SWITCH (.vc)
CASE 'Lastname'
CLE VAR vans,vkey,vans1
DIALOG 'Enter lastname to search by ' vans,vkey,1 AT 12
IF vkey = '[Esc]' THEN
BREAK
ENDIF
CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
WHERE lastname=.vans ORDER BY lastname,firsname AT CENTER,CENTER +
FOOTING
BREAK
CASE 'Company'
CLE VAR vans,vkey,vans1
DIALOG 'Enter company to search for ' vans,vkey,1 AT 12
IF vkey = '[Esc]' THEN
BREAK
ENDIF
SET VAR vans1 = (IHASH(.vans,7))
CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
WHERE company_hash = .vans1 ORDER BY lastname,firsname AT CENTER,+
CENTER FOOTING
BREAK
CASE 'Soundex code'
CLE VAR vans,vkey,vans1
DIALOG 'Enter lastname to search by using Soundex' vans,vkey,1 AT 12
IF vkey = '[Esc]' THEN
BREAK
ENDIF
SET VAR vans1 = (UDF('SOUNDEX',.vans))
CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust +
WHERE scode=.vans1 ORDER BY lastname,firsname AT CENTER,CENTER +
FOOTING
BREAK
CASE default
CLE VAR vans,vkey,vans1
BREAK
ENDSW
RETURN
ASCII menu file:
choice
popup |Choose searching method|
|Lastname|
|Company |
|Soundex code|