Doc# 746
     =====================================================================
     Calculating the Mean, Median and Mode
     =====================================================================
     Products: R:BASE 3.x, R:BASE 4.0, R:BASE 4.0a, R:BASE 4.5
     =====================================================================
     Area: Application Solutions            CATALOG: Programming in R:BASE
     =====================================================================
 
     The mean, the median and the mode are three common statistical 
     measures for looking at how data is grouped around the center or 
     midpoint of data. Many complex formulas and variations can be used; 
     here we show how to calculate the mean, the median and the mode using 
     simple raw data methods.
     
     The mean is the average -- the sum of all the values divided by the 
     number of values. The mean is affected by values that are very 
     different. A single very low value or very high value can skew the 
     mean to the left or right.
     
     The median is the exact midpoint of the ordered set of values -- the 
     number where half the values are below it and half are above it. With 
     an odd number of values the median is the actual midpoint value. With 
     an even number of rows, the median is the average of the two middle 
     numbers. Very low or very high numbers do not affect the median as 
     they do the mean.
     
     The mode is the single value with the most occurrences. If each value 
     occurs only once, there is no mode. If two values occur with the same 
     frequency, they are both considered the mode and the data is said to 
     be bimodal.  The data can also be multimodal where more than two
     values occur with the same frequency.
     
     For example, look at the mean, median and mode for this
     list of 10 test scores.
     
                       65
                       75 ---- The mode - the value with the 
     The mean or       75 ---- most occurrences
     average is 84____ 77
                       87 ____ The median is 87.5, the
                       88      average of 87 and 88;
                       92      five values are above it,
                       93      five values below it
                       94     
                       98
     
     These measures when used together show if the data distribution is 
     symmetrical or skewed. When the mean, median and mode are all the 
     same or very close together, there is a symmetrical distribution. 
     When the median is less than the mean, the data is skewed to the
     left. When the median is greater than the mean, the data is skewed 
     to the right. The size of the difference indicates the degree the 
     data is skewed. In the above example, the low score of 65 skews the 
     mean slightly to the left.
     
     The mean or average is easy to calculate in R:BASE -- just use the 
     AVG function with the SELECT command. The mode is also calculated 
     using SELECT, but it requires two commands -- one to find the maximum 
     number of occurrences in the data for any value, the second to find 
     the actual values. The median is a more difficult number to find; the 
     easiest method is to use an ordered view and the special R:BASE 
     keyword COUNT. The command file below calculates the mean, median and 
     mode for a selected column. The column cannot be a TEXT, NOTE, DATE 
     or TIME datatype. The column and table name are held in variables. 
     The calculations use these as ampersand variables. See the article 
     "Working with Variables" in the September/October 1993 Exchange for
     more information on using ampersand variables.
     
     
        CLS
        SET MESSAGES OFF
        SET ERROR MESSAGES OFF
        CLEAR ALL VAR
        CHOOSE vtab FROM #TABLES AT CENTER,CENTER
        -- if no table selected, end the program
        SET VAR vkey = (LASTKEY(0))
        IF vkey = '[Esc]' THEN
        RETURN
        ENDIF
        CHOOSE vcol FROM #COLUMNS IN &vtab CASCADE
        -- if no column selected, end the program
        SET VAR vkey = (LASTKEY(0))
        IF vkey = '[Esc]' THEN
        RETURN
        ENDIF
        WRITE 'Processing...' AT 24,3
        -- check to see if the selected column is a valid
        datataype
        SELECT sys_data_type INTO vdatatype FROM sys_columns
          WHERE sys_column_name = .vcol
        IF vdatatype IN (3,8,4,5) THEN
            CLS FROM 24,1 TO 24, 50
            PAUSE FOR 30 USING 'You selected a TEXT,NOTE,DATE or TIME +
            column. You can''t calculate statistics on this column.'  +
            AT CENTER,CENTER DEFAULT
            CLS
            RETURN
        ENDIF
        -- compute the mean (average)
        -- the AVG function automatically excludes NULL values
        SELECT AVG(&vcol) INTO vaverage vind1 FROM &vtab
        SET VAR vmsg1 = ('The mean is '+ CTXT(.vaverage) )
             
        -- compute the median
        -- NULL values are excluded by using a WHERE clause
        -- on the CREATE VIEW command
        DROP VIEW median
        CREATE VIEW median as SELECT &vcol FROM &vtab +
            WHERE &vcol IS NOT NULL ORDER BY &vcol
        COMPUTE vrows AS ROWS FROM &vtab WHERE &vcol IS NOT NULL
        -- find out if the number of rows is odd or even
        SET VAR vmod = (MOD(.vrows,2))
        -- the R:BASE WHERE clause keyword COUNT is used to find the 
        -- row that is that number from the beginning of the table.
        -- The first row of a table has a COUNT of 1. By ordering the 
        -- rows in a view, we can use the COUNT to easily find the 
        -- midpoint. A WHERE clause is evaluated before an ORDER BY so 
        -- we can't use this procedure directly on the table
        IF vmod <> 0 THEN
           -- an odd number of rows
           SET VAR vmid = (INT(.vrows/2)+1)
           SELECT &vcol INTO vmedian FROM median WHERE COUNT = .vmid
        ELSE
           -- an even number of rows
           SET VAR vmid1 = (.vrows/2)
           SET VAR vmid2 = ((.vrows/2)+1)
           SELECT &vcol INTO v_value1 FROM median WHERE COUNT = .vmid1
           SELECT &vcol INTO v_value2 FROM median WHERE COUNT = .vmid2
           SET VAR vmedian = ((.v_value1+.v_value2)/2)
        ENDIF
        SET VAR vmsg2 = ('The median is '+ CTXT(.vmedian) )
     
        -- compute the mode
        -- NULL values are excluded by using a WHERE clause
        -- on the CREATE VIEW command
             
        -- find the maximum number of occurrences
        SELECT COUNT(*) INTO vmax vind2 FROM &vtab +
        WHERE &vcol IS NOT NULL GROUP BY &vcol ORDER BY 1 DESC
        IF vmax = 1 THEN
            CLS
            WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17
            WRITE .vmsg1 AT 9 20
            WRITE .vmsg2 AT 11 20
            WRITE 'There is no mode' AT 13 20
        ELSE
            CLS
            WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17
            WRITE .vmsg1 AT 9 20
            WRITE .vmsg2 AT 11 20
            SET HEADINGS OFF
            WRITE 'the mode values   ', 'the number of occurrences'  + 
                AT 13 20
           -- position the cursor
           SET VAR vblank TEXT = ' '
           SHOW VAR vblank AT 14,1
           -- select all the values whose number of occurrences matches
           -- the maximum
           SELECT &vcol=30, COUNT(*)=15 FROM &vtab +
           GROUP BY &vcol HAVING COUNT(*) = .vmax
           SET HEADINGS ON
        ENDIF
        RETURN
     
        ------------
        Please Note:  The preceding article contains commands that are 
        specific to R:BASE 4.5.  The R:BASE 4.5 commands are included to 
        demonstrate new features and capabilities.  These commands must be 
        excluded or modified if you choose to use these programming
        techniques with prior versions of R:BASE.