Doc# 746
=====================================================================
Calculating the Mean, Median and Mode
=====================================================================
Products: R:BASE 3.x, R:BASE 4.0, R:BASE 4.0a, R:BASE 4.5
=====================================================================
Area: Application Solutions CATALOG: Programming in R:BASE
=====================================================================
The mean, the median and the mode are three common statistical
measures for looking at how data is grouped around the center or
midpoint of data. Many complex formulas and variations can be used;
here we show how to calculate the mean, the median and the mode using
simple raw data methods.
The mean is the average -- the sum of all the values divided by the
number of values. The mean is affected by values that are very
different. A single very low value or very high value can skew the
mean to the left or right.
The median is the exact midpoint of the ordered set of values -- the
number where half the values are below it and half are above it. With
an odd number of values the median is the actual midpoint value. With
an even number of rows, the median is the average of the two middle
numbers. Very low or very high numbers do not affect the median as
they do the mean.
The mode is the single value with the most occurrences. If each value
occurs only once, there is no mode. If two values occur with the same
frequency, they are both considered the mode and the data is said to
be bimodal. The data can also be multimodal where more than two
values occur with the same frequency.
For example, look at the mean, median and mode for this
list of 10 test scores.
65
75 ---- The mode - the value with the
The mean or 75 ---- most occurrences
average is 84____ 77
87 ____ The median is 87.5, the
88 average of 87 and 88;
92 five values are above it,
93 five values below it
94
98
These measures when used together show if the data distribution is
symmetrical or skewed. When the mean, median and mode are all the
same or very close together, there is a symmetrical distribution.
When the median is less than the mean, the data is skewed to the
left. When the median is greater than the mean, the data is skewed
to the right. The size of the difference indicates the degree the
data is skewed. In the above example, the low score of 65 skews the
mean slightly to the left.
The mean or average is easy to calculate in R:BASE -- just use the
AVG function with the SELECT command. The mode is also calculated
using SELECT, but it requires two commands -- one to find the maximum
number of occurrences in the data for any value, the second to find
the actual values. The median is a more difficult number to find; the
easiest method is to use an ordered view and the special R:BASE
keyword COUNT. The command file below calculates the mean, median and
mode for a selected column. The column cannot be a TEXT, NOTE, DATE
or TIME datatype. The column and table name are held in variables.
The calculations use these as ampersand variables. See the article
"Working with Variables" in the September/October 1993 Exchange for
more information on using ampersand variables.
CLS
SET MESSAGES OFF
SET ERROR MESSAGES OFF
CLEAR ALL VAR
CHOOSE vtab FROM #TABLES AT CENTER,CENTER
-- if no table selected, end the program
SET VAR vkey = (LASTKEY(0))
IF vkey = '[Esc]' THEN
RETURN
ENDIF
CHOOSE vcol FROM #COLUMNS IN &vtab CASCADE
-- if no column selected, end the program
SET VAR vkey = (LASTKEY(0))
IF vkey = '[Esc]' THEN
RETURN
ENDIF
WRITE 'Processing...' AT 24,3
-- check to see if the selected column is a valid
datataype
SELECT sys_data_type INTO vdatatype FROM sys_columns
WHERE sys_column_name = .vcol
IF vdatatype IN (3,8,4,5) THEN
CLS FROM 24,1 TO 24, 50
PAUSE FOR 30 USING 'You selected a TEXT,NOTE,DATE or TIME +
column. You can''t calculate statistics on this column.' +
AT CENTER,CENTER DEFAULT
CLS
RETURN
ENDIF
-- compute the mean (average)
-- the AVG function automatically excludes NULL values
SELECT AVG(&vcol) INTO vaverage vind1 FROM &vtab
SET VAR vmsg1 = ('The mean is '+ CTXT(.vaverage) )
-- compute the median
-- NULL values are excluded by using a WHERE clause
-- on the CREATE VIEW command
DROP VIEW median
CREATE VIEW median as SELECT &vcol FROM &vtab +
WHERE &vcol IS NOT NULL ORDER BY &vcol
COMPUTE vrows AS ROWS FROM &vtab WHERE &vcol IS NOT NULL
-- find out if the number of rows is odd or even
SET VAR vmod = (MOD(.vrows,2))
-- the R:BASE WHERE clause keyword COUNT is used to find the
-- row that is that number from the beginning of the table.
-- The first row of a table has a COUNT of 1. By ordering the
-- rows in a view, we can use the COUNT to easily find the
-- midpoint. A WHERE clause is evaluated before an ORDER BY so
-- we can't use this procedure directly on the table
IF vmod <> 0 THEN
-- an odd number of rows
SET VAR vmid = (INT(.vrows/2)+1)
SELECT &vcol INTO vmedian FROM median WHERE COUNT = .vmid
ELSE
-- an even number of rows
SET VAR vmid1 = (.vrows/2)
SET VAR vmid2 = ((.vrows/2)+1)
SELECT &vcol INTO v_value1 FROM median WHERE COUNT = .vmid1
SELECT &vcol INTO v_value2 FROM median WHERE COUNT = .vmid2
SET VAR vmedian = ((.v_value1+.v_value2)/2)
ENDIF
SET VAR vmsg2 = ('The median is '+ CTXT(.vmedian) )
-- compute the mode
-- NULL values are excluded by using a WHERE clause
-- on the CREATE VIEW command
-- find the maximum number of occurrences
SELECT COUNT(*) INTO vmax vind2 FROM &vtab +
WHERE &vcol IS NOT NULL GROUP BY &vcol ORDER BY 1 DESC
IF vmax = 1 THEN
CLS
WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17
WRITE .vmsg1 AT 9 20
WRITE .vmsg2 AT 11 20
WRITE 'There is no mode' AT 13 20
ELSE
CLS
WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17
WRITE .vmsg1 AT 9 20
WRITE .vmsg2 AT 11 20
SET HEADINGS OFF
WRITE 'the mode values ', 'the number of occurrences' +
AT 13 20
-- position the cursor
SET VAR vblank TEXT = ' '
SHOW VAR vblank AT 14,1
-- select all the values whose number of occurrences matches
-- the maximum
SELECT &vcol=30, COUNT(*)=15 FROM &vtab +
GROUP BY &vcol HAVING COUNT(*) = .vmax
SET HEADINGS ON
ENDIF
RETURN
------------
Please Note: The preceding article contains commands that are
specific to R:BASE 4.5. The R:BASE 4.5 commands are included to
demonstrate new features and capabilities. These commands must be
excluded or modified if you choose to use these programming
techniques with prior versions of R:BASE.