K NEAREST NEIGHBORS CLASSIFICATION PLOT

Name:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT
Type:
    Graphics Command
Purpose:
    Generate a k nearest neighbors classification plot.
Description:
    The classification problem is to assign an observation to a group. It is assumed that the groups are mutually exclusive (i.e., an observation belongs to exactly one group) and exhaustive (i.e., an observation has to belong to one of the groups).

    The data consist of both training data and observations to be classified. The training data are observations for which the group-id is known.

    The k nearest neighbors method classifies an observation according to the most common class among the k training observations nearest to it. For example, with k = 5, if three of the five nearest training observations belong to group 2, the observation is assigned to group 2. If two or more classes tie, the observation is assigned to the tied class with the minimum combined distance to the observation.

    The criterion for "nearest" is the distance between the observation and a training observation. By default, the Euclidean distance is used. See the Note section below for how to specify a different distance metric.

    If there are two variables, the first variable is plotted on the y-axis and the second variable on the x-axis. If there are more than two variables, the first principal component of the variables is plotted on the y-axis and the second principal component on the x-axis. In the case with more than two variables, you can alternatively use the first two variables rather than the principal components by entering the command

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL ...
                  COMPONENTS NO

    To reset the default, enter

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL ...
                  COMPONENTS YES
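
    For example, the following sketch (assuming four response variables X1 to X4 and a group-id variable TAG, as in the Syntax section) plots the first two variables on the axes rather than the principal components:

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL COMPONENTS NO
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG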

    If there are L categories, then the first L traces in the plot are the L categories for the training data. Similarly, traces L+1 to 2*L are the L categories for the observations to be classified. The ordering is from the lowest category value to the highest. For example, if there are two categories, you might do something like

      LINE BLANK ALL
      CHARACTER CIRCLE SQUARE CIRCLE SQUARE
      CHARACTER FILL OFF OFF ON ON
      CHARACTER COLOR BLACK BLACK RED BLUE

    This will draw the training observations as unfilled black circles and squares and the observations to be classified as filled red circles and filled blue squares. This is demonstrated in the Program example below.

Syntax:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT <y1> ... <yk> <tag>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> ... <yk> is a list of response variables;
                <tag> is the group-id variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    All of the variables must have the same length. Observations where the <tag> variable has a value of zero are the observations to be classified. If no values in the <tag> variable are zero, an error is reported and no plot is generated.
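
    For example, the following sketch (assuming a group-id variable TAG with 150 rows, as in the iris example below) copies TAG and flags every tenth row for classification:

      LET TAG2 = TAG
      LET TAG2 = 0 FOR I = 10 10 150
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG2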

Examples:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG
    K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 TAG ...
                SUBSET TAG >= 0
Note:
    By default, Dataplot will use the 5 nearest neighbors. To change this, enter the command

      SET NEAREST NEIGHBOR CLASSIFICATION K <value>

    where <value> is a positive integer. There is a trade-off in setting the value of K. Larger values of K can reduce the effect of noise on the classification at the cost of making the boundaries between classes less distinct.
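
    For example, the following sketch (assuming the variables from the Syntax section) classifies based on the 7 nearest neighbors:

      SET NEAREST NEIGHBOR CLASSIFICATION K 7
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG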

Note:
    By default, the Euclidean distance is used as the metric for "nearest". In most cases, this is the appropriate choice. However, you can specify an alternative distance metric with the command

      SET NEAREST NEIGHBOR CLASSIFICATION DISTANCE <value>

    where <value> is one of the following

      EUCLIDEAN
      MINKOWSKY
      BLOCK
      CANBERRA
      CHEBYCHEV
      COSINE
      ANGULAR COSINE
      JACCARD
      PEARSON
      HAMMING

    Enter HELP MATRIX DISTANCE to see the definitions of these distances.
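
    For example, the following sketch (assuming the variables from the Syntax section) classifies using the Canberra distance:

      SET NEAREST NEIGHBOR CLASSIFICATION DISTANCE CANBERRA
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG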

Note:
    The classification results are written to the file dpst1f.dat. The first column contains the row-id of the observation being classified and the second column specifies the group to which the observation is assigned.
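
    Since dpst1f.dat is a two-column ASCII file, the assignments can be read back into Dataplot with something like the following sketch (the variable names ROWID and GROUP are arbitrary; adjust SKIP if the file contains header lines):

      SKIP 0
      READ DPST1F.DAT ROWID GROUP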
Note:
    The distance matrix is written to the file dpst2f.dat. Since the distance matrix is symmetric, only the upper triangular part of the matrix is written. Specifically, the first column is the row-id, the second column is the column-id, and the third column is the associated distance value.
Note:
    If the variables are on different scales, it may be useful to standardize the variables. The most common standardization is the z-score (i.e., subtract the mean and divide by the standard deviation).

    The K NEAREST NEIGHBORS CLASSIFICATION PLOT command does not standardize the data. If you want to standardize the data, do so before calling this command. This is demonstrated in the Program example below. Performing the standardization as a separate step allows more flexibility in the choice of standardization method.
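
    A minimal sketch of this pattern (RAW1 and RAW2 are hypothetical names for unscaled variables):

      LET X1 = ZSCORE RAW1
      LET X2 = ZSCORE RAW2
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 TAG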

Note:
    If there are a large number of variables, it may be helpful to perform some dimension reduction first. For example, you can take the principal components of the original data and generate the K NEAREST NEIGHBORS CLASSIFICATION PLOT based on the most important principal components.
Note:
    This command currently allows a maximum of 50 variables and a maximum of 50 classes in the group-id variable.

    If there are more than 50 variables, it is recommended that some type of dimension reduction, such as principal components, be used.

Defaults:
    K = 5

    The first two principal components will be used for the plot when there are more than two variables.

    Euclidean distances will be used.

Synonyms:
    K NEAREST NEIGHBORS PLOT
    K NEAREST NEIGHBORS DISCRIMINATION PLOT
    KNN CLASSIFICATION PLOT
    KNN DISCRIMINATION PLOT
    KNN PLOT
Reference:
    Hastie, Tibshirani, and Friedman (2001), "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, Chapters 2 and 4.
Applications:
    Classification
Implementation Date:
    2024/07
Program:
     
    . Step 1:   Read the data and standardize based on z-scores
    .
    skip 25
    read iris.dat seplen sepwidth petlen petwidth tag
    skip 0
    let x1 = zscore seplen
    let x2 = zscore sepwidth
    let x3 = zscore petlen
    let x4 = zscore petwidth
    .
    . Step 2:   Generate the plot
    .
    line blank all
    character hw 1 0.75 all
    character circle triangle revtri circle triangle revtri
    character fill off off off on on on
    character color black black black blue red green
    .
    y1label First Principal Component
    x1label Second Principal Component
    title K Nearest Neighbors Classification Plot for Iris Data
    .
    . Step 3:   Flag every 10th row (rows 10, 20, ..., 150) to be
    .           classified and set the number of neighbors
    .
    let tag2 = tag
    let tag2 = 0 for i = 10 10 150
    let ktemp = 3
    set nearest neighbor classification k ^ktemp
    x2label K = ^ktemp
    .
    k nearest neighbors classification plot x1 x2 x3 x4 tag2
        
Date created: 07/31/2024
Last updated: 07/31/2024

Please email comments on this WWW page to alan.heckert@nist.gov.