K NEAREST NEIGHBORS CLASSIFICATION PLOT

Name:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT
Type:
    Graphics Command
Purpose:
    Generate a k nearest neighbors classification plot.
Description:
    The classification problem is to assign an observation to a group. It is assumed that the groups are mutually exclusive (i.e., an observation belongs to exactly one group) and exhaustive (i.e., an observation has to belong to one of the groups).

    The data consist of both training data and observations to be classified. The training data are observations for which the group-id is known.

    The k nearest neighbors method classifies an observation according to the most common class among the k training observations nearest to it. For example, with k = 5, if three of the five nearest training observations belong to group 2, the observation is assigned to group 2. If two or more classes tie, the observation is assigned to the tied class with the minimum combined distance to the observation.

    The criterion for "nearest" is the distance between the observation and a training observation. By default, the Euclidean distance is used. See the Note section below for how to specify a different distance metric.

    If there are two variables, the first variable is plotted on the y-axis and the second variable on the x-axis. If there are more than two variables, the first principal component of the variables is plotted on the y-axis and the second principal component on the x-axis. In the case with more than two variables, you can alternatively use the first two variables rather than the principal components by entering the command

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL ...
                  COMPONENTS NO

    To reset the default, enter

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL ...
                  COMPONENTS YES
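
    For example, the following sketch (assuming four response variables X1 to X4 and a group-id variable TAG, as in the Syntax section) plots the first two variables on the axes rather than the principal components:

      SET NEAREST NEIGHBORS CLASSIFICATION PRINCIPAL COMPONENTS NO
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG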

    If there are L categories, then the first L traces in the plot are the L categories for the training data. Similarly, traces L+1 to 2*L are the L categories for the observations to be classified. The ordering is from the lowest category value to the highest. For example, if there are two categories, you might do something like

      LINE BLANK ALL
      CHARACTER CIRCLE SQUARE CIRCLE SQUARE
      CHARACTER FILL OFF OFF ON ON
      CHARACTER COLOR BLACK BLACK RED BLUE

    This will draw the training observations as unfilled black circles and squares and the observations to be classified as filled red circles and filled blue squares. This is demonstrated in the Program example below.

Syntax:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT <y1> ... <yk> <tag>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> ... <yk> is a list of response variables;
                <tag> is the group-id variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    All of the variables must have the same length. Observations where the <tag> variable has a value of zero are the observations to be classified. If no values in the <tag> variable are zero, an error is reported and no plot is generated.
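
    For example, the following sketch (assuming a group-id variable TAG with 150 rows, as in the iris example below) copies TAG and flags every tenth row for classification:

      LET TAG2 = TAG
      LET TAG2 = 0 FOR I = 10 10 150
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG2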

Examples:
    K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG
    K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 TAG ...
                SUBSET TAG >= 0
Note:
    By default, Dataplot will use the 5 nearest neighbors. To change this, enter the command

      SET NEAREST NEIGHBOR CLASSIFICATION K <value>

    where <value> is a positive integer. There is a trade-off in setting the value of K. Larger values of K can reduce the effect of noise on the classification at the cost of making the boundaries between classes less distinct.
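
    For example, the following sketch (assuming the variables from the Syntax section) classifies based on the 7 nearest neighbors:

      SET NEAREST NEIGHBOR CLASSIFICATION K 7
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG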

Note:
    By default, the Euclidean distance is used as the metric for "nearest". In most cases, this is the appropriate choice. However, you can specify an alternative distance metric with the command

      SET NEAREST NEIGHBOR CLASSIFICATION DISTANCE <value>

    where <value> is one of the following

      EUCLIDEAN
      MINKOWSKY
      BLOCK
      CANBERRA
      CHEBYCHEV
      COSINE
      ANGULAR COSINE
      JACCARD
      PEARSON
      HAMMING

    Enter HELP MATRIX DISTANCE to see the definitions of these distances.
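
    For example, the following sketch (assuming the variables from the Syntax section) classifies using the Canberra distance:

      SET NEAREST NEIGHBOR CLASSIFICATION DISTANCE CANBERRA
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 X3 X4 TAG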

Note:
    The classification results are written to the file dpst1f.dat. The first column contains the row-id of the observation being classified and the second column specifies the group to which the observation is assigned.
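
    Since dpst1f.dat is a two-column ASCII file, the assignments can be read back into Dataplot with something like the following sketch (the variable names ROWID and GROUP are arbitrary; adjust SKIP if the file contains header lines):

      SKIP 0
      READ DPST1F.DAT ROWID GROUP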
Note:
    The distance matrix is written to the file dpst2f.dat. Since the distance matrix is symmetric, only the upper triangular part of the matrix is written. Specifically, the first column is the row-id, the second column is the column-id, and the third column is the associated distance value.
Note:
    If the variables are on different scales, it may be useful to standardize the variables. The most common standardization is the z-score (i.e., subtract the mean and divide by the standard deviation).

    The K NEAREST NEIGHBORS CLASSIFICATION PLOT command does not standardize the data. If you want to standardize the data, do so before calling this command. This is demonstrated in the Program example below. Performing the standardization as a separate step allows more flexibility in the choice of standardization method.
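
    A minimal sketch of this pattern (RAW1 and RAW2 are hypothetical names for unscaled variables):

      LET X1 = ZSCORE RAW1
      LET X2 = ZSCORE RAW2
      K NEAREST NEIGHBORS CLASSIFICATION PLOT X1 X2 TAG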

Note:
    If there are a large number of variables, it may be helpful to perform some dimension reduction first. For example, you can take the principal components of the original data and generate the K NEAREST NEIGHBORS CLASSIFICATION PLOT based on the most important principal components.
Note:
    This command currently allows a maximum of 50 variables and a maximum of 50 classes in the group-id variable.

    If there are more than 50 variables, it is recommended that some type of dimension reduction, such as principal components, be used.

Defaults:
    K = 5

    The first two principal components will be used for the plot when there are more than two variables.

    Euclidean distances will be used.

Synonyms:
    K NEAREST NEIGHBORS PLOT
    K NEAREST NEIGHBORS DISCRIMINATION PLOT
    KNN CLASSIFICATION PLOT
    KNN DISCRIMINATION PLOT
    KNN PLOT
Reference:
    Hastie, Tibshirani, and Friedman (2001), "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, Chapters 2 and 4.
Applications:
    Classification
Implementation Date:
    2024/07
Program:
     
    . Step 1:   Read the data and standardize based on z-scores
    .
    skip 25
    read iris.dat seplen sepwidth petlen petwidth tag
    skip 0
    let x1 = zscore seplen
    let x2 = zscore sepwidth
    let x3 = zscore petlen
    let x4 = zscore petwidth
    .
    . Step 2:   Generate the plot
    .
    line blank all
    character hw 1 0.75 all
    character circle triangle revtri circle triangle revtri
    character fill off off off on on on
    character color black black black blue red green
    .
    y1label First Principal Component
    x1label Second Principal Component
    title K Nearest Neighbors Classification Plot for Iris Data
    .
    . Step 3:   Flag every 10th row (rows 10, 20, ..., 150) to be
    .           classified and set the number of neighbors
    .
    let tag2 = tag
    let tag2 = 0 for i = 10 10 150
    let ktemp = 3
    set nearest neighbor classification k ^ktemp
    x2label K = ^ktemp
    .
    k nearest neighbors classification plot x1 x2 x3 x4 tag2
        
Date created: 07/31/2024
Last updated: 07/31/2024

Please email comments on this WWW page to alan.heckert@nist.gov.