
PEARSON DISSIMILARITY

Name:
    PEARSON DISSIMILARITY (LET)
    PEARSON SIMILARITY (LET)
Type:
    Let Subcommand
Purpose:
    Compute the Pearson correlation coefficient between two variables, transformed to a dissimilarity (or similarity) measure.
Description:
    The correlation coefficient is a measure of the linear relationship between two variables. It is computed as:

      \( S_{xx} = \sum_{i=1}^{N}{(X_{i} - \bar{X})^{2}} \)

      \( S_{yy} = \sum_{i=1}^{N}{(Y_{i} - \bar{Y})^{2}} \)

      \( S_{xy} = \sum_{i=1}^{N}{(Y_{i} - \bar{Y})(X_{i} - \bar{X})} \)

      \( r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \)

    A perfect linear relationship yields a correlation coefficient of +1 (or -1 for a negative relationship) and no linear relationship yields a correlation coefficient of 0.
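
    As a cross-check outside Dataplot, the formulas above can be sketched in Python. This is an illustrative sketch only, not Dataplot's implementation; the function name pearson_r and the sample data are assumptions made for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from S_xx, S_yy and S_xy (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        # r is undefined when either variable is constant
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one exactly increasing and one exactly decreasing line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_r(x, [2.0 * v + 1.0 for v in x])
r_neg = pearson_r(x, [10.0 - 3.0 * v for v in x])
print(r_pos, r_neg)
```

    The two calls illustrate the limiting cases described above: an exact increasing line gives r = +1 and an exact decreasing line gives r = -1, up to floating-point round-off.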

    In some applications, such as clustering, it can be useful to transform the correlation coefficient to a dissimilarity measure. The transformation used here is

      \( d = \frac{1 - r}{2} \)

    This converts the correlation coefficient with values between -1 and 1 to a score between 0 and 1. High positive correlation (i.e., very similar) results in a dissimilarity near 0 and high negative correlation (i.e., very dissimilar) results in a dissimilarity near 1.

    If a similarity score is preferred, you can use

      \( s = 1 - d \)

    where d is defined as above.
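
    Substituting the definition of d gives s = (1 + r)/2, so both scores are linear rescalings of r onto the interval from 0 to 1. The short Python sketch below tabulates the two transformations for a few values of r; the function names are hypothetical and are not Dataplot commands.

```python
def pearson_dissimilarity(r):
    """Map a correlation r in [-1, 1] to d = (1 - r) / 2 in [0, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return (1.0 - r) / 2.0

def pearson_similarity(r):
    """Map a correlation r in [-1, 1] to s = 1 - d = (1 + r) / 2."""
    return 1.0 - pearson_dissimilarity(r)

# Tabulate r, d and s for a few representative correlations
table = [(r, pearson_dissimilarity(r), pearson_similarity(r))
         for r in (-1.0, 0.0, 0.5, 1.0)]
for r, d, s in table:
    print(f"r = {r:5.2f}   d = {d:4.2f}   s = {s:4.2f}")
```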

Syntax 1:
    LET <par> = PEARSON DISSIMILARITY <y1> <y2>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> is the first response variable;
                <y2> is the second response variable;
                <par> is a parameter where the computed Pearson dissimilarity is stored;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 2:
    LET <par> = PEARSON SIMILARITY <y1> <y2>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> is the first response variable;
                <y2> is the second response variable;
                <par> is a parameter where the computed Pearson similarity is stored;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Examples:
    LET A = PEARSON DISSIMILARITY Y1 Y2
    LET A = PEARSON DISSIMILARITY Y1 Y2 SUBSET TAG > 2
    LET A = PEARSON SIMILARITY Y1 Y2
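
    The SUBSET qualification in the second example restricts the computation to the rows where TAG is greater than 2. A rough Python analogue, using made-up data and a hypothetical helper function, filters the rows first and then applies the same formula:

```python
import math

def pearson_dissimilarity(x, y):
    """d = (1 - r) / 2 for two equal-length, non-constant sequences (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    return (1.0 - sxy / math.sqrt(sxx * syy)) / 2.0

# Made-up data for illustration only
y1  = [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0]
y2  = [1.0, 2.0, 3.0, 4.0, 8.0, 6.0, 4.0, 2.0]
tag = [1, 1, 2, 2, 3, 3, 3, 4]

# Keep only the rows with tag > 2, mirroring "SUBSET TAG > 2"
keep = [i for i, t in enumerate(tag) if t > 2]
d_all = pearson_dissimilarity(y1, y2)
d_subset = pearson_dissimilarity([y1[i] for i in keep], [y2[i] for i in keep])
print(d_all, d_subset)
```

    In this made-up data the retained rows are exactly negatively correlated, so the subset dissimilarity is at its maximum of 1 while the full-data value is smaller.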
Note:
    The two variables must have the same number of elements.
Default:
    None
Synonyms:
    PEARSON DISTANCE is a synonym for PEARSON DISSIMILARITY
Related Commands:
Reference:
    Kaufman and Rousseeuw (1990), "Finding Groups in Data: An Introduction To Cluster Analysis", Wiley.
Applications:
    Clustering
Implementation Date:
    2017/08:
    2018/10: Added PEARSON SIMILARITY
Program 1:
     
    SKIP 25
    READ BERGER1.DAT Y X
    LET CORR = CORRELATION Y X
    LET D    = PEARSON DISSIMILARITY Y X
    PRINT CORR D
        
    The following output is generated
     PARAMETERS AND CONSTANTS--
    
        CORR    --          0.946
        D       --          0.027
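
    As a hand check, substituting the printed (rounded) correlation into the transformation reproduces the printed dissimilarity:

      \( d = \frac{1 - r}{2} = \frac{1 - 0.946}{2} = 0.027 \)

    The corresponding similarity score would be

      \( s = 1 - d = 1 - 0.027 = 0.973 \)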
        
Program 2:
     
    SKIP 25
    READ IRIS.DAT Y1 Y2 Y3 Y4
    SET WRITE DECIMALS 3
    .
    LET M = GENERATE MATRIX PEARSON DISSIMILARITY Y1 Y2 Y3 Y4
    PRINT M
        
    The following output is generated
     
            MATRIX M       --            4 ROWS
                           --            4 COLUMNS
    
     VARIABLES--M1             M2             M3             M4      
    
             -0.000          0.559          0.075          0.155
              0.559          0.000          0.736          0.534
              0.075          0.736          0.000          0.144
              0.155          0.534          0.144          0.000
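
    The matrix holds the dissimilarity for every pair of the listed variables, so it is symmetric with zeros on the diagonal (a variable has correlation 1 with itself). The Python sketch below builds the same kind of matrix from made-up columns; it is an illustration under those assumptions, not Dataplot's implementation, and the helper name is hypothetical.

```python
import math

def pearson_dissimilarity(x, y):
    """d = (1 - r) / 2; assumes equal lengths and that neither column is constant."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of elements")
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    return (1.0 - sxy / math.sqrt(sxx * syy)) / 2.0

# Four made-up columns (not the iris data)
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
]

# Entry (i, j) is the dissimilarity between column i and column j
matrix = [[pearson_dissimilarity(a, b) for b in columns] for a in columns]
for row in matrix:
    print("  ".join(f"{v:7.3f}" for v in row))
```

    The -0.000 entry in the Dataplot output above presumably reflects a tiny negative round-off value on the diagonal; this is an assumption about the printed output, not something the documentation states.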
        
Program 3:
     
    SKIP 25
    READ IRIS.DAT Y1 Y2 Y3 Y4 TAG
    .
    TITLE CASE ASIS
    TITLE OFFSET 2
    LABEL CASE ASIS
    TIC MARK OFFSET UNITS DATA
    Y1LABEL Pearson Dissimilarity Coefficient
    YLIMITS 0 1
    MAJOR YTIC MARK NUMBER 6
    MINOR YTIC MARK NUMBER 1
    Y1TIC MARK LABEL DECIMAL 1
    Y1LABEL DISPLACEMENT 20
    X1LABEL Species
    XLIMITS 1 3
    MAJOR XTIC MARK NUMBER 3
    MINOR XTIC MARK NUMBER 0
    XTIC MARK OFFSET 0.3 0.3
    X1LABEL DISPLACEMENT 14
    CHARACTER X BLANK
    LINES BLANK SOLID
    .
    MULTIPLOT CORNER COORDINATES 5 5 95 95
    MULTIPLOT SCALE FACTOR 2
    MULTIPLOT 2 3
    .
    TITLE Sepal Length vs Sepal Width
    CORRELATION PLOT Y1 Y2 TAG
    .
    TITLE Sepal Length vs Petal Length
    CORRELATION PLOT Y1 Y3 TAG
    .
    TITLE Sepal Length vs Petal Width
    CORRELATION PLOT Y1 Y4 TAG
    .
    TITLE Sepal Width vs Petal Length
    CORRELATION PLOT Y2 Y3 TAG
    .
    TITLE Sepal Width vs Petal Width
    CORRELATION PLOT Y2 Y4 TAG
    .
    TITLE Petal Length vs Petal Width
    CORRELATION PLOT Y3 Y4 TAG
    .
    END OF MULTIPLOT
        

    plot generated by sample program
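
    Each CORRELATION PLOT command in Program 3 appears to compute the statistic separately for each distinct value of TAG (here, each species) and plot the results against the group identifier. Assuming that reading is right, the per-group computation can be sketched in Python as follows; the data and function names are made up for illustration.

```python
import math
from collections import defaultdict

def pearson_r(x, y):
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def dissimilarity_by_group(y1, y2, tag):
    """Return {group id: (1 - r) / 2} with r computed within each group."""
    groups = defaultdict(lambda: ([], []))
    for a, b, t in zip(y1, y2, tag):
        groups[t][0].append(a)
        groups[t][1].append(b)
    return {t: (1.0 - pearson_r(xs, ys)) / 2.0
            for t, (xs, ys) in sorted(groups.items())}

# Made-up data with three groups of three observations each
y1  = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y2  = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0, 1.0, 3.0, 2.0]
tag = [1, 1, 1, 2, 2, 2, 3, 3, 3]

per_group = dissimilarity_by_group(y1, y2, tag)
for t, d in per_group.items():
    print(t, round(d, 3))
```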

Date created: 09/05/2017
Last updated: 09/05/2017

Please email comments on this WWW page to alan.heckert@nist.gov.