
# PEARSON DISSIMILARITY

Name:
PEARSON DISSIMILARITY (LET)
PEARSON SIMILARITY (LET)
Type:
Let Subcommand
Purpose:
Compute the Pearson correlation coefficient between two variables, transformed to a dissimilarity measure.
Description:
The correlation coefficient is a measure of the linear relationship between two variables. It is computed as:

$$S_{xx} = \sum_{i=1}^{N}{(X_{i} - \bar{X})^{2}}$$

$$S_{yy} = \sum_{i=1}^{N}{(Y_{i} - \bar{Y})^{2}}$$

$$S_{xy} = \sum_{i=1}^{N}{(Y_{i} - \bar{Y})(X_{i} - \bar{X})}$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

A perfect positive linear relationship yields a correlation coefficient of +1, a perfect negative linear relationship yields -1, and the absence of any linear relationship yields 0.
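
The sums above can be reproduced step by step in Dataplot as a cross-check on the built-in CORRELATION command. The following is a minimal sketch rather than part of the original documentation: the five-point data set and the names XBAR, YBAR, DX, DY, DX2, DY2, DXY, SXX, SYY, SXY, R and RCHECK are made up for illustration.

```
. Illustrative data (hypothetical, not from a Dataplot sample file)
LET X = DATA 1 2 3 4 5
LET Y = DATA 2 4 5 4 5
. Deviations from the means
LET XBAR = MEAN X
LET YBAR = MEAN Y
LET DX = X - XBAR
LET DY = Y - YBAR
. Sums of squares and cross products
LET DX2 = DX**2
LET DY2 = DY**2
LET DXY = DX*DY
LET SXX = SUM DX2
LET SYY = SUM DY2
LET SXY = SUM DXY
. Correlation from the formula, then from the built-in command
LET R = SXY/SQRT(SXX*SYY)
LET RCHECK = CORRELATION Y X
PRINT R RCHECK
```

For these values the hand calculation gives $S_{xx} = 10$, $S_{yy} = 6$, $S_{xy} = 6$ and $r = 6/\sqrt{60} \approx 0.775$, so both parameters should agree with that figure.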

In some applications, such as clustering, it can be useful to transform the correlation coefficient to a dissimilarity measure. The transformation used here is

$$d = \frac{1 - r}{2}$$

This converts the correlation coefficient with values between -1 and 1 to a score between 0 and 1. High positive correlation (i.e., very similar) results in a dissimilarity near 0 and high negative correlation (i.e., very dissimilar) results in a dissimilarity near 1.
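
As a quick check on the transformation, the two ends and the midpoint of the correlation scale map as follows:

$$r = 1 \Rightarrow d = 0, \qquad r = 0 \Rightarrow d = 0.5, \qquad r = -1 \Rightarrow d = 1$$

Uncorrelated variables therefore land at 0.5, not at either end of the scale. Applying the same formula to the correlation of 0.946 reported in Program 1 on this page reproduces the dissimilarity printed there:

$$d = \frac{1 - 0.946}{2} = 0.027$$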

If a similarity score is preferred, you can use

$$s = 1 - d$$

where d is defined as above.
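
Both transformations can also be applied by hand to the output of the CORRELATION command, which gives a simple consistency check against the built-in commands. This is a sketch that assumes two equal-length variables Y1 and Y2 have already been read; the parameter names R, D, S, DCHECK and SCHECK are arbitrary.

```
. Correlation, then the transformations applied by hand
LET R = CORRELATION Y1 Y2
LET D = (1 - R)/2
LET S = 1 - D
. Built-in commands for comparison
LET DCHECK = PEARSON DISSIMILARITY Y1 Y2
LET SCHECK = PEARSON SIMILARITY Y1 Y2
PRINT R D DCHECK S SCHECK
```

Up to rounding, D should match DCHECK and S should match SCHECK. Substituting the definition of d shows that the similarity simplifies to $s = (1 + r)/2$.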

Syntax 1:
LET <par> = PEARSON DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Pearson dissimilarity is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 2:
LET <par> = PEARSON SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Pearson similarity is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Examples:
LET A = PEARSON DISSIMILARITY Y1 Y2
LET A = PEARSON DISSIMILARITY Y1 Y2 SUBSET TAG > 2
LET A = PEARSON SIMILARITY Y1 Y2
Note:
The two variables must have the same number of elements.
Default:
None
Synonyms:
PEARSON DISTANCE is a synonym for PEARSON DISSIMILARITY
Related Commands:
CORRELATION = Compute the Pearson correlation of two variables.
SPEARMAN DISSIMILARITY = Compute the dissimilarity of two variables based on Spearman's rank correlation.
KENDALL TAU DISSIMILARITY = Compute the dissimilarity of two variables based on Kendall's tau correlation.
COSINE DISTANCE = Compute the cosine distance.
MANHATTAN DISTANCE = Compute the Manhattan distance.
EUCLIDEAN DISTANCE = Compute the Euclidean distance.
MATRIX DISTANCE = Compute various distance metrics for a matrix.
GENERATE MATRIX = Compute a matrix of pairwise statistic values.
Reference:
Kaufman and Rousseeuw (1990), "Finding Groups in Data: An Introduction To Cluster Analysis", Wiley.
Applications:
Clustering
Implementation Date:
2017/08
Program 1:

SKIP 25
LET CORR = CORRELATION Y X
LET D    = PEARSON DISSIMILARITY Y X
PRINT CORR D

The following output is generated
 PARAMETERS AND CONSTANTS--

CORR    --          0.946
D       --          0.027

Program 2:

SKIP 25
READ IRIS.DAT Y1 Y2 Y3 Y4
SET WRITE DECIMALS 3
.
LET M = GENERATE MATRIX PEARSON DISSIMILARITY Y1 Y2 Y3 Y4
PRINT M

The following output is generated

MATRIX M       --            4 ROWS
--            4 COLUMNS

VARIABLES--M1             M2             M3             M4

-0.000          0.559          0.075          0.155
0.559          0.000          0.736          0.534
0.075          0.736          0.000          0.144
0.155          0.534          0.144          0.000
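
Because the transformation is linear, any entry of the matrix can be converted back to a correlation by inverting $d = (1 - r)/2$. As an illustration using two of the printed (rounded) entries:

$$r = 1 - 2d$$

$$d = 0.559 \Rightarrow r = -0.118, \qquad d = 0.075 \Rightarrow r = 0.850$$

Entries above 0.5 therefore correspond to negative correlations and entries below 0.5 to positive correlations.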

Program 3:

SKIP 25
READ IRIS.DAT Y1 Y2 Y3 Y4 TAG
.
TITLE CASE ASIS
TITLE OFFSET 2
LABEL CASE ASIS
TIC MARK OFFSET UNITS DATA
Y1LABEL Pearson Dissimilarity Coefficient
YLIMITS 0 1
MAJOR YTIC MARK NUMBER 6
MINOR YTIC MARK NUMBER 1
Y1TIC MARK LABEL DECIMAL 1
Y1LABEL DISPLACEMENT 20
X1LABEL Species
XLIMITS 1 3
MAJOR XTIC MARK NUMBER 3
MINOR XTIC MARK NUMBER 0
XTIC MARK OFFSET 0.3 0.3
X1LABEL DISPLACEMENT 14
CHARACTER X BLANK
LINES BLANK SOLID
.
MULTIPLOT CORNER COORDINATES 5 5 95 95
MULTIPLOT SCALE FACTOR 2
MULTIPLOT 2 3
.
TITLE Sepal Length vs Sepal Width
CORRELATION PLOT Y1 Y2 TAG
.
TITLE Sepal Length vs Petal Length
CORRELATION PLOT Y1 Y3 TAG
.
TITLE Sepal Length vs Petal Width
CORRELATION PLOT Y1 Y4 TAG
.
TITLE Sepal Width vs Petal Length
CORRELATION PLOT Y2 Y3 TAG
.
TITLE Sepal Width vs Petal Width
CORRELATION PLOT Y2 Y4 TAG
.
TITLE Petal Length vs Petal Width
CORRELATION PLOT Y3 Y4 TAG
.
END OF MULTIPLOT



NIST is an agency of the U.S. Commerce Department.

Date created: 09/05/2017
Last updated: 09/05/2017