
MATRIX DISTANCE

Name:
MATRIX DISTANCE (LET)
Type:
Let Subcommand
Purpose:
Compute the distance matrix of a matrix.
Description:
Dataplot can compute the distances relative to either rows or columns.

Given an nxp data matrix X, we compute a distance matrix D. For row distances, the Dij element of the distance matrix is the distance between row i and row j, which results in a nxn D matrix. For column distances, the Dij element of the distance matrix is the distance between column i and column j, which results in a pxp D matrix.

The following distance metrics are available (the 2018/10 version of Dataplot added several of them).

1. The Euclidean row distance between rows i and j is defined as

$$D_{ij} = \sqrt{\sum_{k=1}^{p}{(X_{ik} - X_{jk})^{2}}}$$

The Euclidean column distance is defined as

$$D_{ij} = \sqrt{\sum_{k=1}^{n}{(X_{ki} - X_{kj})^{2}}}$$

The Euclidean distance is simply the square root of the sum of the squared differences between corresponding elements of the rows (or columns). This is probably the most commonly used distance metric.
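The two Euclidean definitions above can be illustrated with a short Python sketch. This is not Dataplot code; the function names and the sample matrix are hypothetical and chosen only for illustration.

```python
import math

def euclidean_row_distance(X):
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    n = len(X)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def euclidean_column_distance(X):
    """Return the p x p matrix of Euclidean distances between the columns of X."""
    columns = [list(col) for col in zip(*X)]  # transpose, then reuse the row version
    return euclidean_row_distance(columns)

# Hypothetical 3 x 2 data matrix (n = 3 rows, p = 2 columns)
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D_rows = euclidean_row_distance(X)     # 3 x 3 matrix
D_cols = euclidean_column_distance(X)  # 2 x 2 matrix
```

Both results are symmetric with zeros on the diagonal, which matches the structure of the sample output shown in the Program section.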

2. The Mahalanobis distance is defined as

$$D_{ij} = \sqrt{(X_i - X_j)' S^{-1} (X_i - X_j)}$$

where $$S^{-1}$$ is the inverse of the variance-covariance matrix of X. The row distances are obtained by letting Xi and Xj represent the i-th and j-th row while the column distances are obtained by letting Xi and Xj represent the i-th and j-th columns.

The Mahalanobis distance is effectively a weighted Euclidean distance where the weighting is determined by the sample variance-covariance matrix.
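As a rough illustration of the Mahalanobis row distance, the following Python sketch handles the special case of a matrix with exactly two columns, so that the covariance matrix can be inverted in closed form. It is not Dataplot code. The function names, the sample data, and the use of the n - 1 divisor for the sample covariance are assumptions made for this example.

```python
import math

def covariance_2x2(X):
    """Sample variance-covariance matrix of the two columns of X (divisor n - 1, an assumption)."""
    n = len(X)
    m0 = sum(row[0] for row in X) / n
    m1 = sum(row[1] for row in X) / n
    s00 = sum((row[0] - m0) ** 2 for row in X) / (n - 1)
    s11 = sum((row[1] - m1) ** 2 for row in X) / (n - 1)
    s01 = sum((row[0] - m0) * (row[1] - m1) for row in X) / (n - 1)
    return [[s00, s01], [s01, s11]]

def invert_2x2(S):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def mahalanobis_row_distance(X):
    """Return the n x n matrix of Mahalanobis distances between the rows of a two-column X."""
    S_inv = invert_2x2(covariance_2x2(X))
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d0 = X[i][0] - X[j][0]
            d1 = X[i][1] - X[j][1]
            # Quadratic form (Xi - Xj)' S^{-1} (Xi - Xj)
            q = (d0 * (S_inv[0][0] * d0 + S_inv[0][1] * d1)
                 + d1 * (S_inv[1][0] * d0 + S_inv[1][1] * d1))
            D[i][j] = math.sqrt(max(q, 0.0))
    return D

# Hypothetical data: uncorrelated columns, each with sample variance 4/3
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
D = mahalanobis_row_distance(X)
```

For this data the covariance matrix is a multiple of the identity, so each Mahalanobis distance is just the Euclidean distance rescaled by the common standard deviation.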

3. The Minkowsky row distance is defined as

$$D_{ij} = \left(\sum_{k=1}^{p}{|X_{ik} - X_{jk}|^{P}}\right)^{1/P}$$

The column distance is similar, but the summation is over the number of rows rather than the number of columns.

The Minkowsky distance is the P-th root of the sum of the absolute differences to the P-th power between corresponding elements of the rows (or columns). The Euclidean distance is the special case of P = 2.
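The prose description above (the P-th root of the sum of the P-th powers of the absolute differences) can be sketched in Python as follows. This is not Dataplot code; the function name and sample data are hypothetical.

```python
def minkowsky_row_distance(X, P):
    """Return the n x n matrix of Minkowsky distances of order P between the rows of X."""
    n = len(X)
    return [[sum(abs(a - b) ** P for a, b in zip(X[i], X[j])) ** (1.0 / P)
             for j in range(n)] for i in range(n)]

# Hypothetical 2 x 2 data matrix
X = [[0.0, 0.0], [3.0, 4.0]]
D_p1 = minkowsky_row_distance(X, 1)     # P = 1: sum of absolute differences
D_p2 = minkowsky_row_distance(X, 2)     # P = 2: Euclidean distance
D_p15 = minkowsky_row_distance(X, 1.5)  # intermediate order
```

For a fixed pair of rows the distance does not increase as P grows, so the P = 1.5 value falls between the P = 2 and P = 1 values.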

4. The block row distance is defined as

$$D_{ij} = \sum_{k=1}^{p}{|X_{ik} - X_{jk}|}$$

The column distance is similar, but the summation is over the number of rows rather than the number of columns.

The block distance is the sum of the absolute differences between corresponding elements of the rows (or columns). Note that this is a special case of the Minkowsky distance with P = 1.

The block distance is also known as the city block or Manhattan distance.

5. The Chebychev row distance is defined as

$$D_{ij} = \max_{k} |X_{ik} - X_{jk}|$$

The column distance is similar, but the maximum is over the rows rather than the columns.
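A minimal Python sketch of the Chebychev row and column distances follows. It is not Dataplot code; the function names and sample data are hypothetical.

```python
def chebychev_row_distance(X):
    """Return the n x n matrix of Chebychev (maximum absolute difference) distances between rows."""
    n = len(X)
    return [[max(abs(a - b) for a, b in zip(X[i], X[j]))
             for j in range(n)] for i in range(n)]

def chebychev_column_distance(X):
    """Return the p x p matrix of Chebychev distances between columns."""
    columns = [list(col) for col in zip(*X)]  # transpose, then reuse the row version
    return chebychev_row_distance(columns)

# Hypothetical 2 x 3 data matrix
X = [[1.0, 5.0, 2.0], [4.0, 1.0, 2.0]]
D_rows = chebychev_row_distance(X)     # 2 x 2 matrix
D_cols = chebychev_column_distance(X)  # 3 x 3 matrix
```

Because only the largest coordinate difference counts, the Chebychev distance never exceeds the block distance for the same pair of rows.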

6. The cosine row similarity is defined as

$$\mbox{Cosine Similarity} = \frac{\sum_{k=1}^{p}{X_{ik} X_{jk}}} {\sqrt{\sum_{k=1}^{p}{X_{ik}^{2}}} \sqrt{\sum_{k=1}^{p}{X_{jk}^{2}}}}$$

The cosine distance is then defined as

$$\mbox{Cosine Distance} = 1 - \mbox{Cosine Similarity}$$

The cosine distance above is defined for positive values only. It is also not a proper distance in that the triangle inequality does not hold. However, the following angular definitions are proper distances:

$$\mbox{angular cosine distance} = \frac{\mbox{c} \arccos(\mbox{cosine similarity})} {\pi}$$

with $$\arccos$$ designating the arccosine function and where c = 2 if there are no negative values and c = 1 if there are negative values.

$$\mbox{angular cosine similarity} = 1 - \mbox{angular cosine distance}$$

If negative values are encountered in the input, the cosine distances will not be computed. However, the cosine similarities will be computed.

The column distance and similarity are defined similarly, but the summations are over the rows rather than the columns.
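The cosine and angular cosine definitions can be sketched in Python for a single pair of rows. This is not Dataplot code. The function names are hypothetical, and checking only the two rows being compared for negative values (rather than the whole matrix) is an assumption made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def angular_cosine_distance(u, v):
    """c * arccos(similarity) / pi, with c = 2 for non-negative data and c = 1 otherwise."""
    has_negative = any(a < 0 for a in list(u) + list(v))  # assumption: check these two rows only
    c = 1.0 if has_negative else 2.0
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding outside [-1, 1]
    return c * math.acos(s) / math.pi

# Hypothetical rows separated by a 45 degree angle
row_i = [1.0, 0.0]
row_j = [1.0, 1.0]
sim = cosine_similarity(row_i, row_j)
dist = cosine_distance(row_i, row_j)
ang_dist = angular_cosine_distance(row_i, row_j)
ang_sim = 1.0 - ang_dist
```

With non-negative data the angle lies between 0 and 90 degrees, so the factor c = 2 rescales the angular distance to the interval from 0 to 1.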

7. The Canberra row distance is defined as

$$D_{ij} = \sum_{k=1}^{p} {\frac{|X_{ik} - X_{jk}|} {|X_{ik}| + |X_{jk}|}}$$

The column distance is similar, but the summation is over the rows rather than the columns.

The Canberra distance is a weighted version of the block (Manhattan) distance.
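A short Python sketch of the Canberra distance between two rows follows. It is not Dataplot code. The function name is hypothetical, and skipping terms where both elements are zero (which would otherwise give 0/0) is an assumption; the source does not say how Dataplot treats that case.

```python
def canberra_distance(u, v):
    """Sum of |a - b| / (|a| + |b|) over corresponding elements of u and v."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0.0:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total

# Hypothetical rows
row_i = [1.0, 2.0, 0.0]
row_j = [3.0, 2.0, 0.0]
d = canberra_distance(row_i, row_j)  # 2/4 + 0/4 + (skipped)
```

Each term is a block-distance term divided by the magnitude of the two values, so differences between small values weigh more heavily than the same differences between large values.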

8. The Jaccard row similarity is defined as

$$S_{ij} = \frac{\sum_{k=1}^{p}{\min(X_{ik},X_{jk})}} {\sum_{k=1}^{p}{\max(X_{ik},X_{jk})}}$$

Then the Jaccard row distance is defined as

$$D_{ij} = 1 - S_{ij}$$

The Jaccard column distance and similarity are defined similarly, but the summation is over the rows rather than the columns.
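The Jaccard similarity and distance defined above can be sketched in Python for one pair of rows. This is not Dataplot code; the function names and data are hypothetical, and the example assumes non-negative values with at least one non-zero element.

```python
def jaccard_similarity(u, v):
    """Ratio of the sum of element-wise minima to the sum of element-wise maxima."""
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    return numerator / denominator

def jaccard_distance(u, v):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(u, v)

# Hypothetical non-negative rows
row_i = [1.0, 2.0, 3.0]
row_j = [2.0, 2.0, 1.0]
s = jaccard_similarity(row_i, row_j)  # (1 + 2 + 1) / (2 + 2 + 3)
d = jaccard_distance(row_i, row_j)
```

Identical rows give a similarity of 1 and a distance of 0.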

9. The Pearson row distance is defined as

$$D_{ij} = (1 - R_{ij})/2$$

where Rij is the correlation coefficient between rows i and j.

The Pearson row similarity is then defined as

$$S_{ij} = 1 - D_{ij}$$

The Pearson column distance and similarity are defined similarly, but the correlation is over the rows rather than the columns.
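The Pearson distance and similarity can be sketched in Python for one pair of rows. This is not Dataplot code; the function names and data are hypothetical.

```python
import math

def pearson_correlation(u, v):
    """Pearson correlation coefficient between vectors u and v."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)

def pearson_distance(u, v):
    """(1 - R) / 2, which maps R in [-1, 1] onto a distance in [0, 1]."""
    return (1.0 - pearson_correlation(u, v)) / 2.0

def pearson_similarity(u, v):
    """One minus the Pearson distance."""
    return 1.0 - pearson_distance(u, v)

# Hypothetical rows: perfectly correlated and perfectly anti-correlated pairs
row_a = [1.0, 2.0, 3.0]
row_b = [2.0, 4.0, 6.0]
row_c = [3.0, 2.0, 1.0]
d_ab = pearson_distance(row_a, row_b)    # R = 1
d_ac = pearson_distance(row_a, row_c)    # R = -1
s_ab = pearson_similarity(row_a, row_b)
```

Note that this distance depends only on the correlation, so rows that differ by a positive scale factor and a shift have distance 0.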

10. The Hamming row distance is defined as

$$D_{ij}$$ = number of elements that differ between rows i and j

The column distance is similar, but the number of elements that differ is compared between two columns rather than two rows.
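The Hamming definition amounts to a simple count, as in this Python sketch. It is not Dataplot code; the function name and data are hypothetical, and exact equality is used to decide whether two elements differ.

```python
def hamming_distance(u, v):
    """Count the positions at which corresponding elements of u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

# Hypothetical rows of coded values
row_i = [1, 0, 1, 1]
row_j = [1, 1, 0, 1]
d = hamming_distance(row_i, row_j)  # positions 2 and 3 differ
```

This metric is most natural for categorical or binary data, where the size of a difference carries no meaning.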

Many multivariate techniques are based on distance matrices.

Syntax 1:
LET <mat2> = <type> ROW DISTANCE <mat1>
where <mat1> is a matrix for which the matrix distance is to be computed;
<type> is EUCLIDEAN, MAHALANOBIS, MINKOWSKY, BLOCK, CHEBYCHEV, CANBERRA, JACCARD, PEARSON, COSINE, ANGULAR COSINE, or HAMMING and defines the type of distance to compute;
and where <mat2> is a matrix where the resulting distance matrix is saved.

This syntax computes row distances.

Syntax 2:
LET <mat2> = <type> COLUMN DISTANCE <mat1>
where <mat1> is a matrix for which the matrix distance is to be computed;
<type> is EUCLIDEAN, MAHALANOBIS, MINKOWSKY, BLOCK, CHEBYCHEV, CANBERRA, JACCARD, PEARSON, COSINE, ANGULAR COSINE, or HAMMING and defines the type of distance to compute;
and where <mat2> is a matrix where the resulting distance matrix is saved.

This syntax computes column distances.

Syntax 3:
LET <mat2> = <type> ROW SIMILARITY <mat1>
where <mat1> is a matrix for which the matrix similarity is to be computed;
<type> is JACCARD, PEARSON, COSINE, or ANGULAR COSINE and defines the type of similarity to compute;
and where <mat2> is a matrix where the resulting similarity matrix is saved.

This syntax computes row similarities.

Syntax 4:
LET <mat2> = <type> COLUMN SIMILARITY <mat1>
where <mat1> is a matrix for which the matrix similarity is to be computed;
<type> is JACCARD, PEARSON, COSINE, or ANGULAR COSINE and defines the type of similarity to compute;
and where <mat2> is a matrix where the resulting similarity matrix is saved.

This syntax computes column similarities.

Examples:
LET D = EUCLIDEAN ROW DISTANCE M
LET D = EUCLIDEAN COLUMN DISTANCE M

LET D = BLOCK ROW DISTANCE M
LET D = BLOCK COLUMN DISTANCE M

LET D = MAHALANOBIS ROW DISTANCE M
LET D = MAHALANOBIS COLUMN DISTANCE M

LET P = 1.5
LET D = MINKOWSKY ROW DISTANCE M
LET D = MINKOWSKY COLUMN DISTANCE M

LET D = COSINE ROW DISTANCE M
LET D = COSINE COLUMN DISTANCE M

LET D = COSINE ROW SIMILARITY M
LET D = COSINE COLUMN SIMILARITY M

LET D = JACCARD ROW DISTANCE M
LET D = JACCARD COLUMN DISTANCE M

LET D = JACCARD ROW SIMILARITY M
LET D = JACCARD COLUMN SIMILARITY M

LET D = PEARSON ROW DISTANCE M
LET D = PEARSON COLUMN DISTANCE M

LET D = PEARSON ROW SIMILARITY M
LET D = PEARSON COLUMN SIMILARITY M

Note:
Matrices are created with either the READ MATRIX command or the MATRIX DEFINITION command. Enter HELP MATRIX DEFINITION and HELP READ MATRIX for details.
Note:
For the Minkowsky distance, you need to specify the value of P. This is done by entering the following command before entering the MINKOWSKY DISTANCE command:

LET P = <value>
Note:
It is often desirable to scale the matrix before computing the distances. Dataplot provides several scaling options. Enter HELP MATRIX SCALE for details.
Note:
The correlation matrix and covariance matrix can be considered distance matrices as well.
Default:
None
Synonyms:
None
Related Commands:
READ MATRIX = Read a matrix.
MATRIX COLUMN DIMENSION = Dimension maximum number of columns for Dataplot matrices.
CORRELATION MATRIX = Compute the correlation matrix.
VARIANCE-COVARIANCE MATRIX = Compute the variance-covariance matrix.
DISTANCE FROM MEAN = Compute the distance from the mean for a matrix.
Reference:
"Graphical Exploratory Data Analysis", Du Toit, Steyn, and Stumpf, Springer-Verlag, 1986, pp. 74-77.

"Applied Multivariate Statistical Analysis", Third Edition, Johnson and Wichern, Prentice-Hall, 1992.

Applications:
Multivariate Analysis
Implementation Date:
1998/08
Program:
dimension 100 columns
set write decimals 4
.
let iflag1 = 1
. let iflag1 = 2
. let iflag1 = 3
. let iflag1 = 4
. let iflag1 = 5
. let iflag1 = 6
. let iflag1 = 7
. let iflag1 = 8
. let iflag1 = 9
. let iflag1 = 10
. let iflag1 = 11
let iflag2 = 1
. let iflag2 = 2
.
skip 25
.
if iflag1 = 1
if iflag2 = 1
let v = euclidean column distance x
else if iflag2 = 2
let v = euclidean row distance x
end of if
else if iflag1 = 2
if iflag2 = 1
let v = block column distance x
else if iflag2 = 2
let v = block row distance x
end of if
else if iflag1 = 3
let p = 1.5
if iflag2 = 1
let v = minkowski column distance x
else if iflag2 = 2
let v = minkowski row distance x
end of if
else if iflag1 = 4
if iflag2 = 1
let v = chebychev column distance x
else if iflag2 = 2
let v = chebychev row distance x
end of if
else if iflag1 = 5
if iflag2 = 1
let v = jaccard column distance x
else if iflag2 = 2
let v = jaccard row distance x
end of if
else if iflag1 = 6
if iflag2 = 1
let v = jaccard column similarity x
else if iflag2 = 2
let v = jaccard row similarity x
end of if
else if iflag1 = 7
if iflag2 = 1
set isubro cdis
let v = cosine column distance x
set isubro
else if iflag2 = 2
let v = cosine row distance x
end of if
else if iflag1 = 8
if iflag2 = 1
let v = cosine column similarity x
else if iflag2 = 2
let v = cosine row similarity x
end of if
else if iflag1 = 9
if iflag2 = 1
let v = hamming column distance x
else if iflag2 = 2
let v = hamming row distance x
end of if
else if iflag1 = 10
if iflag2 = 1
let v = canberra column distance x
else if iflag2 = 2
let v = canberra row distance x
end of if
else if iflag1 = 11
if iflag2 = 1
let v = pearson column distance x
else if iflag2 = 2
let v = pearson row distance x
end of if
end of if
.
print v

The following output is returned

MATRIX V       --            5 ROWS
--            5 COLUMNS

VARIABLES--V1             V2             V3             V4             V5

0.0000        36.1578        35.2624        61.4890        47.5358
36.1578         0.0000        30.1846        29.3828        18.4770
35.2624        30.1846         0.0000        37.6646        24.9542
61.4890        29.3828        37.6646         0.0000        14.9040
47.5358        18.4770        24.9542        14.9040         0.0000

NIST is an agency of the U.S. Commerce Department.

Date created: 06/05/2001
Last updated: 10/12/2018