BINARY MATCH DISSIMILARITY
Given two binary variables, the paired observations can be summarized in the following 2x2 table of counts:

Variable 1 (rows) / Variable 2 (columns) | Not Present | Present | Row Total |
---|---|---|---|
Not Present | A | B | A + B |
Present | C | D | C + D |
Column Total | A + C | B + D | A + B + C + D |

In the data, we use a value of "0" to denote "not present" and a value of "1" to denote "present".
The parameters A, B, C, and D denote the counts for each category, and the matching statistics combine them in different ways. A distinction is made between "symmetric" and "asymmetric" matching statistics. Symmetric statistics are typically preferred when the "0" and "1" outcomes are considered equally meaningful. Asymmetric statistics are preferred when the "1" outcome is more meaningful, for example when what matters is matching the presence of rare events.
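As a minimal sketch of how the counts arise, the Python function below tallies A, B, C, and D from two equal-length 0/1 sequences, using the table's labeling (A = both not present, D = both present). The name `binary_counts` is illustrative and is not a Dataplot command. The two vectors are columns X1 and X2 of the matrix used in the program example for this command.

```python
def binary_counts(v1, v2):
    """Return (A, B, C, D) for two equal-length sequences of 0/1 values.

    A: v1 = 0, v2 = 0    B: v1 = 0, v2 = 1
    C: v1 = 1, v2 = 0    D: v1 = 1, v2 = 1
    """
    if len(v1) != len(v2):
        raise ValueError("variables must have the same length")
    a = b = c = d = 0
    for x, y in zip(v1, v2):
        if x == 0 and y == 0:
            a += 1
        elif x == 0 and y == 1:
            b += 1
        elif x == 1 and y == 0:
            c += 1
        elif x == 1 and y == 1:
            d += 1
        else:
            raise ValueError("values must be coded as 0 or 1")
    return a, b, c, d


x1 = [1, 0, 0, 0, 1, 1, 0, 0]
x2 = [0, 1, 0, 1, 1, 1, 0, 0]
print(binary_counts(x1, x2))  # (3, 2, 1, 2)
```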
Specifically, the following statistics are defined.
Symmetric Binary Variables
Similarity:
Matching Coefficient: | \( \frac{A + D}{A + B + C + D} \) |
Rogers and Tanimoto: | \( \frac{A + D}{(A + D) + 2(B + C)} \) |
Sokal and Sneath: | \( \frac{2(A + D)}{2(A + D) + (B + C)} \) |
Dissimilarity:
Matching Coefficient: | \( \frac{B + C}{A + B + C + D} \) |
Rogers and Tanimoto: | \( \frac{2(B + C)}{(A + D) + 2(B + C)} \) |
Sokal and Sneath: | \( \frac{B + C}{2(A + D) + (B + C)} \) |
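For the symmetric statistics, each dissimilarity is one minus the corresponding similarity. The Python sketch below illustrates this with hypothetical helper names (`symmetric_similarities`, `symmetric_dissimilarities`) that are not part of Dataplot. The counts A = 3, B = 2, C = 1, D = 2 are an illustrative input.

```python
def symmetric_similarities(a, b, c, d):
    """Symmetric similarity coefficients from the 2x2 counts."""
    match = a + d        # pairs where both variables agree
    mismatch = b + c     # pairs where the variables disagree
    return {
        "matching": match / (match + mismatch),
        "rogers_tanimoto": match / (match + 2 * mismatch),
        "sokal_sneath": 2 * match / (2 * match + mismatch),
    }


def symmetric_dissimilarities(a, b, c, d):
    """Symmetric dissimilarity coefficients from the 2x2 counts."""
    match = a + d
    mismatch = b + c
    return {
        "matching": mismatch / (match + mismatch),
        "rogers_tanimoto": 2 * mismatch / (match + 2 * mismatch),
        "sokal_sneath": mismatch / (2 * match + mismatch),
    }


sim = symmetric_similarities(3, 2, 1, 2)
dis = symmetric_dissimilarities(3, 2, 1, 2)
for name in sim:
    # Each similarity/dissimilarity pair sums to 1 (up to rounding).
    print(name, round(sim[name], 3), round(dis[name], 3))
```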
Asymmetric Binary Variables (most important value coded as 1)
Similarity:
Jaccard Coefficient: | \( \frac{A}{A + B + C} \) |
Dice Coefficient: | \( \frac{2A}{2A + B + C} \) |
Sokal Coefficient: | \( \frac{A}{A + 2(B + C)} \) |
Dissimilarity:
Jaccard Coefficient: | \( \frac{B + C}{A + B + C} \) |
Dice Coefficient: | \( \frac{B + C}{2A + B + C} \) |
Sokal Coefficient: | \( \frac{2(B + C)}{A + 2(B + C)} \) |
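The asymmetric formulas can be sketched in Python as follows. This sketch applies the formulas exactly as written and assumes A is taken from the table as labeled (the count of pairs where both variables are not present). The function name `asymmetric_dissimilarities` is hypothetical. With the counts for columns X1 and X2 of the program example (A = 3, B = 2, C = 1, D = 2), the Jaccard dissimilarity comes out to 0.500, which agrees with the corresponding entry of AD1 in the sample output.

```python
def asymmetric_dissimilarities(a, b, c, d):
    """Asymmetric dissimilarities; d is unused because these
    formulas do not involve the D cell."""
    mismatch = b + c
    return {
        "jaccard": mismatch / (a + mismatch),
        "dice": mismatch / (2 * a + mismatch),
        "sokal": 2 * mismatch / (a + 2 * mismatch),
    }


dis = asymmetric_dissimilarities(3, 2, 1, 2)
for name, value in dis.items():
    print(name, round(value, 3))

# The matching similarities are 1 minus these values, e.g.
# Jaccard similarity = A / (A + B + C) = 1 - dis["jaccard"].
```

The sketch does not guard against a zero denominator (A + B + C = 0); how Dataplot treats that case is not stated here.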
Three related statistics are
Yule's Q: | \( \frac{A*D - B*C}{A*D + B*C} \) | |
Yule's Y: | \( \frac{\sqrt{A*D} - \sqrt{B*C}} {\sqrt{A*D} + \sqrt{B*C}} \) | |
Youden index: | \( \frac{A*D - B*C}{(A+B)(C+D)} \) |
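A short Python sketch of these three statistics, written directly from the formulas above, is given below. The function names are illustrative and the counts are the same illustrative values A = 3, B = 2, C = 1, D = 2. The sketch assumes the denominators are nonzero.

```python
import math


def yule_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)


def yule_y(a, b, c, d):
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


def youden_index(a, b, c, d):
    # Denominator is the product of the two row totals.
    return (a * d - b * c) / ((a + b) * (c + d))


counts = (3, 2, 1, 2)
print("Yule's Q:", round(yule_q(*counts), 4))
print("Yule's Y:", round(yule_y(*counts), 4))
print("Youden index:", round(youden_index(*counts), 4))
```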
These statistics are often used to create dissimilarity or similarity matrices that will be used as input to various multivariate procedures such as clustering.
The above statistics were taken from Kaufman and Rousseeuw (see Reference below). They recommend using the matching coefficient for the symmetric case and the Jaccard coefficient for the asymmetric case. However, the above list is not exhaustive, and other authors recommend other choices. Other sources may also give somewhat different formulas for these statistics.
The Youden index (also known as Youden's J statistic) can also be expressed as "sensitivity + specificity - 1". It has a value from 0 (a test gives the same proportion of positive results for groups with and without the disease, i.e., the test has no value) to 1 (there are no false positives and no false negatives).
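The equivalence between the count-based formula and "sensitivity + specificity - 1" can be checked numerically. The Python sketch below assumes that Variable 1 (the rows of the table) is the true disease status and Variable 2 (the columns) is the test result, so that specificity = A/(A + B) and sensitivity = D/(C + D). The counts are made up for illustration only.

```python
def youden_from_counts(a, b, c, d):
    return (a * d - b * c) / ((a + b) * (c + d))


def youden_from_rates(a, b, c, d):
    specificity = a / (a + b)   # true negatives / all without the disease
    sensitivity = d / (c + d)   # true positives / all with the disease
    return sensitivity + specificity - 1


# Hypothetical counts: 100 subjects without and 100 with the disease.
counts = (80, 20, 10, 90)
print(round(youden_from_counts(*counts), 6))
print(round(youden_from_rates(*counts), 6))
```

Both calls give the same value up to floating-point rounding.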
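Yule's Q can take a value from -1 to +1, where -1 indicates total negative correlation, 0 indicates no association, and +1 indicates total positive correlation. Yule's Q is related to the odds ratio, \( OR = \frac{A*D}{B*C} \), in the following way:
\( Q = \frac{OR - 1}{OR + 1} \)
Yule's Y can be defined in terms of Yule's Q as
\( Y = \frac{1 - \sqrt{1 - Q^2}}{Q} \)
or in terms of the odds ratio as
\( Y = \frac{\sqrt{OR} - 1}{\sqrt{OR} + 1} \)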
Yule's Y is also known as the coefficient of colligation.
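The links between Yule's Q, Yule's Y, and the odds ratio OR = AD/(BC) can be verified numerically. The Python sketch below uses the standard identities Q = (OR - 1)/(OR + 1), Y = (sqrt(OR) - 1)/(sqrt(OR) + 1), and Y = (1 - sqrt(1 - Q^2))/Q. It assumes B*C and Q are nonzero, and the counts are illustrative.

```python
import math

a, b, c, d = 3, 2, 1, 2   # illustrative counts

odds_ratio = (a * d) / (b * c)

# Direct definitions from the 2x2 counts.
q = (a * d - b * c) / (a * d + b * c)
y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

# The same quantities expressed through the odds ratio and through Q.
q_from_or = (odds_ratio - 1) / (odds_ratio + 1)
y_from_or = (math.sqrt(odds_ratio) - 1) / (math.sqrt(odds_ratio) + 1)
y_from_q = (1 - math.sqrt(1 - q * q)) / q

print(math.isclose(q, q_from_or))
print(math.isclose(y, y_from_or))
print(math.isclose(y, y_from_q))
```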
If the data is not coded as 0's and 1's, Dataplot will check for the number of distinct values. If there are two distinct values, the minimum value is converted to 0's and the maximum value is converted to 1's. If there is a single distinct value, it is converted to 0's if it is less than 0.5 and to 1's if it is greater than or equal to 0.5. If there are more than two distinct values, an error is returned.
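The recoding rule described above can be sketched in Python as follows. This is an illustration of the stated rule, not Dataplot's actual implementation, and `recode_binary` is a hypothetical name.

```python
def recode_binary(values):
    """Recode a sequence to 0/1 following the rule described above."""
    distinct = sorted(set(values))
    if len(distinct) > 2:
        raise ValueError("more than two distinct values")
    if len(distinct) == 2:
        low = distinct[0]
        # Minimum value becomes 0, maximum value becomes 1.
        return [0 if v == low else 1 for v in values]
    # Single distinct value: 0 if below 0.5, otherwise 1.
    return [0 if v < 0.5 else 1 for v in values]


print(recode_binary([2, 5, 5, 2]))   # [0, 1, 1, 0]
print(recode_binary([3, 3, 3]))      # [1, 1, 1]
print(recode_binary([0.2, 0.2]))     # [0, 0]
```

Data already coded as 0's and 1's passes through this rule unchanged.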
PEARSON DISSIMILARITY | = | Compute the dissimilarity of two variables based on Pearson correlation. |
SPEARMAN DISSIMILARITY | = | Compute the dissimilarity of two variables based on Spearman's rank correlation. |
KENDALL TAU DISSIMILARITY | = | Compute the dissimilarity of two variables based on Kendall's tau correlation. |
COSINE DISTANCE | = | Compute the cosine distance. |
MANHATTAN DISTANCE | = | Compute the Manhattan distance. |
EUCLIDEAN DISTANCE | = | Compute the Euclidean distance. |
MATRIX DISTANCE | = | Compute various distance metrics for a matrix. |
GENERATE MATRIX <stat> | = | Compute a matrix of pairwise statistic values. |
CLUSTER | = | Perform a cluster analysis. |
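Reference:
Kaufman, L. and Rousseeuw, P.J. (1990), "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley.
Youden, W.J. (1950), "Index for Rating Diagnostic Tests," Cancer, Vol. 3, pp. 32–35.
Yule, G. Udny (1912), "On the Methods of Measuring Association Between Two Attributes," Journal of the Royal Statistical Society, Vol. 75, No. 6, pp. 579–652.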
Program:

    . Example from page 24 of Kaufman and Rousseeuw text.
    . The rows are 8 people and the columns are 10 binary variables
    .
    set write decimals 3
    dimension 100 columns
    .
    read matrix x
    1 0 1 1 0 0 1 0 0 0
    0 1 0 0 1 0 0 0 0 0
    0 0 1 0 0 0 1 0 0 1
    0 1 0 0 0 0 0 1 1 0
    1 1 0 0 1 1 0 1 1 0
    1 1 0 0 1 0 1 1 0 0
    0 0 0 1 0 1 0 0 0 0
    0 0 0 1 0 1 0 0 0 0
    end of data
    .
    let d = generate matrix binary match dissimilarity ...
            x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
    print d1 d2 d3 d4 d5
    print d6 d7 d8 d9 d10
    .
    let ad = generate matrix binary jaccard dissimilarity ...
             x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
    print ad1 ad2 ad3 ad4 ad5
    print ad6 ad7 ad8 ad9 ad10

The following output is generated

    ---------------------------------------------------------------------------
                 D1             D2             D3             D4             D5
    ---------------------------------------------------------------------------
              0.000          0.375          0.375          0.500          0.250
              0.375          0.000          0.750          0.875          0.125
              0.375          0.750          0.000          0.375          0.625
              0.500          0.875          0.375          0.000          0.750
              0.250          0.125          0.625          0.750          0.000
              0.500          0.625          0.625          0.250          0.500
              0.250          0.625          0.125          0.500          0.500
              0.250          0.125          0.625          0.750          0.250
              0.375          0.250          0.500          0.625          0.375
              0.500          0.625          0.125          0.500          0.500

    ---------------------------------------------------------------------------
                 D6             D7             D8             D9            D10
    ---------------------------------------------------------------------------
              0.500          0.250          0.250          0.375          0.500
              0.625          0.625          0.125          0.250          0.625
              0.625          0.125          0.625          0.500          0.125
              0.250          0.500          0.750          0.625          0.500
              0.500          0.500          0.250          0.375          0.500
              0.000          0.750          0.500          0.375          0.500
              0.750          0.000          0.500          0.625          0.250
              0.500          0.500          0.000          0.125          0.500
              0.375          0.625          0.125          0.000          0.375
              0.500          0.250          0.500          0.375          0.000

    ---------------------------------------------------------------------------
                AD1            AD2            AD3            AD4            AD5
    ---------------------------------------------------------------------------
              0.000          0.500          0.429          0.571          0.333
              0.500          0.000          0.750          0.875          0.200
              0.429          0.750          0.000          0.429          0.625
              0.571          0.875          0.429          0.000          0.750
              0.333          0.200          0.625          0.750          0.000
              0.571          0.714          0.625          0.333          0.571
              0.333          0.714          0.167          0.571          0.571
              0.333          0.200          0.625          0.750          0.333
              0.429          0.333          0.500          0.625          0.429
              0.500          0.625          0.143          0.500          0.500

    ---------------------------------------------------------------------------
                AD6            AD7            AD8            AD9           AD10
    ---------------------------------------------------------------------------
              0.571          0.333          0.333          0.429          0.500
              0.714          0.714          0.200          0.333          0.625
              0.625          0.167          0.625          0.500          0.143
              0.333          0.571          0.750          0.625          0.500
              0.571          0.571          0.333          0.429          0.500
              0.000          0.750          0.571          0.429          0.500
              0.750          0.000          0.571          0.625          0.286
              0.571          0.571          0.000          0.167          0.500
              0.429          0.625          0.167          0.000          0.375
              0.500          0.286          0.500          0.375          0.000

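As a cross-check outside Dataplot, the Python sketch below recomputes both dissimilarity matrices from the same 8 x 10 data. It is not how Dataplot computes them internally. It assumes the matching dissimilarity is (B + C)/(A + B + C + D) and the Jaccard dissimilarity is (B + C)/(A + B + C), with A counted as in the table above (both variables not present). Under those assumptions the first rows agree with the printed output to three decimals.

```python
rows = [
    [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
]
# Columns X1..X10 are the variables being compared.
columns = [list(col) for col in zip(*rows)]


def counts(v1, v2):
    a = sum(1 for x, y in zip(v1, v2) if x == 0 and y == 0)
    b = sum(1 for x, y in zip(v1, v2) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(v1, v2) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(v1, v2) if x == 1 and y == 1)
    return a, b, c, d


def matching_dissimilarity(v1, v2):
    a, b, c, d = counts(v1, v2)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(v1, v2):
    a, b, c, _ = counts(v1, v2)
    return (b + c) / (a + b + c)


d_matrix = [[matching_dissimilarity(u, v) for v in columns] for u in columns]
ad_matrix = [[jaccard_dissimilarity(u, v) for v in columns] for u in columns]

print([round(v, 3) for v in d_matrix[0]])
print([round(v, 3) for v in ad_matrix[0]])
```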
NIST is an agency of the U.S. Commerce Department.
Date created: 09/20/2017
Last updated: 08/30/2019