ROC CURVE

Name:

ROC CURVE (LET)

Type:

Graphics Command

Purpose:

Generate a Receiver Operating Characteristic (ROC) curve.

Description:

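                            Variable 2
Variable 1      Success       Failure       Row Total
-----------------------------------------------------
Success         N₁₁           N₁₂           N₁₁ + N₁₂
Failure         N₂₁           N₂₂           N₂₁ + N₂₂
-----------------------------------------------------
Column Total    N₁₁ + N₂₁     N₁₂ + N₂₂     N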

The parameters N₁₁, N₁₂, N₂₁, and N₂₂ denote the counts for each category.

Success and failure can denote any binary response. Dataplot expects "success" to be coded as "1" and "failure" to be coded as "0". Some typical examples would be:

Variable 1 denotes whether or not a patient has a disease (1 denotes disease is present, 0 denotes disease not present). Variable 2 denotes the result of a test to detect the disease (1 denotes a positive result and 0 denotes a negative result).
Variable 1 denotes whether an object is present or not (1 denotes present, 0 denotes absent). Variable 2 denotes a detection device (1 denotes object detected and 0 denotes object not detected).

In these examples, the "ground truth" is typically given as variable 1 while some estimator of the ground truth is given as variable 2. In the above table, we can define the following quantities:

True Positives = N₁₁ (i.e., number of cases where the disease is present and the test detects it)
True Negatives = N₂₂ (i.e., number of cases where the disease is not present and the test does not detect it)
False Positives = N₂₁ (i.e., number of cases where the disease is not present and the test detects it)
False Negatives = N₁₂ (i.e., number of cases where the disease is present and the test does not detect it)
Sensitivity = N₁₁/(N₁₁+N₁₂) (i.e., the probability that the test detects the disease given that the disease is present)
Specificity = N₂₂/(N₂₁+N₂₂) (i.e., the probability that the test does not detect the disease given that the disease is not present)

The ROC curve is a plot of the sensitivity (vertical axis) versus 1 - specificity (horizontal axis), where 1 - specificity is the false positive rate. Points in the upper left corner (i.e., high sensitivity and high specificity) are desirable.
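The definitions above can be sketched outside Dataplot as follows. This is a minimal Python illustration, not Dataplot's implementation; the function names and the sample data are hypothetical. It tallies the 2x2 table from paired 0/1 data (variable 1 as ground truth, variable 2 as the test result) and returns the corresponding point on the ROC plot.

```python
def confusion_counts(truth, test):
    """Return (n11, n12, n21, n22) for paired sequences coded as 0/1.

    truth plays the role of variable 1 (ground truth) and
    test plays the role of variable 2 (test result).
    """
    if len(truth) != len(test):
        raise ValueError("truth and test must have the same length")
    n11 = n12 = n21 = n22 = 0
    for t, d in zip(truth, test):
        if t == 1 and d == 1:
            n11 += 1  # true positive
        elif t == 1 and d == 0:
            n12 += 1  # false negative
        elif t == 0 and d == 1:
            n21 += 1  # false positive
        elif t == 0 and d == 0:
            n22 += 1  # true negative
        else:
            raise ValueError("values must be coded as 0 or 1")
    return n11, n12, n21, n22


def roc_point(n11, n12, n21, n22):
    """Return (1 - specificity, sensitivity), i.e. the (x, y) plot point.

    Assumes each row of the 2x2 table has at least one case.
    """
    sensitivity = n11 / (n11 + n12)
    specificity = n22 / (n21 + n22)
    return 1.0 - specificity, sensitivity


# Hypothetical data: 1 = present / detected, 0 = absent / not detected.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
test = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(truth, test)
x, y = roc_point(*counts)
print("N11, N12, N21, N22 =", counts)
print("1 - specificity =", x, " sensitivity =", y)
```

Here three of the four true cases are detected and one of the six negative cases raises a false alarm, so the point lies well above the diagonal.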

We have two typical scenarios for generating the ROC curve.

1. We have a medical test and we want to determine an optimal level for deciding whether the disease is present. Setting the level too low results in too many false negatives (i.e., we fail to detect the disease when it is in fact present). This is low sensitivity. On the other hand, if we set the level too high we may obtain too many false positives (i.e., we detect the disease when it is in fact not present). This is low specificity.
   In this case, we typically want to generate the ROC curve as a connected line to show the tradeoff between sensitivity and specificity as we change the threshold level.
2. We are testing sensors to determine which provides the best performance in detecting some substances.
   Since these are distinct devices, we would typically want to plot these as distinct points rather than as a connected curve.

You can also combine these scenarios. That is, we may be testing multiple devices (scenario 2), each of which has multiple settings (scenario 1).

Syntax 1:

This syntax is used for the cases where

we have multiple settings for a single machine
we have multiple machines each with a single setting

Syntax 2:

This syntax is used for the case where we have multiple settings for multiple machines.

Examples:

Note:

Points in the upper left corner denote high accuracy.
Dataplot draws a line from the (0,0) point to the (1,1) point. This is referred to as the no discrimination line. Points falling on this line indicate a test that is no better than flipping a coin.

For the case where we are changing the threshold of a test, the ROC curve does an excellent job of demonstrating the tradeoff between specificity and sensitivity. That is, as we decrease the chance of a false negative (i.e., we do not miss detection), we inevitably increase the chance of a false positive. What we are looking for is a test whose curve follows the left edge of the plot (the y-axis) and then the top edge. Conversely, the closer the curve is to the no discrimination line, the poorer the test.
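The tradeoff described above can be illustrated with a short Python sketch. It is not Dataplot code, and the scores, thresholds, and function name are hypothetical. It assumes a case is called positive when its score is at or above the threshold; with the opposite orientation the sweep simply runs in the other direction.

```python
def roc_curve_points(scores, truth, thresholds):
    """Return one (1 - specificity, sensitivity) point per threshold.

    Assumed decision rule: a case is positive when score >= threshold.
    Assumes truth contains at least one 1 and at least one 0.
    """
    points = []
    for thr in thresholds:
        tp = fn = fp = tn = 0
        for s, t in zip(scores, truth):
            positive = s >= thr
            if t == 1 and positive:
                tp += 1
            elif t == 1:
                fn += 1
            elif positive:
                fp += 1
            else:
                tn += 1
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical test scores; truth is 1 when the condition is present.
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
thresholds = [0.95, 0.65, 0.5, 0.2]  # strictest to most lenient

points = roc_curve_points(scores, truth, thresholds)
for thr, (x, y) in zip(thresholds, points):
    print(thr, x, y)
```

As the threshold is relaxed, the sensitivity never decreases, but neither does the false positive rate, which is the tradeoff the connected curve is meant to show.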

Note:

The three (or four) variables must have the same number of elements.

Note:

Raw data - in this case, the variables contain 0's and 1's.
If the data is not coded as 0's and 1's, Dataplot will check for the number of distinct values. If there are two distinct values, the minimum value is converted to 0's and the maximum value is converted to 1's. If there is a single distinct value, it is converted to 0's if it is less than 0.5 and to 1's if it is greater than or equal to 0.5. If there are more than two distinct values, an error is returned.
Summary data - if there are two observations, the data is assumed to be the 2x2 summary table. That is,

Note:

In one case, we may want to compare instruments from different vendors. Here the ROC curve is used to help determine which vendor has the best instrument.
In the other case, we may be able to change the level at which we determine whether or not we have detected the object. Here the ROC curve can be used to help determine an optimal setting for the instrument. There is typically a trade-off between sensitivity and specificity (i.e., as our instrument becomes more sensitive to the presence of the object, we also increase the probability of a false positive).

Of course, we can have a combination of these cases (i.e., multiple instruments each with multiple possible settings).

Note:

For Syntax 1, the following traces are generated for the plot:

trace 1 - a line from (0,0) to (1,1). This is the "no discrimination line".
trace 2 - a curve containing all the points on the ROC curve.
trace 3 and above - each point is drawn as a separate trace. This is useful for the case when each point represents a distinct instrument.

For Syntax 2, the following traces are generated for the plot:

trace 1 - a line from (0,0) to (1,1). This is the "no discrimination line".
trace 2 and above - each curve contains all the settings for one group (i.e., trace 2 contains the settings for group 1, trace 3 contains the settings for group 2, and so on).

Note:

The area under the curve (AUC) statistic is only meaningful for the case where we are plotting different settings of the same instrument.

For the case where we have multiple settings for multiple vendors, we write the AUC statistic to the file dpst1f.dat (in the current directory). Column 1 contains the group-id value and column 2 contains the value of the AUC statistic for that group.
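The source does not state how Dataplot computes the AUC, so the following Python sketch is only an illustration of one common choice: the trapezoidal rule applied to the ROC points, with the endpoints (0,0) and (1,1) added. The function name and the sample points are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (1 - specificity, sensitivity) points.

    Assumption: the curve is anchored at (0, 0) and (1, 1) and the
    points are joined by straight line segments.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points for several settings of one instrument.
roc_points = [(0.0, 0.75), (0.25, 0.75), (0.75, 1.0)]
auc = auc_trapezoid(roc_points)

# A single point on the no discrimination line gives an area of 0.5.
auc_diagonal = auc_trapezoid([(0.5, 0.5)])
print(auc, auc_diagonal)
```

Under this convention a curve that hugs the left and top edges of the plot has an area close to 1, while a curve on the no discrimination line has an area of 0.5.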

Default:

None

Synonyms:

None

Related Commands:

ROSE PLOT	= Generate a Rose plot.
BINARY TABULATION PLOT	= Generate a binary tabulation plot.
TRUE POSITIVES	= Compute the proportion of true positives.
FALSE POSITIVES	= Compute the proportion of false positives.
TRUE NEGATIVES	= Compute the proportion of true negatives.
FALSE NEGATIVES	= Compute the proportion of false negatives.
POSITIVE PREDICTIVE VALUE	= Compute the positive predictive value.
NEGATIVE PREDICTIVE VALUE	= Compute the negative predictive value.
TEST SENSITIVITY	= Compute the test sensitivity.
TEST SPECIFICITY	= Compute the test specificity.
ODDS RATIO	= Compute the bias corrected odds ratio.
LOG ODDS RATIO	= Compute the bias corrected log(odds ratio).
ODDS RATIO STANDARD ERROR	= Compute the standard error of the bias corrected log(odds ratio).

Reference:

Applied Logistic Regression

Applications:

Categorical Data Analysis

Implementation Date:

2007/7: Support for syntax 2 added

Program 1:

let n = 1
.
let p = 0.2
let y1 = binomial rand numb for i = 1 1 100
let p = 0.1
let y2 = binomial rand numb for i = 1 1 100
.
let p = 0.4
let y1 = binomial rand numb for i = 101 1 200
let p = 0.08
let y2 = binomial rand numb for i = 101 1 200
.
let p = 0.15
let y1 = binomial rand numb for i = 201 1 300
let p = 0.18
let y2 = binomial rand numb for i = 201 1 300
.
let p = 0.6
let y1 = binomial rand numb for i = 301 1 400
let p = 0.45
let y2 = binomial rand numb for i = 301 1 400
.
let p = 0.3
let y1 = binomial rand numb for i = 401 1 500
let p = 0.1
let y2 = binomial rand numb for i = 401 1 500
.
let x = sequence 1 100 1 5
.
limits 0  1
major xtic mark number 6
minor xtic mark number 1
tic mark offset 0.05 0.05
.
character blank blank 1 2 3 4 5
line blank all
line dotted
.
title case asis
title offset 2
title ROC Curve
y1label Sensitivity
x1label 1 - Specificity
.
roc curve y1 y2 x

Program 2:

 
.  Following sample data from Wikipedia site
read y1 y2 x
63 37 1
28 72 1
77 23 2
77 23 2
24 76 3
88 12 3
88 12 4
24 76 4
end of data
.
character blank blank A B C D
line blank all
line dotted
limits 0 1
major tic mark number 6
minor tic mark number 1
tic mark offset 0.05 0.05
.
label case asis
title case asis
title offset 2
title ROC Curve
y1label Sensitivity
x1label 1 - Specificity
.
roc curve y1 y2 x
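
As a cross-check on the summary data in Program 2, the plotted points can be reproduced by hand. The Python sketch below is not Dataplot code and its function name is hypothetical. It assumes that, within each group, the first row holds (N₁₁, N₁₂) and the second row holds (N₂₁, N₂₂); that layout is an assumption made for this illustration.

```python
def summary_roc_points(rows):
    """Map each group id to its (1 - specificity, sensitivity) point.

    rows holds (y1, y2, group) triples. Assumed layout per group:
    first row = (N11, N12), second row = (N21, N22).
    """
    groups = {}
    for y1, y2, g in rows:
        groups.setdefault(g, []).append((y1, y2))
    points = {}
    for g, table in groups.items():
        if len(table) != 2:
            raise ValueError("each group needs exactly two rows")
        (n11, n12), (n21, n22) = table
        sensitivity = n11 / (n11 + n12)
        specificity = n22 / (n21 + n22)
        points[g] = (1.0 - specificity, sensitivity)
    return points


# The y1, y2, x values read in Program 2.
rows = [
    (63, 37, 1), (28, 72, 1),
    (77, 23, 2), (77, 23, 2),
    (24, 76, 3), (88, 12, 3),
    (88, 12, 4), (24, 76, 4),
]

points_by_group = summary_roc_points(rows)
for g, (x, y) in sorted(points_by_group.items()):
    print(g, round(x, 2), round(y, 2))
```

Under that assumed layout, group 2 falls on the no discrimination line, group 3 falls below it, and groups 1 and 4 fall above it.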