AUTOMATIC VARIABLE NAMES
Dataplot normally reads variable names on the READ command.
However, many ASCII files will have the name of the variables
given directly in the file or Dataplot can assign the variable
names automatically.
Specific methods include the following.
- Many of the sample files provided in the Dataplot
installation use a syntax like
Y X1 X2
----------------
<data values>
For these files, you can enter the commands
SKIP AUTOMATIC
SET VARIABLE NAMES FILE
READ FILE.DAT
In this case, Dataplot will skip all lines until a line
starting with three or more hyphens is encountered. It
will then backspace to the previous line and read the
variable names from that line.
Alternatively, you can specify the number of lines to skip with
the command
SKIP N
where N specifies the number of lines in the header. If the
SET VARIABLE NAMES FILE command has been given, Dataplot will
look at the last line of the header (i.e., line N). If it
starts with three or more hyphens, then Dataplot assumes that
the variable names are on line N-1. If line N does not start
with hyphens, then Dataplot assumes that the variable names are
on line N.
- Many non-Dataplot ASCII data files will have the variable names on
the first line of the file. For these files, you can enter the
commands
SET READ VARIABLE LABELS ON
READ FILE.DAT
Dataplot then assumes line one of the file contains the variable
names and the data start with line two.
This is equivalent to using
SKIP 1
SET VARIABLE NAMES FILE
READ FILE.DAT
- If you would like Dataplot to simply assign the variable
names, enter the commands
SET VARIABLE NAMES AUTOMATIC
READ FILE.DAT
Dataplot will read the first line of the file to determine
the number of variables. It will then assign the names
COL1, COL2, and so on to the variables. Prior to
2014/10, Dataplot used default names of X1, X2, and so on.
You can specify what to use for the base of the variable
names by entering the command
SET AUTOMATIC VARIABLE BASE NAME <value>
The use of SET VARIABLE NAMES AUTOMATIC applies to either the
SKIP AUTOMATIC or the SKIP N cases.
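The header rules above for SKIP AUTOMATIC and SKIP N can be sketched
in Python as follows. This is only an illustration of the described
logic, not Dataplot's implementation. The function names are
hypothetical, names are assumed to be separated by white space, and
ignoring leading blanks before the hyphens is an assumption.

```python
def find_names_skip_automatic(lines):
    """SKIP AUTOMATIC: names are on the line before the first hyphen line."""
    for i, line in enumerate(lines):
        if i > 0 and line.lstrip().startswith("---"):
            # Back up one line for the names; data follow the hyphen line.
            return lines[i - 1].split(), lines[i + 1:]
    raise ValueError("no line starting with three or more hyphens")


def find_names_skip_n(lines, n):
    """SKIP N: use header line N, or line N-1 if line N is hyphens."""
    last = lines[n - 1]
    if last.lstrip().startswith("---"):
        names_line = lines[n - 2]
    else:
        names_line = last
    return names_line.split(), lines[n:]


sample = ["Sample file", "Y X1 X2", "----------------", "1 2 3", "4 5 6"]
names_auto, data_auto = find_names_skip_automatic(sample)
names_n, data_n = find_names_skip_n(sample, 3)
```

Both calls locate the same names line here because line 3 of the
sample header is the hyphen line.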
Note that Dataplot's usual rules for variable names still apply.
That is, at most eight characters are used and variable names are
delimited by spaces. Commas can also be used as the delimiter, and
non-printing characters such as tabs are treated as spaces. The use
of special characters (i.e., anything other than a number or an
alphabetic character) is discouraged. You may need to edit the file
if the variable names do not follow these rules. Characters beyond
the eighth are simply ignored, so the main issue is duplicate
variable names in the first eight characters.
The SET VARIABLE NAMES <AUTOMATIC/FILE> command was implemented in
2014/10. Prior to that, SKIP AUTOMATIC behaved as SET VARIABLE NAMES
FILE does now, and SKIP N behaved as SET VARIABLE NAMES AUTOMATIC
does now. The default is AUTOMATIC.
Note 2020/08: The following tweaks were made to the reading of
variable names.
- Previously, only the first 255 characters of the line were read.
This has been extended to support the number of characters
specified by the MAXIMUM RECORD LENGTH command (if this command is
not given, the default remains 255).
- Dataplot will now automatically strip spaces and other special
characters out of the variable names. Specifically, only
alphabetic characters (A-Z), numbers (0-9), and underscores are
retained.
- Dataplot only supports eight characters for variable names. This
can lead to duplicate variable names. To reduce the possibility of
duplicate names, Dataplot does the following if a duplicate name
is found.
- If the name has fewer than eight characters, a "Z" is
appended to the end of one of the names. The rightmost
name will be modified.
- If the name has exactly eight characters, the last
character of the rightmost name is changed to a Z (or, if
that character is already a Z, to an X).
- If blank names are encountered, these will be changed to
Zxxx where "xxx" is a sequence number (i.e., if there are
three blank names encountered, they will be set to Z1, Z2,
and Z3).
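The 2020/08 name clean-up rules can be sketched in Python as follows.
This is an illustration only. The function name is hypothetical, the
order of the steps (strip, then truncate to eight characters, then
resolve duplicates from left to right) is an assumption, and the
sketch applies the duplicate rule once without checking whether the
modified name collides again.

```python
import re


def clean_names(raw_names):
    # Keep only letters, digits, and underscores, then keep 8 characters.
    cleaned = [re.sub(r"[^A-Za-z0-9_]", "", n)[:8] for n in raw_names]
    result, seen, blanks = [], set(), 0
    for name in cleaned:
        if not name:
            # Blank names become Z1, Z2, ... in order of appearance.
            blanks += 1
            name = f"Z{blanks}"
        elif name in seen:
            if len(name) < 8:
                name += "Z"  # room left: append a Z
            else:
                # Exactly 8 characters: overwrite the last one.
                name = name[:-1] + ("X" if name[-1] == "Z" else "Z")
        seen.add(name)
        result.append(name)
    return result


header = ["Y", "Y", "", "Pressure (kPa)", "Pressure (psi)", ""]
names = clean_names(header)
```

For this hypothetical header, the second "Y" and the second
"Pressure" are the rightmost duplicates, so they are the ones that
are modified.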
READING FIXED COLUMNS
By default, Dataplot performs free format reads. That is,
you do not need to line up the columns neatly. You do need
to provide one or more spaces (tabs, commas, colons, semi-colons,
parentheses, or brackets can be used as well) between data fields.
Many data files will contain fixed fields. There are several reasons
you may want or need to take advantage of these fixed fields rather
than using a free format read.
- If your data fields do not contain spaces (or some other
delimiter) between data columns, you need to tell
Dataplot how to interpret the columns.
- In some cases, you may only want to read selected
variables in the data file.
- Using a formatted read can significantly speed up the reading
of the data. If you have small or moderate size data files
(say 500 rows or fewer), this is really not an issue. However,
if you are reading 50,000 rows, you can significantly speed up
the read by specifying the format.
- If the data fields have unequal lengths, Dataplot will not
interpret the data file correctly with a free format read.
It assigns the data items in the order they are encountered
to the variable names in the order they are given. Dataplot
does not try to guess if a data item is missing based on the
columns.
The issue of unequal lengths is discussed in detail in the
next section.
There are two basic cases for fixed fields.
- The data fields are justified by the decimal point.
In this case, you can use the SET READ FORMAT
command to specify a Fortran-like format to read the file.
Enter HELP READ FORMAT for details.
Using a formatted read is significantly faster than a
free format read.
- Many programs will write ASCII files with fixed columns,
but the data fields will be either left or right justified
rather than lined up by the decimal point.
In this case, you can use a special form of the
COLUMN LIMITS command that was introduced with the
January, 2004 version. Normally, the first and last columns
to read are specified. However, you can now enter variables for
the lower and upper limits as in the following example:
LET LOWER = DATA 1 21 41
LET UPPER = DATA 10 30 50
COLUMN LIMITS LOWER UPPER
That is, if variables rather than parameters are specified,
separate column limits are specified for each data field.
In this case, the first data field is between columns
1 and 10, the second field is between columns 21 and 30, and
the third field is between 41 and 50.
When this syntax is used, only one variable is read for
each specified field. If the field is blank, then this is
interpreted as a missing value.
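The variable form of COLUMN LIMITS can be pictured with the following
Python sketch. It is an illustration of the described behavior, not
Dataplot code. The function name is hypothetical, and the value used
for a blank field (zero here) is a parameter of the sketch.

```python
def read_fixed_fields(line, lower, upper, missing=0.0):
    """Read one value per (lower, upper) column pair; columns are 1-based."""
    values = []
    for lo, hi in zip(lower, upper):
        field = line[lo - 1:hi].strip()
        # A blank (or absent) field is treated as a missing value.
        values.append(float(field) if field else missing)
    return values


# Fields in columns 1-10 and 21-30; columns 41-50 are absent.
line = "      1.50" + " " * 10 + "     22.25"
values = read_fixed_fields(line, [1, 21, 41], [10, 30, 50])
```

Exactly one value is produced per field, so a missing entry does not
shift later values into the wrong variable.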
READING VARIABLES OF UNEQUAL LENGTH
Dataplot normally expects all variables to be of equal length.
If some variables have missing rows, this can have undesired
results. Dataplot will assign the first value read to the
first variable name, the second value to the second variable, and
so on. If fewer values than variables are specified, then variables
that have no data values are not read at all (even if they have
values for other rows).
If you have a data file where the columns have unequal lengths,
you can do one of the following things.
- Pick some value to represent a missing value and fill
in missing data points with that value. After reading
the data, you can use a RETAIN command to remove them.
For example, if you use -99 to signify a missing value,
you can enter
Alternatively, you can use a SUBSET clause on subsequent
plot and analysis commands.
- Use the variable form of the COLUMN LIMITS command as
described above. By default, when a blank field is
encountered, it is set to zero. You can specify the
value to use by entering the command
SET READ MISSING VALUE <value>
This option depends on having consistent columns for
each of the data fields.
- If your data has both columns of unequal length and
inconsistent columns for given data fields, an alternative
is to use a comma delimited data file. That is, separate
data values with a comma. If there is no data between
successive commas, this is treated as a missing value. The
default is to assign a value of zero. Alternatively, you
can use the SET READ MISSING VALUE command described above.
You can specify a delimiter other than a comma with the
command
SET READ DELIMITER <character>
The variable form of the COLUMN LIMITS, the
SET READ MISSING VALUE, and the SET READ DELIMITER commands
were introduced in the January, 2004 version. The
interpretation of successive commas as a missing value was
also introduced in the January, 2004 version.
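The difference between a free format read and a comma delimited read
with missing values can be sketched in Python. This is only an
illustration. The function name is hypothetical, and its `delimiter`
and `missing` parameters stand in for the SET READ DELIMITER and
SET READ MISSING VALUE settings.

```python
def parse_delimited(line, delimiter=",", missing=0.0):
    """Split on the delimiter; an empty field becomes the missing value."""
    return [float(f) if f.strip() else missing for f in line.split(delimiter)]


# Free format: the gap is invisible, so only two values are seen.
free_format = [float(tok) for tok in "1.0      3.5".split()]

# Comma delimited: successive commas mark the missing second value.
with_default = parse_delimited("1.0,,3.5")
with_flag = parse_delimited("1.0;;3.5", delimiter=";", missing=-99.0)
```

The free format result has two values, so the 3.5 would be assigned
to the second variable rather than the third.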
READING DATA WITH CHARACTER FIELDS
Dataplot has not previously supported character data. The one
exception is that you could read row labels with the READ ROW LABEL
command (enter HELP READ ROW LABEL for details). If character data
were otherwise encountered, Dataplot would generate an error message
and not read the data file correctly.
With the January 2004 version, we have introduced some limited
support for character data. Specifically, we have added the command
SET CONVERT CHARACTER <ON/CATEGORICAL/IGNORE/ERROR>
Setting this to ERROR will continue the current Dataplot action of
reporting character data as an error. This is recommended for the
case when a file is supposed to contain only numeric data and the
presence of character data is in fact indicative of an error in the
data file.
Setting this to IGNORE will instruct Dataplot to simply ignore any
fields containing character data. This can be useful if you simply
want to extract the numeric data fields in the file without
entering COLUMN LIMITS or SET READ FORMAT commands.
Setting this to ON will read character fields and write them to the
file "dpzchf.dat". Note that Dataplot saves numeric data
"in memory" for fast access. Since character data has limited
use in Dataplot, we have decided to save character data
externally to minimize memory requirements. Dataplot keeps a
separate name table for the character data fields (the names for
character variables are stored in the file "dpzchf.dat").
Setting this to CATEGORICAL is similar to ON. However, CATEGORICAL
will also create a coded numeric variable in addition to
the character variable. The numeric variable is useful for computing
purposes while the character variable is used for labeling purposes.
There are some restrictions on when Dataplot will try to
read character data:
- This only applies to the variable read case. That
is, READ PARAMETER and READ MATRIX will ignore
character fields or treat them as an error.
- Dataplot will only try to read character data from
a file. When reading from the keyboard (i.e., when
READ is specified with no file name), character data
will be ignored when a SET CONVERT CHARACTER ON is
specified.
Note: The 2020/01 version has removed this restriction.
You can now read character data from the terminal.
- This capability is not supported for the SERIAL READ
case.
- The SET READ FORMAT command does not accept the
"A" format specification for reading character
fields.
- A maximum of 20 character variables will be saved.
- A maximum of 24 characters for each character variable
will be saved.
- The character variables from at most one data file
will be saved in a given session.
Some of these restrictions may be addressed in subsequent
releases of Dataplot.
Currently, Dataplot has limited support for character variables.
Specifically,
- The row label can be used for the plot character by
entering the command
- You can convert a character variable to a coded numeric
variable with the command
LET Y = CHARACTER CODE IX
LET Y = ALPHABETIC CHARACTER CODE IX
with IX denoting the name of the character variable. These
commands assign a numeric value to each unique name in
the character variable.
For the CHARACTER CODE case, the coding is from 1 to K where
K is the number of unique values. The order is based on
the order these values are found in the file.
For the ALPHABETIC CHARACTER CODE case, the coding is from
1 to K where K is the number of unique values. The codes are
assigned in alphabetical order.
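The two coding schemes can be illustrated with a short Python sketch.
The function names are hypothetical and the sample values are made up
for the illustration.

```python
def character_code(values):
    """Code 1..K in the order the unique values are first encountered."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values]


def alphabetic_character_code(values):
    """Code 1..K in alphabetical order of the unique values."""
    codes = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values]


ix = ["low", "high", "mid", "high", "low"]
by_appearance = character_code(ix)
by_alphabet = alphabetic_character_code(ix)
```

Both results use the codes 1 to 3 for the three unique values. Only
the assignment of codes to values differs.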
We anticipate additional use of character variables in subsequent
releases of Dataplot.
If your character fields contain non-numeric/non-alphabetic characters,
then it is recommended that the character fields be enclosed in
quotes. When Dataplot encounters a quote (either a single or double
quote), it interprets everything until a matching quote is found
as part of that character field. If the quotes are not used,
then spaces, tabs, parentheses, brackets, colons, and semi-colons
are interpreted as delimiters that signify the end of that data item.
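The effect of quoting can be sketched in Python. This is a simplified
illustration, not Dataplot's parser. The function name is
hypothetical, and the delimiter set is limited to the characters
listed in the paragraph above.

```python
def split_fields(line):
    """Split a line into fields, keeping quoted text together."""
    delimiters = set(" \t()[]:;")
    fields, current, quote = [], [], None
    for ch in line:
        if quote:
            if ch == quote:
                # Matching quote found: the quoted field is complete.
                fields.append("".join(current))
                current, quote = [], None
            else:
                current.append(ch)
        elif ch in "'\"":
            quote = ch
        elif ch in delimiters:
            if current:
                fields.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        fields.append("".join(current))
    return fields


quoted = split_fields('12 "New York: NY" 3.5')
unquoted = split_fields("12 New York: NY 3.5")
```

Without the quotes, the space and the colon split the character
field into three separate items.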
READING ROW ORIENTED DATA
Dataplot assumes a column oriented format. That is, a row of
data represents a single record (or case) and a column of data
represents a variable. If a data file has a row orientation, then
this is reversed. A row of data represents a variable and a column
of data represents a record (or case).
The following example shows one way of correctly reading the data
into Dataplot. Suppose that your data file contains five rows with
each row corresponding to a single variable. You can do the following:
LOOP FOR K = 1 1 5
ROW LIMITS K K
SERIAL READ FILE.DAT X^K
END OF LOOP
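The effect of the loop above can be pictured with the following
Python sketch, which reads each row of a file as one variable. It
only illustrates the idea. The function name is hypothetical, and the
names X1, X2, and so on mirror the X^K substitution in the loop.

```python
def read_row_oriented(lines):
    """Treat each row as one variable named X1, X2, ..."""
    variables = {}
    for k, line in enumerate(lines, start=1):
        variables[f"X{k}"] = [float(tok) for tok in line.split()]
    return variables


data = read_row_oriented(["1 2 3", "4 5"])
```

Because each row is read on its own, the variables do not need to
have the same length.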
COMMENT LINES IN DATA FILES
It is sometimes convenient to include comments in data files.
If these comments are contained at the beginning of the file, then
the SKIP command can be used. To have Dataplot check for comment
lines in the data file, enter the command
The default comment character is a ".". That is, any line starting
with a ". " is treated as a comment line and ignored. To specify
a different comment character, enter the command
with denoting the desired comment character.
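The comment check can be sketched in Python as follows. This is an
illustration under the assumption that a comment line is one that
starts with the comment character followed by a space, as described
above. The function name is hypothetical.

```python
def drop_comment_lines(lines, comment_char="."):
    """Remove lines that start with the comment character and a space."""
    prefix = comment_char + " "
    return [ln for ln in lines if not ln.startswith(prefix)]


lines = [". header note", "1 2", ".5 3", ". another note", "4 5"]
kept = drop_comment_lines(lines)
```

Requiring the trailing space means a data line such as ".5 3" is not
mistaken for a comment.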
EXCEL FILES
At the current time, Dataplot does not support the
direct reading of Excel data files. We are planning to add
this capability in a future release of Dataplot. Until that
time, you need to save the data in Excel to an ASCII file and
read that ASCII file into Dataplot.
Excel provides the following options for writing ASCII data
files:
- Formatted text (space delimited) (.PRN extension)
This format will use consistent columns for the data fields.
The variable form of the COLUMN LIMITS command can be used
when the data columns have unequal length.
Character fields will often not have the separating space. The
variable form of the COLUMN LIMITS command can be used in this
case as well.
- CSV (Comma delimited) (.CSV extension)
This format will separate data fields with a single comma.
Missing data is represented with successive commas. Dataplot
can now (as of the January 2004 version) handle this correctly.
- Text (Tab delimited) (.TXT extension)
Text (MS-DOS) (.TXT extension)
These files will separate data fields with a tab character.
Note that Dataplot converts all non-printing characters
(including tabs) to a single space character.
This format is not appropriate for data containing variables
with unequal lengths since it will not generate consistent
columns for the data fields. Use either the space delimited
or comma delimited file for that case.
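To see how missing cells appear in a comma delimited file, the
following Python sketch writes a small table with the standard csv
module. It is only an illustration of the file layout, not of Excel
itself. The function name and the sample rows are hypothetical.

```python
import csv
import io


def to_csv_text(rows):
    """Write rows as comma delimited text; None becomes an empty field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


text = to_csv_text([[1, None, 3], [4, 5, None]])
```

The missing cells show up as successive commas or as a trailing
comma.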
Other spreadsheets will typically have similar options. However, the
details may vary depending on the specific spreadsheet program.
Note that directly exporting the Excel data to an ASCII file tends
to work well when the data sheet is "clean". That is, you basically
have a single rectangular set of data cells. If your spreadsheet
contains graphs or multiple rectangular areas of data, then the
generated ASCII file will tend to be difficult to work with. In
this case, it is recommended that you either copy the relevant data
to a clean sheet or paste it into an ASCII editor (e.g., Notepad or
Wordpad) and save the file from there.
The next section discusses cut and paste within Dataplot. In many
cases, this may provide the simplest way to retrieve data from a
spreadsheet.
Note 2020/02: Dataplot added a READ EXCEL command. This command
utilizes Python (and specifically the Pandas package) to read Excel
files. Enter HELP READ EXCEL for details.
CUT AND PASTE
In some contexts, it may be desirable to simply cut and paste relevant
data into Dataplot. For example, this can provide an alternative way
to import data from spreadsheets and other statistical software.
How well cut and paste works is dependent on the operating system and
the compiler used to build Dataplot. If you do this, we recommend
limiting it to the case of numeric data.
Some specific cases are
- Linux systems with the gfortran compiler
As an example, suppose that the system clipboard contains three
columns of data. You can then do something like
READ Y X1 X2
<paste the contents of the clipboard>
END OF DATA
That is, you do a terminal read, paste your data, and then
enter an END OF DATA command to terminate the READ.
- Windows systems with the Intel compiler
For earlier versions of the compiler, the operations described
for Linux/gfortran worked for this environment as well. However,
for the version of the compiler currently being used, this is
no longer working. Some testing has shown that it works for one
or two lines, but Dataplot crashes for more than that. We are
investigating this to see if we can get it working again.
- Tcl/Tk GUI
The spreadsheet in the Dataplot Tcl/Tk GUI does not accept
paste operations in the data spreadsheet.
Note: The 2014/12 version of Dataplot added numerous commands for
reading from (and writing to) the clipboard.
The initial implementation is for the Windows environments, although
this should be extended to Linux and Mac OS X platforms in subsequent
releases.
COMMA AS DECIMAL POINT
Dataplot follows the United States convention where the decimal
point is the period ".". Some locales may use a different
character to denote the decimal point. In particular, some
countries use the comma ",".
To allow Dataplot to read files that use a character other than
the "." for the decimal point, enter the command
SET DECIMAL POINT <value>
where <value> denotes the character that specifies the decimal
point.
Note this support is fairly limited. Specifically, it applies
to free-format reads (i.e., no SET READ FORMAT command has been
entered). In addition,
- This option is not supported for the WRITE command. WRITE
will always use a period for the decimal point.
- Dataplot alphanumeric output (e.g., the output from the FIT
command) is generated with the period as the decimal point.
- As mentioned above, if you read your data with a
SET READ FORMAT command, the data must use the period
for the decimal point.
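The idea behind SET DECIMAL POINT for free-format reads can be
sketched in Python. This is an illustration only. The function names
are hypothetical, and the sketch assumes that fields are separated by
spaces when the comma serves as the decimal point.

```python
def parse_number(field, decimal_point="."):
    """Convert a field that uses the given decimal point character."""
    return float(field.replace(decimal_point, "."))


def read_free_format(line, decimal_point=","):
    """Read space-separated numbers written with a non-period decimal point."""
    return [parse_number(tok, decimal_point) for tok in line.split()]


values = read_free_format("3,14 2,5 10")
```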
MISSING VALUES AND UNDEFINED NUMBERS
Some software programs will have special characters to denote
missing values or undefined values (e.g., the result of trying
to divide by 0).
In particular, Unix/Linux software often uses "nan" to denote an
undefined number. If Dataplot encounters an "nan" in a numeric
field, it will convert it to the Dataplot "missing value". The "nan"
search is not case sensitive (i.e., it will check for "NAN", "NaN",
etc.). You can specify what Dataplot will use for the missing value
by entering the command
SET READ MISSING VALUE <value>
where <value> is a numeric value.
Missing value flags are specific to individual programs. You can
specify a character string that denotes a missing value with the
command
SET DATA MISSING VALUE <value>
where <value> is a string with 1 to 4 characters. If Dataplot
encounters <value> in a numeric field, it will convert it to the
Dataplot "missing value". The missing value string is not case
sensitive. You can specify what Dataplot will use for the missing
value by entering the command
SET READ MISSING VALUE <value>
where <value> is a numeric value.
The above discussion was for missing numeric data. If you have
missing data for a character field, you can specify the string
that will denote missing data by entering the command
SET READ CHARACTER MISSING VALUE
The default string is ZZZZNULL.
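The handling of "nan" and of a user-specified missing value string in
numeric fields can be sketched in Python. This is an illustration of
the described behavior. The function name is hypothetical, its
parameters stand in for the SET DATA MISSING VALUE and
SET READ MISSING VALUE settings, and the values shown are made up.

```python
def to_number(field, missing_string=None, missing_value=0.0):
    """Convert a numeric field, mapping missing-value flags to one number."""
    token = field.strip().lower()
    if token == "nan":
        # The "nan" check is not case sensitive.
        return missing_value
    if missing_string is not None and token == missing_string.lower():
        return missing_value
    return float(field)


fields = ["1.5", "NaN", "na", "7"]
values = [to_number(f, missing_string="NA", missing_value=-99.0) for f in fields]
```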
READING DATE AND TIME FIELDS
Date and time fields will typically have syntax like
Dataplot treats the "/" and ":" as indicating character fields.
Depending on the SET CONVERT CHARACTER setting, this will cause an
error, cause the field to be ignored, or cause the field to be
read as a character variable.
The following commands were added (2016/06) to help deal with date and
time fields.
SET DATE DELIMITER <character>
SET TIME DELIMITER <character>
Although Dataplot does not have explicit date or time variables,
these commands allow the components of date and time fields to
be read as separate numeric variables. For example,
SET DATE DELIMITER /
SET TIME DELIMITER :
READ YEAR MONTH DAY HOUR MIN SEC
2016/06/22 23:19:03
END OF DATA
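The effect of the two delimiter settings on the line in the example
above can be sketched in Python. This is an illustration only, and
the function name is hypothetical.

```python
def split_date_time(line, date_delim="/", time_delim=":"):
    """Treat the date and time delimiters as field separators."""
    spaced = line.replace(date_delim, " ").replace(time_delim, " ")
    return [float(tok) for tok in spaced.split()]


year, month, day, hour, minute, sec = split_date_time("2016/06/22 23:19:03")
```

Each component ends up in its own numeric variable, matching the six
names on the READ command.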
READING IP ADDRESSES
IP addresses typically have a syntax like
By default, Dataplot will generate an error when trying to read a
field of this type. To address this, you can enter the command
If this switch is ON, Dataplot will scan the line and if a field is
encountered that contains more than one period ".", Dataplot will
convert these periods to spaces before parsing the line.
The default is OFF since this adds additional processing time to
the READ and most data sets do not contain IP addresses.
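The scan described above can be sketched in Python. This is an
illustration only. The function name is hypothetical, fields are
assumed to be separated by spaces, and the address shown is a made-up
example.

```python
def expand_ip_fields(line):
    """Replace periods with spaces in fields that have more than one period."""
    out = []
    for field in line.split():
        if field.count(".") > 1:
            field = field.replace(".", " ")
        out.append(field)
    return " ".join(out)


converted = expand_ip_fields("192.0.2.1 3.5 42")
```

Ordinary decimal numbers have at most one period, so they are left
unchanged.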
READING MONETARY DATA
Monetary data may sometimes be given as
The "$" and "," in these numeric fields will cause problems. The
"$" will be treated as a non-numeric value (depending on other
SET commands, this will be treated as an error or the numeric field
will be read as a character field). The comma is typically treated
as a field delimiter. If you have this kind of data, enter the
commands
SET READ DOLLAR SIGN IGNORE ON
SET READ COMMA IGNORE ON
To reset the defaults, enter
SET READ DOLLAR SIGN IGNORE OFF
SET READ COMMA IGNORE OFF
Note that if you enter the SET READ COMMA IGNORE ON command, the
comma will no longer be treated as the delimiter. Dataplot cannot
currently handle the case where the comma is used both for monetary
data and also as a field delimiter.
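The effect of ignoring the dollar sign and the comma can be sketched
in Python. This is an illustration only. The function names are
hypothetical, the amounts are made up, and fields are assumed to be
separated by spaces because the comma is no longer a delimiter.

```python
def parse_money(field):
    """Drop dollar signs and commas, then convert to a number."""
    return float(field.replace("$", "").replace(",", ""))


def read_money_line(line):
    """Read space-separated monetary values from one line."""
    return [parse_money(tok) for tok in line.split()]


amounts = read_money_line("$1,234.56 $12.00 7")
```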