Indexing your own collection

List of Tasks for Building a Collection

Building your data
1. Format your data using SGML.
  - A minimal set of tags is defined in the "Data files" section below.
2. Create a dtd file to describe your data.
  - Verify your data conforms to the SGML standard by running sgmls on your data first.
    > sgmls -s <dtd file> <data file> >& sgmls.errors
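The validation step could also be scripted. The following is a minimal sketch, not part of PRISE: the helper names `sgmls_command` and `validate` are hypothetical, and it assumes `sgmls` is on the PATH and that an empty `sgmls.errors` file means the data passed.

```python
import subprocess


def sgmls_command(dtd_file, data_file):
    """Build the argument list for the validation run shown above."""
    return ["sgmls", "-s", dtd_file, data_file]


def validate(dtd_file, data_file, error_log="sgmls.errors"):
    """Run sgmls and capture its output in error_log.

    Returns True when the log is empty. Treating an empty log as success
    is an assumption of this sketch, not documented PRISE behavior.
    """
    with open(error_log, "w") as log:
        # ">&" in csh sends both stdout and stderr to the file.
        subprocess.run(sgmls_command(dtd_file, data_file),
                       stdout=log, stderr=subprocess.STDOUT, check=False)
    with open(error_log) as log:
        return log.read().strip() == ""
```

For example, `validate("cranfield.dtd", "data/cranfield.sgml")` would run the same command as the line above and leave any messages in `sgmls.errors`.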
Building user defined files
1. Create a "list" file.
2. Create an "sgmls.actions" file.
3. Create a "title_tags" file.
4. Create a "commonwords" file. (default file provided)
5. Create an "options.spec" file. (default file provided)
6. Create a "fields.spec" file.
Running build.script.sh

build.script.sh runs the PRISE indexing programs on a given set of data. It generates all the files needed for testing with the Z39.50 client/server, except the user-generated files listed above.
To index the Cranfield data from the cranfield subdirectory, run:
```
> build.script.sh . list
```
build.script.sh takes the work (collection) directory as its first parameter, usually the current directory ".", and the list file as its second parameter. Its usage message reads:
```
------------------------------------------------------------------------
Usage: build.script.sh <work_dir> <listfile> [-d <dir>] [-m] 

Usage: <work_dir> <list_file> [-m] 

        <work_dir> is the path/directory where your index is created.
        <list_file> of the form path/filename listing all data 
                    files. The path must be relative to the work_dir 
        -m    lists other special options
------------------------------------------------------------------------
```

Identifying your new collection to a zserver

You can make your newly indexed collection available to a zserver by editing the "db_location.spec" file located in the "defaultdir" directory for that zserver, assuming the collection index files are accessible to the zserver. Comments at the top of the "db_location.spec" file explain the format of the contents.

Data files

A minimal set of SGML tags for PRISE is
1. <DOC> </DOC> telling PRISE this is the START OF RECORD
2. <TEXT> </TEXT> telling PRISE this is a SECTION.
Within the data files they would look like:
```
<DOC>
<TEXT>
    Actual text
</TEXT>
</DOC>
```
dtd file
To build a dtd file, consult a reference on SGML. A Cranfield example dtd file is reproduced below under "Sample dtd file".

User generated files

list file
The list file creates a mapping of data files to dtd files. It contains two columns: names of data files and names of dtd files. The data files must be listed with their full pathname and the dtd file path must be relative to the work_dir. Ordinarily the dtd is in the work_dir so no pathname is needed.
See the Cranfield sample list file below.
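For collections with many data files, the list file could be generated rather than typed. A minimal sketch follows; `format_list_file` is a hypothetical helper, and the rule that paths may not contain whitespace is an assumption based on the two-column layout.

```python
def format_list_file(pairs):
    """Return list-file text: one 'data_file dtd_file' pair per line."""
    lines = []
    for data_file, dtd_file in pairs:
        # Assumption: columns are whitespace-separated, so paths
        # containing whitespace cannot be represented.
        if " " in data_file or " " in dtd_file:
            raise ValueError("paths must not contain whitespace")
        lines.append(f"{data_file} {dtd_file}")
    return "\n".join(lines) + "\n"
```

Calling `format_list_file([("data/cranfield.sgml", "cranfield.dtd")])` yields a one-line list file for a single data file.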
sgmls.actions
The sgmls.actions file maps tags to indexing actions. The current actions are:
1. START OF RECORD - indicating the start of a record.
2. SECTION - indicating the data following the tag will be indexed.
3. DEFAULT - indicating the data following the tag will not be indexed.
The sgmls.actions file contains two columns: the first holds tags and the second holds the action for each tag. All actions must be in uppercase. Begin tags are prefixed with an open paren and end tags with a close paren; e.g. <DOC> and </DOC> are written (DOC and )DOC, respectively. One tag must be selected as START OF RECORD, and end tags are usually labelled DEFAULT. (NOTE: The program docmap.c currently requires that the START OF RECORD tag be DOC. We intend to remove this restriction in the near future.) Terms occurring after a tag that is not labelled SECTION are not indexed. Tags not appearing in the sgmls.actions file are automatically set to DEFAULT.
See the Cranfield sample sgmls.actions file below.
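The rules above can be checked before indexing. The sketch below is illustrative and is not the parser PRISE uses: `parse_actions` and `action_for` are hypothetical names, and reading "one tag must be selected as START OF RECORD" as "exactly one" is an assumption.

```python
ACTIONS = {"START OF RECORD", "SECTION", "DEFAULT"}


def parse_actions(text):
    """Map each tag such as '(DOC' or ')DOC' to its action."""
    actions = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # First column is the tag; the rest of the line is the action.
        tag, action = line.split(None, 1)
        action = " ".join(action.split())
        if action not in ACTIONS:
            raise ValueError(f"unknown action for {tag}: {action!r}")
        actions[tag] = action
    starts = [t for t, a in actions.items() if a == "START OF RECORD"]
    if len(starts) != 1:
        # Assumption: exactly one START OF RECORD tag is expected.
        raise ValueError("exactly one tag must be START OF RECORD")
    return actions


def action_for(actions, tag):
    """Tags missing from the file fall back to DEFAULT."""
    return actions.get(tag, "DEFAULT")
```

Because the membership test is case-sensitive, a lowercase action such as `section` is rejected, which matches the uppercase requirement.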
title_tags
An ASCII file containing the begin and end tags used for the title of every document. If more than one set of tags is given, they will be used in order. (e.g. If the first tag does not appear in a document the second set of tags will be used and so on until the list has been exhausted.) NOTE: At least one set of tags MUST appear in each document.
See the Cranfield sample title_tags file below.
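The fallback order described above can be illustrated as follows. This is a sketch of the idea rather than the doctitles implementation: `parse_title_tags` and `find_title` are hypothetical names, and plain substring search stands in for whatever tag matching PRISE performs.

```python
def parse_title_tags(text):
    """Return (begin_tag, end_tag) pairs in file order."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]


def find_title(document, tag_pairs):
    """Return the text between the first tag pair present in the document."""
    for begin, end in tag_pairs:
        start = document.find(begin)
        if start == -1:
            continue  # this pair is absent; try the next one in the list
        start += len(begin)
        stop = document.find(end, start)
        if stop != -1:
            return document[start:stop].strip()
    # At least one set of tags must appear in each document.
    raise ValueError("no title tags found in document")
```

With the pairs `<TITLE> </TITLE>` and `<TEXT> </TEXT>`, a document without a TITLE element takes its title from the TEXT element.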
commonwords
An ASCII file containing the stop words (one word per line, all lowercase) used when parsing terms. Words occurring in this list are not parsed and will not occur in the index. A default stopword list is provided with 23 commonly occurring terms.
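The effect of the stop list can be sketched in a few lines. The names `load_commonwords` and `remove_commonwords` are hypothetical, and lowercasing each term before the lookup is an assumption suggested by the all-lowercase file format.

```python
def load_commonwords(text):
    """Read one lowercase stop word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_commonwords(terms, commonwords):
    """Drop terms that appear in the stop list, comparing in lowercase."""
    return [t for t in terms if t.lower() not in commonwords]
```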
options.spec
An ASCII file containing indexing and searching options. All lines beginning with "#" symbol are comments and all blank lines are ignored. A default options.spec is provided. An explanation of all the options is provided in the top comments of the default options.spec.
See the Cranfield sample options.spec file below.
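A reader for this format might look like the sketch below. It is not the PRISE option parser: `parse_options` is a hypothetical name, and the rule that every token not starting with "-" is a value of the preceding option is an assumption.

```python
def parse_options(text):
    """Return {option: [values]} for an options.spec-style text."""
    options = {}
    current = None
    for line in text.splitlines():
        line = line.split("#", 1)[0]      # a comment runs to end of line
        for token in line.split():
            if token.startswith("-"):
                current = token
                options[current] = []
            elif current is None:
                raise ValueError(f"value {token!r} appears before any option")
            else:
                # Assumption: a value belongs to the most recent option.
                options[current].append(token)
    return options
```

Flags such as `-notpi` end up with an empty value list, while `-lexnmzn smart` maps to `["smart"]`.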
fields.spec
A required ASCII file containing a mapping of user field names to SGML tags and Z39.50 "use" attribute numbers. This file must contain at least the "_docid_" tag for use in relevance feedback. The file must also contain an entry for each SGML-tagged field to which the user will be allowed to restrict a search.

All lines beginning with "#" symbol are comments and all blank lines are ignored. Tags with names beginning and ending with "_" are reserved for system use. Apart from this restriction, tag names may contain any non-whitespace characters. The first line of the fields.spec file is the number of entries to follow. An entry consists of the following information:
1. User's name for a field in a tagged document
2. Start SGML tag.
3. End SGML tag.
4. Z39.50 (use) attribute number associated with the field.
See the Cranfield sample fields.spec file below.
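The structure described above (an entry count followed by entries) can be checked with a short script. This sketch assumes the layout used in the sample fields.spec file, where each entry starts with an `-entry` marker followed by attribute-value pairs. `parse_fields_spec` is a hypothetical name, and treating a count mismatch as an error is an assumption.

```python
def parse_fields_spec(text):
    """Return a list of entry dicts from fields.spec-style text."""
    tokens = []
    for line in text.splitlines():
        # Strip comments, then split the remainder on whitespace.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens:
        raise ValueError("missing entry count")
    count = int(tokens[0])               # the count must be the first token
    entries = []
    i = 1
    while i < len(tokens):
        if tokens[i] == "-entry":
            entries.append({})
            i += 1
        else:
            if not entries or i + 1 >= len(tokens):
                raise ValueError(f"unexpected token {tokens[i]!r}")
            # Store "-fieldname x" as {"fieldname": "x"}.
            entries[-1][tokens[i].lstrip("-")] = tokens[i + 1]
            i += 2
    if len(entries) != count:
        raise ValueError(f"expected {count} entries, found {len(entries)}")
    if not any(e.get("fieldname") == "_docid_" for e in entries):
        raise ValueError("the _docid_ entry is required")
    return entries
```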

Sample list file:

data/cranfield.sgml cranfield.dtd

Sample dtd file:

<!DOCTYPE CRAN[

<!--   SGML DESCRIPTIONS         -->
<!--   +     Required and repeatable element -->
<!--   ?     Optional element                -->
<!--   *     Optional and repeatable element -->
<!--   ,     elements must follow in this order  -->
<!--   |     "or" connector (pick one element)   -->
<!--   &     "and" connector (all must occur in any order)   -->

<!--       ELEMENT     MIN  CONTENT -->
<!ELEMENT  DOCNO       - -  (#PCDATA) >
<!ELEMENT  TEXT        - -  (#PCDATA) >
<!ELEMENT  AUTHOR      - -  (#PCDATA) >
<!ELEMENT  BIBLIO      - -  (#PCDATA) >
<!ELEMENT  TITLE       - -  (#PCDATA) >
<!ELEMENT  DOC         - -  (DOCNO* &  TITLE* & AUTHOR* &
                             BIBLIO* & TEXT*) >
<!ELEMENT  CRAN        O O  (DOC+)>

<!ENTITY   amp      "&" >
<!ENTITY #DEFAULT SYSTEM >

]>

Sample sgmls.actions file:

(AUTHOR    SECTION
(BIBLIO    SECTION
(DOC       START OF RECORD
(DOCNO     DEFAULT
(TITLE     SECTION
(TEXT      SECTION
)AUTHOR    DEFAULT
)BIBLIO    DEFAULT
)DOC       DEFAULT
)DOCNO     DEFAULT
)TITLE     DEFAULT
)TEXT      DEFAULT

Sample title_tags file:

<TITLE>  </TITLE>
<TEXT>   </TEXT>

Sample options.spec file:


#--------------------------------------------------------------------------
# This file specifies options associated with the new_wsj database
# and must be located in the directory names in the database location file
#
# Options and values are strings of characters containing no whitespace.
# Options begin with a '-'.  
# Comments begin with a `#` anywhere on a line and end with the next NEWLINE 
# or end-of-file, whichever comes first.
# NEWLINEs are otherwise treated as whitespace.
#
# The following options are available:
#
# -eval            #
# -noeval
#
# -sortbyidf       # Sort qterms in search engine by idf 
# -nosortbyidf
#
# -tracetpi        # Term positional information
# -notracetpi
#
# -initonly        # Initialize only 
# -noinitonly
#
# -showquery       # Search engine prints out the qterms
# -noshowquery
#
# -qtermdup        # Search engine saves duplicate qterms (affects weights)
# -noqtermdup
#
# -qrefine         # Search engine refines query
# -noqrefine
#
# -docnos          # Display document numbers
# -nodocnos
#
# -lexnmzn type    # Lexical normalization function 
#                  # Normalization can include case folding, stemming
#                  # and possibly other functionality (accent stripping..)
#                  # There is a choice of stemmers, all doing case folding:
#                  # smart porter spanish french german none
#                  # Selecting the following passes the token untouched:
#                  # ident
#
# -errlog          # Error log file
#
# -showlog         # Show log file for status
#
# -tpi file        # Term positional information file
# -notpi
#
# -scorefn fn      # Score function used by search engine to rank documents
#                  # Choices are: tfidf, binary, idf, tf1idf, tf2idf,
#                  # tf3idf, tfidf_lln, tfplusidf, release_1, tf, bm25idf
#
# -weightfn fn     # Weight function used by indexer to rank documents
#                  # Choices are: binary, idf, tf1idf, tf2idf, tf3idf,
#                  # tfidf, tfdf, tf, tfidf_lln, inversetfidf, local_idf,
#                  # dwf_plus_okapi, gavin, okapi1, okapi2, okapi3,
#                  # okapi4, okapi5, okapi6, okapi7, okapi8, okapi9,
#                  # dwf1, dwf2, dwf3, dwf4, dwf5, dwf6, dwf7, dwf8,
#                  # dwf9, gamma1, noise, bm25idf
#
# -rfsortfn fn     # Sorting function used by relevance feedback
#                  # where fn: TermIdf, TermIdf, TermPostrec
#                  # TermIdfWIPostrec, TermIdfFreqWIPostrec, TermIdfFreqPostrec
#                  # TermIdfFreq, Doszkocs, Porter, SmeatonvanRijsbergen
#                  # RobertsonCroft, RobertsonIdf.
#                  # Default is TermIdf
#
# -rfreweightfn fn # Weighting function used by relevance feedback
#                  # ReweightIdf sets the term's reweight to be the term's idf. In our
#                  # current method this is the equivalent to not reweighting.
#                  # number of enhanced terms to be supplied when possible
#                  # Weighting function used by relevance feedback
#                  # where fn: Croft, Bookstein, Croft, CroftNeg,
#                  # CroftNeg0, ReweightIdf, Rsj
#                  # Default is Croft
#
# -minrfterms no.  # Minimum relevance feedback terms
#                  # where no. is an integer.
#
# -maxrfterms no.  # Maximum relevance feedback terms
#                  # where no. is an integer.
#--------------------------------------------------------------------------
-notpi
-noqrefine
-lexnmzn smart                    # Lexical normalization
-scorefn tfidf
-weightfn dwf
-docnos
-errlog         /tmp/test.err     # Error log file
-showlog        /tmp/test.show    # Trace file
# Relevance feedback Control Items
# Parsed document(s) term sorting flags
-rfsortfn RobertsonIdf
# Term reweighting methods
-rfreweightfn Rsj
-minrfterms 0
-maxrfterms 20
#

Sample fields.spec file:

# This is a fields.spec file, which associates the following pieces
# of data for the CRANFIELD collection:
#
#  user's name for a field in a tagged document
#  start tag
#  end tag
#  Z39.50 (use) attribute number associated with the field
#
# This information is repeated for each field which is tagged in the text
# of the document and which the user would like to be able to search.
#
# Each entry must begin with an "-entry" attribute.
# Before the first "-entry" attribute there must be a count of the total number
# of entries
# Within an entry the attribute-value pairs may appear in any order,
# Attributes and values must be surrounded by whitespace or comments.
# Within an attribute-value pair the attribute must come first.
#
# A comment begins with a pound sign and ends with the next newline
# Blank lines are ignored.
#------------------------------------------------------------------------------

3         # this entry count must be the first token in the file
#----------------------------------------------------------------
-entry                  # Required !
-fieldname  _docid_
-starttag   <_docid_>		
-endtag	    <_docid_>
-attrnum    1032	# Doc-id reserved for system use
#----------------------------------------------------------------
-entry			# Example field
-fieldname  author
-starttag   <author>
-endtag     <author>
-attrnum    1003        # author
#----------------------------------------------------------------
-entry			# Example field
-fieldname  title
-starttag   <title>
-endtag     <title>
-attrnum    4           # title

General Algorithm for build.script.sh

build.script.sh is provided for indexing multiple data files. Below is an outline of the process and the programs it calls:

Loop on each data file
1. sgmls
2. sgmls.parser [-i] -w $PWD -p stdin
3. rel.build.tmm [-i] $PWD -
end loop
rebuild.tmm
prep
docmap
doctitles
docmpaseq
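The outline can be read as a per-file pipeline followed by a fixed sequence of post-processing programs. The sketch below only prints a plan and runs nothing. It rests on several assumptions: that the three per-file programs are connected by pipes (suggested by `-p stdin` and the trailing `-`), that sgmls is given the dtd file and data file as in the validation command, that the optional `-i` flag is omitted, and that the later programs take no arguments. `plan_build` is a hypothetical name; consult build.script.sh itself for the actual invocations.

```python
def plan_build(work_dir, list_entries):
    """Return the shell command lines of the outline, in order (dry run)."""
    commands = []
    for data_file, dtd_file in list_entries:
        # Assumed piping: sgmls output feeds the parser, then the builder.
        commands.append(
            f"sgmls {dtd_file} {data_file}"
            f" | sgmls.parser -w {work_dir} -p stdin"
            f" | rel.build.tmm {work_dir} -"
        )
    # Post-processing steps run once, after every data file is handled.
    for program in ["rebuild.tmm", "prep", "docmap", "doctitles", "docmpaseq"]:
        commands.append(program)
    return commands
```

For a list file with a single entry, the plan contains one pipeline line followed by the five program names in the order listed above.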

National Institute of Standards and Technology Home Page

Last updated: Tuesday, 01-Aug-2000 13:16:43 UTC

Date created: Monday, 31-Jul-00