> sgmls -s <dtd file> <data file> >& sgmls.errors
build.script.sh - Runs the PRISE indexing programs on a given set of data. It generates all the files needed for testing with the Z39.50 client/server, except the user-generated files mentioned above.
To index the Cranfield data from the cranfield subdirectory:
> build.script.sh . list
build.script.sh requires the work (collection) directory as its first parameter, usually the current directory ".", and the list file as its second parameter:
build.script.sh work_directory list_file
build.script.sh USAGE:
------------------------------------------------------------------------
Usage: build.script.sh <work_dir> <listfile> [-d <dir>] [-m]
Usage: <work_dir> <list_file> [-m]
   <work_dir>  is the path/directory where your index is created.
   <list_file> of the form path/filename listing all data files.
               The path must be relative to the work_dir
   -m          lists other special options
------------------------------------------------------------------------
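The usage text says that paths in the list file must be relative to the work directory. A minimal Python sketch of generating such a file follows. It assumes one path per line, which the usage text does not state explicitly; the helper name write_list_file and the default file name "list" are illustrative and not part of PRISE.

```python
import os


def write_list_file(work_dir, data_files, list_name="list"):
    """Write a list file into work_dir naming each data file.

    Assumption: the list file holds one path per line. Each path is
    rewritten relative to work_dir, as the usage text requires.
    """
    list_path = os.path.join(work_dir, list_name)
    with open(list_path, "w") as out:
        for path in data_files:
            out.write(os.path.relpath(path, start=work_dir) + "\n")
    return list_path
```

For the Cranfield collection, calling write_list_file(".", ["data/cranfield.sgml"]) would produce a file that can then be passed as the second parameter to build.script.sh.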
You can make your newly indexed collection available to a zserver by editing the "db_location.spec" file located in the "defaultdir" directory for that zserver, assuming the collection index files are accessible to the zserver. Comments at the top of the "db_location.spec" file explain the format of the contents.
<DOC>
<TEXT>
Actual text
</TEXT>
</DOC>
dtd file
sgmls.actions
The sgmls.actions file maps tags to indexing actions.
The current actions are
title_tags
An ASCII file containing the begin and end tags used for the title of
every document. If more than one set of tags is given, the sets are tried
in order: if the first set does not appear in a document, the second set
is used, and so on until the list has been exhausted. NOTE: At least
one set of tags MUST appear in each document.
Cranfield example title_tags file.
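The ordered fallback described above can be illustrated with a short Python sketch. This is not the PRISE implementation; the function name extract_title and the plain substring matching are assumptions made for illustration. The tag pairs are the ones from the Cranfield example.

```python
def extract_title(document, tag_pairs):
    """Return the text between the first tag pair found in the document.

    tag_pairs is an ordered list of (begin_tag, end_tag) tuples; later
    pairs are tried only when earlier ones are absent.
    """
    for begin, end in tag_pairs:
        start = document.find(begin)
        if start == -1:
            continue  # this begin tag is absent, try the next pair
        start += len(begin)
        stop = document.find(end, start)
        if stop == -1:
            continue  # begin tag without a matching end tag
        return document[start:stop].strip()
    # The documentation requires at least one pair per document.
    raise ValueError("no title tag pair found in document")


# Tag pairs in the order given by the Cranfield example.
TAG_PAIRS = [("<TITLE>", "</TITLE>"), ("<TEXT>", "</TEXT>")]
```

A document without a TITLE element falls through to the TEXT pair, and a document with neither raises an error, mirroring the NOTE above.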
commonwords
An ASCII file containing the stop words (one word per line, all lowercase)
used when parsing terms.
Words occurring in this list are not parsed and will not appear in the index.
A default stopword list is provided with 23 commonly occurring terms.
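As an illustration of how a stop word list of this form takes effect, here is a minimal Python sketch. The whitespace tokenizer and the lowercasing of terms before comparison are assumptions; the actual PRISE parser is not described here, and the sample words in the usage note are hypothetical rather than taken from the default list.

```python
def load_stopwords(lines):
    """Build a set of stop words from lines holding one word each."""
    return {line.strip() for line in lines if line.strip()}


def index_terms(text, stopwords):
    """Return the terms of text that would reach the index.

    Assumption: terms are split on whitespace and lowercased before
    being compared with the (all lowercase) stop word list.
    """
    return [term for term in text.lower().split() if term not in stopwords]
```

With a stop word list containing "the" and "of", the text "The flow of air" yields only the terms "flow" and "air".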
options.spec
An ASCII file containing indexing and searching options.
All lines beginning with the "#" symbol are comments, and all blank lines are ignored.
A default options.spec is provided. An explanation of all the options is provided in the top comments of the default options.spec.
Cranfield example options.spec file.
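The options.spec format can be read with very little code. The Python sketch below follows the rules stated in the comments of the example file: options begin with "-", a "#" starts a comment that runs to the end of the line, and newlines count as whitespace. The function name parse_options and the representation of valueless flags as None are choices made for this illustration, not PRISE behavior.

```python
def parse_options(text):
    """Parse options.spec text into a dict of option -> value.

    Flags without a value (for example -notpi) map to None.
    """
    tokens = []
    for line in text.splitlines():
        # Drop everything from the first '#' to the end of the line.
        tokens.extend(line.split("#", 1)[0].split())

    options = {}
    pending = None  # most recent option still waiting for a value
    for token in tokens:
        if token.startswith("-"):
            pending = token
            options[pending] = None
        elif pending is not None:
            options[pending] = token
            pending = None
        else:
            raise ValueError("value without an option: " + token)
    return options
```

Applied to the Cranfield example, this would map "-lexnmzn" to "smart" and "-errlog" to "/tmp/test.err", while "-notpi" and "-docnos" map to None.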
fields.spec
A required ASCII file containing a mapping of user field names to SGML tags
and Z39.50 "use" attribute numbers.
This file must contain at least the "_docid_" tag
for use in relevance feedback. The file must also contain an entry
for each SGML-tagged field to which the user will be allowed to restrict
a search.
All lines beginning with the "#" symbol are comments, and all blank lines are ignored. Tags with names beginning and ending with "_" are reserved for system use. Apart from this restriction, tag names may contain any non-whitespace characters. The first line of the fields.spec file gives the number of entries to follow. An entry consists of the following information:
data/cranfield.sgml
cranfield.dtd
<!DOCTYPE CRAN[
<!--                 SGML DESCRIPTIONS                         -->
<!--  +  Required and repeatable element                       -->
<!--  ?  Optional element                                      -->
<!--  *  Optional and repeatable element                       -->
<!--  ,  elements must follow in this order                    -->
<!--  |  "or" connector (pick one element)                     -->
<!--  &  "and" connector (all must occur in any order)         -->

<!--      ELEMENT  MIN  CONTENT  -->
<!ELEMENT DOCNO    - -  (#PCDATA) >
<!ELEMENT TEXT     - -  (#PCDATA) >
<!ELEMENT AUTHOR   - -  (#PCDATA) >
<!ELEMENT BIBLIO   - -  (#PCDATA) >
<!ELEMENT TITLE    - -  (#PCDATA) >
<!ELEMENT DOC      - -  (DOCNO* & TITLE* & AUTHOR* & BIBLIO* & TEXT*) >
<!ELEMENT CRAN     O O  (DOC+)>
<!ENTITY amp "&" >
<!ENTITY #DEFAULT SYSTEM >
]>
(AUTHOR SECTION
(BIBLIO SECTION
(DOC START OF RECORD
(DOCNO DEFAULT
(TITLE SECTION
(TEXT SECTION
)AUTHOR DEFAULT
)BIBLIO DEFAULT
)DOC DEFAULT
)DOCNO DEFAULT
)TITLE DEFAULT
)TEXT DEFAULT
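To make the tag-to-action mapping concrete, the Python sketch below reads entries of the form shown above into a dictionary. It assumes that each entry occupies its own line, that a leading "(" marks the start of an element, and that a leading ")" marks its end, following the notation sgmls uses in its output; that interpretation is an assumption, as is the function name parse_actions.

```python
def parse_actions(text):
    """Parse sgmls.actions text into {(event, tag): action}.

    event is "start" for lines beginning with "(" and "end" for lines
    beginning with ")". The action is the rest of the line after the tag.
    """
    actions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0] not in "()":
            raise ValueError("entry must begin with '(' or ')': " + line)
        parts = line[1:].split(None, 1)
        if len(parts) != 2:
            raise ValueError("entry needs a tag and an action: " + line)
        tag, action = parts
        event = "start" if line[0] == "(" else "end"
        actions[(event, tag)] = action
    return actions
```

Under these assumptions, the entry "(DOC START OF RECORD" is stored under the key ("start", "DOC") with the action "START OF RECORD".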
<TITLE> </TITLE>
<TEXT> </TEXT>
#--------------------------------------------------------------------------
# This file specifies options associated with the new_wsj database
# and must be located in the directory named in the database location file
#
# Options and values are strings of characters containing no whitespace.
# Options begin with a '-'.
# Comments begin with a `#` anywhere on a line and end with the next NEWLINE
# or end-of-file, whichever comes first.
# NEWLINEs are otherwise treated as whitespace.
#
# The following options are available:
#
# -eval             #
# -noeval           #
# -sortbyidf        # Sort qterms in search engine by idf
# -nosortbyidf      #
# -tracetpi         # Term positional information
# -notracetpi       #
# -initonly         # Initialize only
# -noinitonly       #
# -showquery        # Search engine prints out the qterms
# -noshowquery      #
# -qtermdup         # Search engine saves duplicate qterms (affects weights)
# -noqtermdup       #
# -qrefine          # Search engine refines query
# -noqrefine        #
# -docnos           # Display document numbers
# -nodocnos         #
# -lexnmzn type     # Lexical normalization function
#                   # Normalization can include case folding, stemming
#                   # and possibly other functionality (accent stripping..)
#                   # There is a choice of stemmers, all doing case folding:
#                   #   smart porter spanish french german none
#                   # Selecting the following passes the token untouched:
#                   #   ident
#
# -errlog           # Error log file
#
# -showlog          # Show log file for status
#
# -tpi file         # Term positional information file
# -notpi            #
# -scorefn fn       # Score function used by search engine to rank documents
#                   # Choices are: tfidf, binary, idf, tf1idf, tf2idf,
#                   # tf3idf, tfidf_lln, tfplusidf, release_1, tf, bm25idf
#
# -weightfn fn      # Weight function used by indexer to rank documents
#                   # Choices are: binary, idf, tf1idf, tf2idf, tf3idf,
#                   # tfidf, tfdf, tf, tfidf_lln, inversetfidf, local_idf,
#                   # dwf_plus_okapi, gavin, okapi1, okapi2, okapi3,
#                   # okapi4, okapi5, okapi6, okapi7, okapi8, okapi9,
#                   # dwf1, dwf2, dwf3, dwf4, dwf5, dwf6, dwf7, dwf8,
#                   # dwf9, gamma1, noise, bm25idf
#
# -rfsortfn fn      # Sorting function used by relevance feedback
#                   # where fn: TermIdf, TermIdf, TermPostrec
#                   # TermIdfWIPostrec, TermIdfFreqWIPostrec, TermIdfFreqPostrec
#                   # TermIdfFreq, Doszkocs, Porter, SmeatonvanRijsbergen
#                   # RobertsonCroft, RobertsonIdf.
#                   # Default is TermIdf
#
# -rfreweightfn fn  # Weighting function used by relevance feedback
#                   # ReweightIdf sets the term's reweight to be the term's idf. In our
#                   # current method this is the equivalent to not reweighting.
#                   # number of enhanced terms to be supplied when possible
#                   # Weighting function used by relevance feedback
#                   # where fn: Croft, Bookstein, Croft, CroftNeg,
#                   # CroftNeg0, ReweightIdf, Rsj
#                   # Default is Croft
#
# -minrfterms no.   # Minimum relevance feedback terms
#                   # where no. is an integer.
#
# -maxrfterms no.   # Maximum relevance feedback terms
#                   # where no. is an integer.
#--------------------------------------------------------------------------
-notpi
-noqrefine
-lexnmzn smart              # Lexical normalization
-scorefn tfidf
-weightfn dwf
-docnos
-errlog /tmp/test.err       # Error log file
-showlog /tmp/test.show     # Trace file
# Relevance feedback Control Items
# Parsed document(s) term sorting flags
-rfsortfn RobertsonIdf
# Term reweighting methods
-rfreweightfn Rsj
-minrfterms 0
-maxrfterms 20
#
# This is a fields.spec file, which associates the following pieces
# of data for the CRANFIELD collection:
#
#     user's name for a field in a tagged document
#     start tag
#     end tag
#     Z39.50 (use) attribute number associated with the field
#
# This information is repeated for each field which is tagged in the text
# of the document and which the user would like to be able to search.
#
# Each entry must begin with an "-entry" attribute.
# Before the first "-entry" attribute there must be a count of the total number
# of entries
# Within an entry the attribute-value pairs may appear in any order,
# Attributes and values must be surrounded by whitespace or comments.
# Within an attribute-value pair the attribute must come first.
#
# A comment begins with a pound sign and ends with the next newline
# Blank lines are ignored.
#------------------------------------------------------------------------------
3       # this entry count must be the first token in the file
#----------------------------------------------------------------
-entry                  # Required !
-fieldname _docid_
-starttag <_docid_>
-endtag <_docid_>
-attrnum 1032           # Doc-id reserved for system use
#----------------------------------------------------------------
-entry                  # Example field
-fieldname author
-starttag <author>
-endtag <author>
-attrnum 1003           # author
#----------------------------------------------------------------
-entry                  # Example field
-fieldname title
-starttag <title>
-endtag <title>
-attrnum 4              # title
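The rules in the comments of this example (leading entry count, "-entry" markers, attribute-value pairs in any order, "#" comments) are enough to sketch a reader in Python. The function name parse_fields_spec and the choice to return a list of dictionaries are assumptions for illustration only; this is not the parser PRISE itself uses.

```python
def parse_fields_spec(text):
    """Parse fields.spec text into a list of {attribute: value} dicts.

    The first token must be the entry count. Each "-entry" token starts
    a new entry, and the remaining tokens are attribute-value pairs.
    """
    tokens = []
    for line in text.splitlines():
        # A comment runs from '#' to the end of the line.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens:
        raise ValueError("missing entry count")

    expected = int(tokens[0])
    entries = []
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token == "-entry":
            entries.append({})
            i += 1
            continue
        if not entries or not token.startswith("-") or i + 1 >= len(tokens):
            raise ValueError("malformed attribute-value pair at: " + token)
        entries[-1][token.lstrip("-")] = tokens[i + 1]
        i += 2

    if len(entries) != expected:
        raise ValueError("entry count does not match number of entries")
    return entries
```

For the Cranfield file above, the first returned entry would hold the fieldname "_docid_" with attrnum "1032". Values are kept as strings, so converting attrnum to an integer is left to the caller.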
build.script.sh is provided for indexing multiple data files.
Below is an outline of the process and the programs it calls:
The Process: