User's Guide

Copyright (C) 2004-2005 Mikio Hirabayashi
Last Update: Thu, 04 Aug 2005 15:03:42 +0900

Table of Contents

  1. Introduction
  2. Attributes
  3. File Formats
  4. Search Conditions
  5. Administration Command
  6. CGI Script for Search

Introduction

This document describes in detail how to use the applications of Hyper Estraier. If you have not read the introduction document yet, please read it first.

Hyper Estraier is a full-text search system that uses an index database. So, before searching, you need to prepare an index in which the target documents are registered. Hyper Estraier provides the administration command `estcmd' and the CGI script `estseek.cgi' for search. The former is used to administer the index through a command line interface. The latter is used to search the index for documents with a web browser.

estcmd can handle various file formats and provides various operations to administer an index. How to use it is described in this document.

Hyper Estraier supports various search methods, such as combining several search phrases and searching by attributes of documents. Moreover, the presentation can be customized through the configuration of estseek.cgi. How to do so is described in this document.


Attributes

Not only the body text but also attributes such as the title and the modification date can be added to documents handled by Hyper Estraier. Attributes are used for various purposes, such as searching by attributes and deciding whether a document needs a differential update.

Attribute Name

Every attribute has a name. Although names can be chosen arbitrarily, some names are reserved for system attributes. Names of system attributes begin with "@". There are the following system attributes.

Attributes other than system attributes are called user-defined attributes. They can be defined in a document draft, which is described later. Meta attributes in HTML and headers of MIME are also treated as user-defined attributes.

Attribute Type

There are two data types for attributes: string and number. Data of the string type are arbitrary strings, and support such operations as full matching, forward matching, backward matching, and partial matching. Data of the number type are numbers or date information. A string treated as the number type is converted into a number and calculated according to the following formats. If the format is for a date, the value is computed relative to the UNIX epoch (1 Jan 1970).
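
The conversion of a string attribute into a number can be illustrated with a small Python sketch. It is not the converter used by Hyper Estraier: the function name to_number is hypothetical, and the only date format it assumes is the W3CDTF style seen in values such as "2004-11-01T23:11:18+09:00". Treating a date without an offset as UTC is also an assumption.

```python
from datetime import datetime, timezone


def to_number(value):
    """Convert an attribute string to a number (hypothetical helper)."""
    try:
        return float(value)  # plain numeric string such as "1024"
    except ValueError:
        pass
    # Assumed date format: ISO 8601 / W3CDTF, e.g. 2004-11-01T23:11:18+09:00
    moment = datetime.fromisoformat(value)
    if moment.tzinfo is None:
        # Assumption of this sketch: a date without an offset is UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.timestamp()  # seconds since the UNIX epoch
```

With such a conversion, the same stored string can be compared as a string in one search and as a number or a date in another.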

The data type is not determined at registration. It is determined at search time. The length of an attribute value is not limited.

Attributes and the body text of a document should be expressed in UTF-8 encoding. If another encoding is used, it should be converted into UTF-8. Note that estcmd detects the encoding automatically if it is not explicitly specified.

estcmd gives each document a URI attribute that begins with "file://". However, if a document defines its own URI, that one takes precedence. The URI on the local file system is stored in the attribute named "_lpath". The absolute path on the local file system is stored in the attribute named "_lreal". The file name, normalized to UTF-8, is stored in the attribute named "_lfile". The value of every attribute is normalized to UTF-8.


File Formats

estcmd handles four file formats. This section describes how the four are processed.

Plain Text

A document of plain-text is composed of strings with no structure. By default, files whose names end with ".txt", ".text", or ".asc" are treated as plain-text.

HTML

An HTML document is used as hypertext on the Web. By default, files whose names end with ".html", ".htm", ".xhtml", or ".xht" are treated as HTML.

MIME (e-mail)

MIME is used for communication by e-mail based on RFC822 and so on. By default, files whose names end with ".eml", ".mime", ".mht", or ".mhtml" are treated as MIME.

If the content type of a part of a multipart message is "text/plain", "text/html", or "message/rfc822", the content is treated as a part of the body text, so that web archives can be supported.

Document Draft

Document draft is an original format of Hyper Estraier. Various formats can be handled in an integrated way by using document draft as an intermediate format. By default, files whose names end with ".est" are treated as document draft.
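
Taken together, the default suffix rules for the four formats can be sketched in Python as a lookup table. This is only an illustration of the rules listed in this section, not the detection code of estcmd. The names SUFFIX_FORMATS and detect_format and the format labels are hypothetical, and the comparison is case sensitive only because nothing here states otherwise.

```python
import os

# Default suffix rules as listed under each format heading in this section.
SUFFIX_FORMATS = {
    ".txt": "text", ".text": "text", ".asc": "text",
    ".html": "html", ".htm": "html", ".xhtml": "html", ".xht": "html",
    ".eml": "mime", ".mime": "mime", ".mht": "mime", ".mhtml": "mime",
    ".est": "draft",
}


def detect_format(path):
    """Return the format label for a path, or None if the suffix is unknown."""
    suffix = os.path.splitext(path)[1]
    return SUFFIX_FORMATS.get(suffix)
```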

Though the format of document draft is similar to RFC822, the details differ. The delimiter for headers is not ":" but "=". Moreover, no space character is needed after "=". The following is example data for handling a MIDI document.

@uri=http://www.music-estraier.com/mididb/t/tw/twinkle.kar
@title=Twinkle Twinkle Little Star
@author=Jane Taylor
@cdate=2004-11-01T23:11:18+09:00
@mdate=2005-03-21T08:07:45+09:00
category=chorus,dance

Twinkle, twinkle, little star,
How I wonder what you are.
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are!
        Twinkle Twinkle Little Star
        Jane Taylor

The following specifications are required for document draft.

Hidden text is treated like normal text, except that it is not displayed in the snippets of the search result. It is useful for making the values of some attributes searchable.
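
To make the layout of a document draft concrete, here is a minimal Python sketch of a reader for it. It is not the parser of Hyper Estraier. It assumes that the header part ends at the first empty line and that a hidden-text line is one that starts with a tab character; the latter is an assumption of this sketch and is not spelled out above. The function name parse_draft is hypothetical.

```python
def parse_draft(text):
    """Split a document draft into attributes, body lines and hidden lines."""
    head, _, tail = text.partition("\n\n")  # headers end at the first empty line
    attrs = {}
    for line in head.splitlines():
        # The delimiter is "=", and no space is needed after it.
        name, sep, value = line.partition("=")
        if sep:
            attrs[name] = value
    body, hidden = [], []
    for line in tail.splitlines():
        if line.startswith("\t"):  # assumption: a leading tab marks hidden text
            hidden.append(line[1:])
        else:
            body.append(line)
    return attrs, body, hidden
```

Splitting each header at the first "=" only keeps values such as URIs intact even when they contain further "=" characters.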


Search Conditions

Two kinds of search conditions are supported. One is for full-text search and the other is for attribute search. If both are specified at the same time, documents that satisfy both are searched for. Moreover, a usual form and a simplified form are supported for the full-text search condition.

Full-text Search Conditions

The purpose of full-text search is to search for documents including some specified words. For example, to search for documents including the word "computer", specify "computer" as the search phrase as it is.

You can specify two or more words. For example, if you specify "United Nations", documents including "united" followed by "nations" are searched for. In case of simplified form, specify the following.

"united nations"

Intersection operation is supported by the "AND" operator. For example, if you specify "internet AND security", documents including both of "internet" and "security" are searched for. In case of simplified form, specify the following.

internet security

Difference operation is supported by the "ANDNOT" operator. For example, if you specify "hacker ANDNOT cracker", documents including "hacker" but not including "cracker" are searched for. In case of simplified form, specify the following.

hacker ! cracker

Union operation is supported by the "OR" operator. For example, if you specify "proxy OR firewall", documents including one or both of "proxy" and "firewall" are searched for. In case of simplified form, specify the following.

proxy | firewall

Note that the priority of "OR" is higher than that of "AND" and "ANDNOT". For example, if you specify "F1 OR F-1 OR Formula One AND Champion OR Victory", documents including at least one of "f1", "f-1", and "formula one", and also including one or both of "champion" and "victory", are searched for. In case of simplified form, specify the following.

F1 | F-1 | "Formula One" Champion | Victory
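
The effect of this priority can be sketched in Python by splitting a phrase of the usual form at "AND" and "ANDNOT" first and at "OR" second. This only illustrates the grouping and is not the query parser of Hyper Estraier. The function name group_phrase is hypothetical.

```python
import re


def group_phrase(phrase):
    """Group a usual-form phrase into (operator, alternatives) pairs."""
    # Split at the lower-priority operators first; operators are case sensitive.
    parts = re.split(r"\s+(ANDNOT|AND)\s+", phrase.strip())
    groups = [(None, re.split(r"\s+OR\s+", parts[0]))]
    for operator, chunk in zip(parts[1::2], parts[2::2]):
        # Inside each chunk, "OR" joins the alternatives.
        groups.append((operator, re.split(r"\s+OR\s+", chunk)))
    return groups
```

For the phrase above, the first group holds the alternatives "F1", "F-1", and "Formula One", and the second group, joined by "AND", holds "Champion" and "Victory".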

Search words are case insensitive. However, operators are case sensitive. If you want to search for documents including "AND", specify "and" instead.

Wild cards are also supported. They can be used for forward match search and backward match search. For example, "[BW] euro" matches words which begin with "euro", and "[EW] sphere" matches words which end with "sphere". In case of simplified form, "euro*" and "*sphere" are used instead.
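
The correspondence between the two forms can be summarized by a small Python sketch that rewrites a simplified phrase into the usual form. It covers only the rules given in this section (quoted phrases, a space for "AND", "!" for "ANDNOT", "|" for "OR", and the two wild cards) and may differ from the actual parser in corner cases. The function name simplified_to_usual is hypothetical.

```python
import re

OPERATORS = {"|": "OR", "!": "ANDNOT"}


def simplified_to_usual(phrase):
    """Rewrite a simplified-form phrase into the usual form (sketch)."""
    tokens = []
    # A token is either a double-quoted phrase or a run of non-space characters.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', phrase):
        if bare in OPERATORS:
            tokens.append(("op", OPERATORS[bare]))
        elif quoted:
            tokens.append(("term", quoted))
        elif bare.endswith("*") and len(bare) > 1:
            tokens.append(("term", "[BW] " + bare[:-1]))  # forward match
        elif bare.startswith("*") and len(bare) > 1:
            tokens.append(("term", "[EW] " + bare[1:]))   # backward match
        elif bare:
            tokens.append(("term", bare))
    words, previous = [], None
    for kind, text in tokens:
        if kind == "term" and previous == "term":
            words.append("AND")  # adjacent terms mean intersection
        words.append(text)
        previous = kind
    return " ".join(words)
```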

Attribute Search Conditions

The purpose of attribute search is to search for documents whose attributes match the specified expression. An expression of attribute search is composed of an attribute name, an operator, and a value. They are separated with space characters. For example, if you specify "@title STRINC IMPORTANT", documents whose title includes "IMPORTANT" are searched for. The following operators for attribute search are supported.

If an operator is preceded by "!", the meaning is inverted. If an operator is preceded by "I", the case of the value is ignored.
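
The structure of an attribute expression and the two prefixes can be illustrated with the following Python sketch. It models only the STRINC operator, which is the one shown in the example above, and it is not the evaluator of Hyper Estraier. The function names are hypothetical. That "!" comes before "I" when both are given is an assumption of this sketch.

```python
def parse_attr_condition(expr):
    """Split an expression into name, operator, value and the two flags."""
    name, operator, value = expr.split(None, 2)
    negate = operator.startswith("!")       # "!" inverts the meaning
    if negate:
        operator = operator[1:]
    ignore_case = operator.startswith("I")  # "I" ignores the case of the value
    if ignore_case:
        operator = operator[1:]
    return name, operator, value, negate, ignore_case


def matches(attrs, expr):
    """Check a dictionary of attributes against one expression (STRINC only)."""
    name, operator, value, negate, ignore_case = parse_attr_condition(expr)
    if operator != "STRINC":
        raise ValueError("only STRINC is modelled in this sketch")
    actual = attrs.get(name, "")
    if ignore_case:
        actual, value = actual.lower(), value.lower()
    result = value in actual
    return not result if negate else result
```

Note that this simple prefix handling would misread an operator whose own name begins with "I"; a real implementation has to look the name up in the list of operators.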

Order of the Result

You can specify the order of the result by an expression. An ordering expression is composed of an attribute name and an operator. For example, if you specify "@size NUMA", documents in the result are in ascending order of the size. The following operators for ordering are supported.

By default, the order of the result is descending by score. The score is calculated by the number of specified words in each document.
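
As an illustration of an ordering expression, the following Python sketch sorts a list of documents by "@size NUMA". It models only the NUMA operator from the example above and is not the sorting code of Hyper Estraier. The function name order_documents is hypothetical, and treating a missing attribute as 0 is an assumption.

```python
def order_documents(docs, expr):
    """Sort attribute dictionaries by an ordering expression (NUMA only)."""
    name, operator = expr.split()
    if operator != "NUMA":
        raise ValueError("only NUMA is modelled in this sketch")
    # Values are stored as strings, so convert them before comparing.
    return sorted(docs, key=lambda doc: float(doc.get(name, 0)))
```

The conversion matters: compared as strings, "1000" would sort before "20".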


Administration Command

This section describes the specification of estcmd. estcmd can perform not only indexing but also search.

Synopsis and Description

estcmd is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument db specifies the path of an index.

estcmd put [-cl] db [file]
    Register a document in document draft format to an index.
    file specifies a target file. If it is omitted, the standard input is read.
    If -cl is specified, regions of an overwritten document are cleaned up.

estcmd out [-cl] db expr
    Remove information of a document from an index.
    expr specifies the ID number or the URI of a document.
    If -cl is specified, regions of the document are cleaned up.

estcmd edit [-cl] db expr name [value]
    Edit an attribute of a document in an index.
    expr specifies the ID number or the URI of a document.
    name specifies the name of an attribute.
    value specifies the value of the attribute. If it is omitted, the attribute is removed.

estcmd get db expr [attr]
    Output document draft of a document in an index.
    expr specifies the ID number or the URI of a document.
    If attr is specified, only the value of the attribute is output.

estcmd list db
    Output a list of all documents in an index.

estcmd uriid db uri
    Output the ID number of a document specified by URI.
    uri specifies the URI of a document.

estcmd meta db [name [value]]
    Handle meta data.
    name specifies the name of a piece of meta data. If it is omitted, a list of all names is output.
    value specifies the value of the meta data to be recorded. If it is omitted, the current value is output. If it is an empty string, the meta data is removed.

estcmd inform db
    Output the number of documents and the number of unique words in an index.

estcmd optimize [-onp] [-ond] db
    Optimize an index and clean up dispensable regions.
    If -onp is specified, cleaning up dispensable regions is skipped.
    If -ond is specified, optimizing the database files is skipped.

estcmd search [-ic enc] [-vu|-va|-vf|-vs|-vh|-vx|-dd] [-kn num] [-gs|-gf|-ga] [-cd] [-ni] [-sf] [-hs] [-attr expr] [-ord expr] [-max num] [-sim id] db [phrase]
    Search an index for documents.
    phrase specifies the search phrase.
    -ic specifies the input encoding. By default, it is UTF-8.
    If -vu is specified, TSV of ID number and URI is output.
    If -va is specified, multipart format including attributes is output.
    If -vf is specified, multipart format including document draft is output.
    If -vs is specified, multipart format including attributes and snippets is output.
    If -vh is specified, human readable format including attributes and snippets is output.
    If -vx is specified, XML including attributes and snippets is output.
    If -dd is specified, document draft data are dumped and saved into separate files.
    -kn specifies the number of keywords to be extracted. By default, no keyword is extracted.
    If -gs is specified, every N-gram key is checked. By default, every other key is checked.
    If -gf is specified, every third N-gram key is checked.
    If -ga is specified, every fourth N-gram key is checked.
    If -cd is specified, each document is checked to confirm that it definitely matches the search phrase.
    If -ni is specified, TF-IDF tuning is omitted.
    If -sf is specified, the phrase is treated as a simplified form.
    If -hs is specified, score information is output as a hint.
    -attr specifies an attribute search condition. This option can be specified multiple times.
    -ord specifies the order expression. By default, it is descending by score.
    -max specifies the maximum number of shown documents. Negative means unlimited. By default, it is 10.
    -sim specifies the ID number of the seed document for similarity search.

estcmd gather [-cl] [-fe|-ft|-fh|-fm] [-fx sufs cmd] [-fz] [-fo] [-rm sufs] [-ic enc] [-il lang] [-bc] [-pc enc] [-px name] [-apn] [-sd] [-cm] [-cs num] db [file|dir]
    Scan the local file system and register documents into an index.
    If the third argument is the name of a file, a list of paths of target documents is read from it. If it is "-", the standard input is read.
    If the third argument is the name of a directory, all files under the directory are treated as target documents.
    If -cl is specified, regions of overwritten documents are cleaned up.
    If -fe is specified, target files are treated as document draft. By default, the format is detected by the suffix of each document.
    If -ft is specified, target files are treated as plain text.
    If -fh is specified, target files are treated as HTML.
    If -fm is specified, target files are treated as MIME.
    If -fx is specified, target files with the specified suffixes are processed by the specified outer command. If the command is prefixed with "T@", the output of the command is treated as plain text. If the command is prefixed with "H@", the output of the command is treated as HTML. If the command is prefixed with "M@", the output of the command is treated as MIME. Otherwise, the output is treated as document draft. This option can be specified multiple times.
    If -fz is specified, documents which do not match the condition of -fx are ignored.
    If -fo is specified, target files are not read. It is useful for efficient processing by the outer command.
    If -rm is specified, target files with the specified suffixes are removed. "*" matches any file. This option can be specified multiple times.
    -ic specifies the input encoding. By default, it is detected automatically.
    -il specifies the preferred input language. By default, English is preferred.
    If -bc is specified, binary files are detected and ignored.
    -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
    -px specifies the name of an attribute read from the list of paths. As the list of paths can be in TSV format, the first field is treated as the path of a target document, and the second and following fields are treated as attribute values. -px specifies the name for each value of the second and following fields. This option can be specified multiple times.
    If -apn is specified, N-gram analysis is performed against European text also.
    If -sd is specified, the creation date and the modification date of each file are recorded as attributes.
    If -cm is specified, documents whose modification date has not changed are ignored.
    -cs specifies the size of cache memory in megabytes. By default, it is 64MB.

estcmd purge [-cl] [-fc] db [prefix]
    Purge information of documents which do not exist on the file system.
    If prefix is specified, only documents whose URIs begin with it are processed.
    If -cl is specified, regions of the deleted documents are cleaned up.
    If -fc is specified, information of all target documents is deleted.

estcmd extkeys [-fc] [-dfdb file] [-ni] [-kn num] db [prefix]
    Create a database of keywords extracted from documents.
    If prefix is specified, only documents whose URIs begin with it are processed.
    If -fc is specified, all target documents are processed whether or not they have existing records.
    -dfdb specifies an outer database of document frequency. By default, document frequency is calculated dynamically according to the index.
    If -ni is specified, TF-IDF tuning is omitted.
    -kn specifies the number of keywords to be extracted.

estcmd words [-dfdb file] db
    Output a list of all unique words and each record size, which is treated as document frequency.
    -dfdb specifies an outer database where the result is stored. By default, the result is output to the standard output as TSV. If the outer database already exists, the value of each record is incremented.

estcmd draft [-ft|-fh|-fm] [-ic enc] [-il lang] [-bc] [file]
    For testing and debugging.

estcmd break [-ic enc] [-il lang] [-apn] [-wt] [file]
    For testing and debugging.

estcmd randput [-ren|-rla|-reu|-ror|-rjp|-rch] [-cs num] db dnum
    For testing and debugging.

estcmd wicked db dnum
    For testing and debugging.

estcmd regression db
    For testing and debugging.

estcmd version
    Show the version information.

All sub commands return 0 if the operation succeeds, and 1 otherwise. The sub commands put, out, gather, purge, randput, wicked, and regression close the database and finish when they catch the signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), 13 (SIGPIPE), or 15 (SIGTERM).
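
When estcmd is driven from a program, the synopsis and the return value can be used as in the following Python sketch. It assumes that estcmd is found on PATH. The function names and their parameters are hypothetical; only the options themselves (-vu, -sf, -attr, -ord, -max) are taken from the synopsis of the search sub command.

```python
import subprocess


def build_search_command(db, phrase, attrs=(), order=None, max_docs=10, simplified=False):
    """Build the argument list for "estcmd search" with TSV output (-vu)."""
    command = ["estcmd", "search", "-vu"]
    if simplified:
        command.append("-sf")
    for expr in attrs:  # -attr can be specified multiple times
        command += ["-attr", expr]
    if order:
        command += ["-ord", order]
    command += ["-max", str(max_docs), db, phrase]
    return command


def run_search(db, phrase, **options):
    """Run the search and return the raw output; fail on a non-zero status."""
    done = subprocess.run(build_search_command(db, phrase, **options),
                          capture_output=True, text=True)
    if done.returncode != 0:  # sub commands return 1 on failure
        raise RuntimeError("estcmd search failed: " + done.stderr.strip())
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems with phrases that contain spaces, such as 'socket AND shutdown'.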

The encoding name specified by the -ic option should be a name registered to IETF, such as UTF-8 or ISO-8859-1. The language name specified by the -il option should be one of "en" (English), "ja" (Japanese), "zh" (Chinese), or "ko" (Korean).

The outer command specified by the -fx option of gather receives the path of the target document as the first argument and the path for output as the second argument. The original path of the target document is given as the value of the environment variable `ESTORIGFILE'.
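
An outer command following this calling convention could look like the Python sketch below. It is not one of the filters shipped with Hyper Estraier. It assumes that the target file is UTF-8 text and writes a document draft, which is how the output is treated when the command has no "T@", "H@", or "M@" prefix. The function name text_to_draft and the choice of the file name as the title are hypothetical.

```python
#!/usr/bin/env python3
import os
import sys


def text_to_draft(src_path, dst_path, orig_path):
    """Write a document draft for a text file, titled with its original name."""
    with open(src_path, encoding="utf-8", errors="replace") as src:
        body = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("@title=" + os.path.basename(orig_path) + "\n")
        dst.write("\n")  # an empty line separates the headers from the body
        dst.write(body)


if __name__ == "__main__" and len(sys.argv) == 3:
    # First argument: target document; second argument: path for output.
    text_to_draft(sys.argv[1], sys.argv[2],
                  os.environ.get("ESTORIGFILE", sys.argv[1]))
```

If such a script were installed as an executable on PATH under a name of your choice, it would be passed to gather in the cmd part of the -fx option.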

Note that similarity search is very slow, by default. To improve the performance of similarity search, running "estcmd extkeys" beforehand is strongly recommended.

Examples

The following registers mail files in MH format.

find /home/mikio/Mail -type f | egrep 'inbox/(business|friends)/[0-9]+$' |
  estcmd gather -cl -fm -cm casket -

The following registers MS-Office files. estfxmsotohtml requires wvWare and xlhtml.

PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".doc,.xls,.ppt" "H@estfxmsotohtml" -fz -sd -cm casket .

The following registers PDF files. estfxpdftohtml requires pdftotext.

PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".pdf" "H@estfxpdftohtml" -fz -sd -cm casket .

The following registers cache files of WWWOFFLE, a proxy server. estwolefind requires WWWOFFLE.

estwolefind /var/spool/wwwoffle |
  estcmd gather -cl -fm -bc -px @uri -px _lfile -sd -cm casket -

The following outputs the search result as XML.

estcmd search -vx -max 8 casket 'socket AND shutdown'

CGI Script for Search

This section describes the specification of estseek.cgi. It mainly covers how to write the configuration files.

Composition

estseek.cgi needs three configuration files: the prime configuration file, the template file, and the top page file. Their default names are `estseek.conf', `estseek.tmpl', and `estseek.top'.

The name of the prime configuration file is determined by changing the suffix of the CGI script to ".conf". If you change the name of `estseek.cgi' to `estsearch.cgi', `estsearch.conf' is read. The names of the template file and the top page file are specified in the prime configuration file. So, you can install several sets of search scripts in one directory.

estseek.cgi is installed as `/usr/local/libexec/estseek.cgi', so copy it to a directory for CGI scripts. Moreover, samples of the configuration files are installed in `/usr/local/share/hyperestraier/', so copy and modify them.

Prime Configuration File

The prime configuration file is composed of lines, each of which contains the name of a variable and its value separated by ":". By default, the configuration is as follows.

indexname: casket
tmplfile: estseek.tmpl
topfile: estseek.top
logfile:
lprefix: file:///home/mikio/public_html/
gprefix: http://localhost/
gsuffix:
dirindex: index.html
replace: //localhost/{{!}}//127.0.0.1/
replace: //127.0.0.1:80/{{!}}//127.0.0.1/
showlreal: false
perpage: 10,20,30,40,50,100
attrselect: false
showscore: false
extattr: author|Author
extattr: from|From
extattr: to|To
extattr: cc|Cc
extattr: date|Date
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
condgstep: 2
dotfidf: true
scancheck: false
smplphrase: true
candetail: true
smlrvnum: 0
spcache:
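
The line format of the prime configuration file, and the rule that derives its name from the name of the CGI script, can be sketched in Python as follows. This is not the reader used by estseek.cgi. The function names are hypothetical, and values are kept as raw strings because variables such as replace and extattr may appear more than once.

```python
import os


def conf_name_for(script_name):
    """Change the suffix of the CGI script name to ".conf"."""
    return os.path.splitext(script_name)[0] + ".conf"


def parse_prime_conf(text):
    """Collect "name: value" lines; every name maps to a list of values."""
    conf = {}
    for line in text.splitlines():
        # Split at the first ":" only, so that URIs in values stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        conf.setdefault(name.strip(), []).append(value.strip())
    return conf
```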

The meaning of each variable is as follows.

Template File

The template file determines the appearance of the page. It is written in HTML, and its content is output as it is. However, "<!--ESTFORM-->" is replaced by the form to input search conditions. "<!--ESTRESULT-->" is replaced by the search result. "<!--ESTINFO-->" is replaced by information of the index.
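
The replacement of the three markers can be modelled by the following Python sketch. It only illustrates the idea and is not how estseek.cgi is implemented. The function name render_template and its parameters are hypothetical.

```python
def render_template(template, form_html, result_html, info_html):
    """Replace the three markers of a template with the given HTML pieces."""
    return (template
            .replace("<!--ESTFORM-->", form_html)
            .replace("<!--ESTRESULT-->", result_html)
            .replace("<!--ESTINFO-->", info_html))
```

Because the markers are HTML comments, a template remains a valid HTML page that can be previewed in a browser before it is installed.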

Top Page File

When a user accesses the CGI script for the first time, or if no search condition is input, the content of the top page file is displayed instead of the search result. By default, the usage of the CGI script is described there.

Search Form

If you want to place the search form in another page, write the following HTML.

<form method="get" action="estseek.cgi">
<div>
<input type="text" name="phrase" value="" size="32" />
<input type="submit" value="Search" />
<input type="hidden" name="enc" value="UTF-8" />
</div>
</form>

Change "estseek.cgi" to the URI of estseek.cgi. Change "UTF-8" to the encoding name of the page.
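
The request that this form sends can also be built directly, for example to link to a search result from another page. The following Python sketch assumes only the two parameters used in the form above, "phrase" and "enc". The function name search_url and the URI in the usage example are hypothetical.

```python
from urllib.parse import urlencode


def search_url(script_uri, phrase, enc="UTF-8"):
    """Build the GET request URI that the search form would submit."""
    query = urlencode({"phrase": phrase, "enc": enc}, encoding=enc)
    return script_uri + "?" + query
```

For example, search_url("http://localhost/cgi-bin/estseek.cgi", "united nations") gives a URI whose query string is "phrase=united+nations&enc=UTF-8".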