3.4. indexer configuration

3.4.1. Specifying WEB space to be indexed

When indexer tries to insert a new URL into the database, or to index an existing one, it first checks whether the URL has a corresponding Server, Realm or Subnet command in indexer.conf. URLs without a corresponding Server, Realm or Subnet command are not indexed. By default, URLs that are already in the database but have no corresponding Server/Realm/Subnet command are deleted from the database. This may happen, for example, after some Server/Realm/Subnet commands have been removed from indexer.conf.

These commands have the following format:

[Server | Realm | Subnet] [method] [subsection] [CaseType] [MatchType] [CmpType] pattern [alias]

The mandatory parameter pattern specifies a URL, a part of a URL, or a pattern to compare against.

The optional parameter method specifies the document action for this command. It may take the values Allow, Disallow, HrefOnly, CheckOnly, Skip, CheckMP3 and CheckMP3Only. The default is Allow.

  1. Allow

    The value Allow specifies that all corresponding documents will be indexed and scanned for new links. Depending on the Content-Type, an appropriate external parser is executed if needed.

  2. Disallow

    The value Disallow specifies that all corresponding documents will be ignored, and deleted from the database if they were stored there before.

  3. HrefOnly

    The value HrefOnly specifies that all corresponding documents will only be scanned for new links (not indexed). This is useful, for example, when indexing mail archives, where index pages are scanned only to detect new messages to index.

    
    Server HrefOnly Page http://www.mail-archive.com/general%40mnogosearch.org/
    Server Allow    Path http://www.mail-archive.com/general%40mnogosearch.org/
    

  4. CheckOnly

    The value CheckOnly specifies that all corresponding documents will be requested with an HTTP HEAD request instead of HTTP GET, i.e. only brief information about each document (size, last modification time, content type) will be fetched. This allows you, for example, to check the links on your site:

    
    Server HrefOnly  http://www.mnogosearch.org/
    Realm  CheckOnly *
    

    These commands instruct indexer to scan all documents on the www.mnogosearch.org site and collect all links. Brief information about every document found will be requested with the HEAD method. After indexing is done, the indexer -S command will show the status of all documents from this site.

  5. Skip

    The value Skip specifies that all corresponding documents will be skipped while indexing. This is useful when you need to temporarily disable reindexing of several sites but still be able to search them. These documents will be marked as expired.

  6. CheckMP3

    The value CheckMP3 specifies that corresponding documents will be checked for MP3 tags even if their Content-Type is not audio/mpeg. This is useful, for example, if a remote server supplies application/octet-stream as the Content-Type for MP3 files. If an MP3 tag is present, the file will be indexed as an MP3 file; otherwise it will be processed according to its Content-Type.

  7. CheckMP3Only

    This value behaves like CheckMP3, except that if no MP3 tag is present, the document is not processed according to its Content-Type.

Use the optional subsection parameter to specify how URLs are checked against the command. The subsection value must be one of nofollow, page, path, site or world; the default is path.

  1. path subsection

    When indexer looks for a Server command corresponding to a URL, it checks that the discovered URL starts with the URL given as the Server command argument, with any trailing file name removed. For example, if Server path http://localhost/path/to/index.html is given, all URLs that begin with http://localhost/path/to/ correspond to this Server command.

    The following commands have the same effect, except that they insert different URLs into the database:

    
    Server path http://localhost/path/to/index.html
    Server path http://localhost/path/to/index
    Server path http://localhost/path/to/index.cgi?q=bla
    Server path http://localhost/path/to/index?q=bla
    

  2. site subsection

    indexer checks that the discovered URL has the same hostname as the URL given in the Server command. For example, Server site http://localhost/path/to/a.html allows indexing of the whole http://localhost/ server.

  3. world subsection

    If the world subsection is specified in a Server command, any URL is considered to match this Server command. See the explanation below.

  4. page subsection

    This subsection describes only the single URL given as the Server argument.

  5. nofollow subsection

    Links are not followed from URLs that match the pattern.

  6. subsection in news:// schema

    The subsection is always considered to be "site" for the news:// URL schema, because news:// has no nested paths as ftp:// and http:// do. Use Server news://news.server.com/ to index the whole news server or, for example, Server news://news.server.com/udm to index all messages from the "udm" hierarchy.
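
The path subsection rule described above can be sketched in Python. This is only an illustration of the prefix check as the text describes it, not the indexer's actual code; the function names path_prefix and matches_path are hypothetical.

```python
from urllib.parse import urlsplit


def path_prefix(server_url: str) -> str:
    """Return the URL prefix that a 'Server path' argument is assumed to cover."""
    parts = urlsplit(server_url)
    # Drop everything after the last '/' of the path, i.e. the trailing file name.
    # The query string is not part of parts.path, so it is dropped as well.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def matches_path(server_url: str, found_url: str) -> bool:
    """True if found_url starts with the prefix derived from server_url."""
    return found_url.startswith(path_prefix(server_url))
```

Under these assumptions, all four Server path examples shown for the path subsection reduce to the same prefix, http://localhost/path/to/.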

The optional parameter CaseType specifies the case sensitivity of string comparison. It can take one of the following values: case for case-sensitive comparison, or nocase for case-insensitive comparison.

The optional parameter CmpType specifies the type of comparison and can take two values: Regex and String. String with wildcards is the default comparison type. You can use the ? and * signs in the pattern; they mean "one character" and "any number of characters" respectively. For example, if you want to index all HTTP sites in the .ru domain, use this command:

Realm http://*.ru/*

The Regex comparison type takes a regular expression as its argument. Activate it with the Regex keyword. For example, you can describe everything in the .ru domain using the regex comparison type:

Realm Regex ^http://.*\.ru/

The optional parameter MatchType specifies the match type. The possible values are Match and NoMatch, with Match as the default. Realm NoMatch has the reverse effect: a URL that does not match the given pattern corresponds to this Realm command. For example, use this command to index everything outside the .com domain:

Realm NoMatch http://*.com/*
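
To make the wildcard and NoMatch semantics concrete, here is a minimal Python sketch. It assumes that a String pattern must cover the whole URL and it ignores the CaseType parameter; both are simplifications for illustration, and wildcard_to_regex and realm_matches are hypothetical names, not part of DataparkSearch.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a String pattern: '*' is any run of characters, '?' is one character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)


def realm_matches(pattern: str, url: str, nomatch: bool = False) -> bool:
    """Return True if the URL corresponds to the command; NoMatch inverts the result."""
    hit = wildcard_to_regex(pattern).fullmatch(url) is not None
    return hit != nomatch
```

With this sketch, realm_matches("http://*.com/*", url, nomatch=True) is true for any URL outside the .com domain, which mirrors the Realm NoMatch example.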

The optional alias argument allows very complicated URL rewriting, more powerful than the other aliasing mechanisms. See Section 3.4.2 for an explanation of alias argument usage. Alias works only with the Regex comparison type and has no effect with the String type.

3.4.1.1. Server command

This is the main command of the indexer.conf file. It is used to add servers, or parts of them, to be indexed. This command also tells indexer to insert the given URL into the database at startup.

E.g. the command Server http://localhost/ allows the whole http://localhost/ server to be indexed. It also makes indexer insert the given URL into the database at startup. You can also specify a path to index only a subsection of the server: Server http://localhost/subsection/. This too tells indexer to insert the given URL at startup.

Note: You can stop indexer from adding the URLs given in Server commands by using the -q indexer command line argument. This is useful when you have hundreds or thousands of Server commands and their URLs are already in the database. It makes indexer start up faster.

3.4.1.2. Realm command

The Realm command is a more powerful means of describing the web area to be indexed. It works almost like the Server command, but takes a regular expression or a string with wildcards as its pattern parameter and does not insert any URL into the database for indexing.

3.4.1.3. Subnet command

The Subnet command is another way to describe the web area to be indexed. It works almost like the Server command, but takes as its pattern argument a string with wildcards or a network in CIDR presentation format, which is compared against the IP address instead of the URL. In the string wildcards format, the argument may contain ? and * signs; they mean "one character" and "any number of characters" respectively. For example, if you want to index all HTTP sites in your local subnet, use this command:

Subnet 192.168.*.*

For a network in CIDR presentation format, you may specify the subnet in the forms a.b.c.d/m, a.b.c, a.b or a:

Subnet 192.168.10.0/24

You may use the optional NoMatch argument. For example, if you want to index everything outside the 195.x.x.x subnet, use:

Subnet NoMatch 195.*.*.*
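
The CIDR form of the Subnet check can be illustrated with Python's standard ipaddress module. This is a sketch of the idea only: it handles the full a.b.c.d/m form, does not cover the shortened a.b.c, a.b and a forms, and the function name in_subnet is hypothetical.

```python
import ipaddress


def in_subnet(cidr: str, ip: str, nomatch: bool = False) -> bool:
    """Return True if the IP address corresponds to the Subnet command.

    cidr is a network in a.b.c.d/m form; nomatch inverts the result,
    like the NoMatch argument.
    """
    hit = ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)
    return hit != nomatch
```

For example, in_subnet("192.168.10.0/24", "192.168.10.77") is true, while the same address checked with nomatch=True is rejected.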

3.4.1.4. Using different parameters for a server and its subsections

indexer looks up "Server" and "Realm" commands in the order of their appearance. Thus, if you want to give different parameters to, e.g., a whole server and one of its subsections, put the subsection line before the line for the whole server. Imagine that you have a server subdirectory that contains news articles. Those articles should be reindexed more often than the rest of the server. The following combination may be useful in such cases:


# Add subsection
Period 200000
Server http://servername/news/

# Add server
Period 600000
Server http://servername/

These commands give the /news/ subdirectory a different reindexing period from that of the server as a whole. indexer will choose the first "Server" record for http://servername/news/page1.html, because it matches and was given first.
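
The effect of command order can be sketched as a first-match lookup. The list below mirrors the two Server lines from the example; the data structure and the function period_for are hypothetical and only illustrate the rule that the first matching record wins.

```python
# (URL prefix, Period) pairs in the order the Server commands appear in indexer.conf.
SERVERS = [
    ("http://servername/news/", 200000),
    ("http://servername/", 600000),
]


def period_for(url, servers=SERVERS):
    """Return the Period of the first Server record whose prefix matches the URL."""
    for prefix, period in servers:
        if url.startswith(prefix):
            return period
    return None  # no corresponding Server record
```

If the two entries were swapped, the record for the whole server would match every URL first and the /news/ period would never be used.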

3.4.1.5. Default indexer behavior

The default behavior of indexer is to follow links that have a corresponding Server/Realm command in the indexer.conf file. It also jumps between servers if both of them are present in indexer.conf, either directly in a Server command or indirectly in a Realm command. For example, suppose there are two Server commands:


Server http://www/
Server http://web/

When indexing http://www/page1.html, indexer WILL follow the link http://web/page2.html if it is found. Note that these pages are on different servers, but BOTH of them have a corresponding Server record.

If one of the Server commands is deleted, indexer will remove all expired URLs from that server during the next reindexing.

3.4.1.6. Using indexer -f <filename>

The third scheme is very useful when running indexer -i -f url.txt. You may maintain the required servers in url.txt. When a new URL is added to url.txt, indexer will index the server of this URL during the next startup.

3.4.2. Aliases

DataparkSearch has alias support, which makes it possible to index sites while taking information from another location. For example, if you index a local web server, it is possible to take pages directly from disk without involving your web server in the indexing process. Another example is building a search engine for a primary site while using its mirror for indexing. There are several ways of using aliases.

3.4.2.1. Alias indexer.conf command

Format of "Alias" indexer.conf command:


Alias <masterURL> <mirrorURL>

E.g. suppose you wish to index http://search.mnogo.ru/ using the nearest German mirror http://www.gstammw.de/mirrors/mnoGoSearch/. Add these lines to your indexer.conf:


Server http://search.mnogo.ru/
Alias  http://search.mnogo.ru/  http://www.gstammw.de/mirrors/mnoGoSearch/

search.cgi will display URLs from the master site http://search.mnogo.ru/, but indexer will take the corresponding pages from the mirror site http://www.gstammw.de/mirrors/mnoGoSearch/.

Another example: suppose you want to index everything in the udm.net domain, and one of the servers, for example http://home.udm.net/, is stored on the local machine in the /home/httpd/htdocs/ directory. These commands will be useful:


Realm http://*.udm.net/
Alias http://home.udm.net/ file:/home/httpd/htdocs/

indexer will take home.udm.net from the local disk and index the other sites using HTTP.

3.4.2.2. Different aliases for server parts

Aliases are searched in the order of their appearance in indexer.conf. So, you can create different aliases for server and its parts:


# First, create alias for example for /stat/ directory which
# is not under common location:
Alias http://home.udm.net/stat/  file:/usr/local/stat/htdocs/

# Then create alias for the rest of the server:
Alias http://home.udm.net/ file:/usr/local/apache/htdocs/

Note: if you change the order of these commands, alias for /stat/ directory will never be found.
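
A short Python sketch of the ordered prefix lookup may help. It assumes an alias is applied by replacing the matching URL prefix, as the examples suggest; the ALIASES list mirrors the two Alias lines above and apply_alias is a hypothetical name.

```python
# (master prefix, mirror prefix) pairs in the order the Alias commands appear.
ALIASES = [
    ("http://home.udm.net/stat/", "file:/usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:/usr/local/apache/htdocs/"),
]


def apply_alias(url, aliases=ALIASES):
    """Rewrite the URL with the first alias whose master prefix matches."""
    for master, mirror in aliases:
        if url.startswith(master):
            return mirror + url[len(master):]
    return url  # no alias applies; fetch the URL as-is
```

With the order reversed, the general http://home.udm.net/ prefix would match first, which is why the /stat/ alias would never be used.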

3.4.2.3. Using alias in Server command

You may specify location used by indexer as an optional argument for Server command:


Server  http://home.udm.net/  file:/home/httpd/htdocs/

3.4.2.4. Using alias in Realm command

Aliases in the Realm command are a very powerful feature based on regular expressions. The idea behind them is similar to how the PHP preg_replace() function works. Aliases in the Realm command work only if the "regex" match type is used and do not work with the "string" match type.

Use this syntax for Realm aliases:


Realm regex <URL_pattern> <alias_pattern>

indexer searches the URL for matches to URL_pattern and builds a URL alias using alias_pattern. alias_pattern may contain references of the form $n, where n is a number in the range 0-9. Every such reference is replaced by the text captured by the n'th parenthesized subpattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.

Example: your company hosts several hundred users with their domains in the form www.username.yourname.com. Every user's site is stored on disk in "htdocs" under the user's home directory: /home/username/htdocs/.

You may write this command into indexer.conf (note that the dot '.' character has a special meaning in regular expressions and must be escaped with a '\' sign when a literal dot is meant):


Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*)  file:/home/$2/htdocs/$4

Imagine indexer processes the http://www.john.yourname.com/news/index.html page. It will build patterns from $0 to $4:


   $0 = 'http://www.john.yourname.com/news/index.html' (whole pattern match)
   $1 = 'http://www.'      subpattern matches '(http://www\.)'
   $2 = 'john'             subpattern matches '(.*)'
   $3 = '.yourname.com/'   subpattern matches '(\.yourname\.com/)'
   $4 = 'news/index.html'  subpattern matches '(.*)'

Then indexer will compose alias using $2 and $4 patterns:


file:/home/john/htdocs/news/index.html

and will use the result as document location to fetch it.
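
The substitution can be reproduced with Python's re module as a rough check. This is a sketch only: Python's regex dialect may differ from the one indexer uses, and build_alias is a hypothetical helper, not a DataparkSearch function.

```python
import re


def build_alias(url_pattern: str, alias_pattern: str, url: str):
    """Match url against url_pattern and expand $0..$9 references in alias_pattern."""
    m = re.match(url_pattern, url)
    if m is None:
        return None  # the Realm command does not apply to this URL
    # Replace every $n with the text captured by the n'th group ($0 is the whole match).
    return re.sub(r"\$(\d)", lambda ref: m.group(int(ref.group(1))), alias_pattern)


alias = build_alias(
    r"(http://www\.)(.*)(\.yourname\.com/)(.*)",
    "file:/home/$2/htdocs/$4",
    "http://www.john.yourname.com/news/index.html",
)
print(alias)  # file:/home/john/htdocs/news/index.html
```

Note that the slash after yourname.com is captured by the third subpattern, so $4 starts with "news", not with a slash.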

3.4.2.5. AliasProg command

You may also use the "AliasProg" command for aliasing purposes. AliasProg is useful for major web hosting companies that want to index their web space taking documents directly from disk, without involving the web server in the indexing process. The documents layout may be too complex to describe with an alias in the Realm command. AliasProg names an external program that takes a URL and writes one string with the appropriate alias to stdout. Use $1 to pass the URL on the command line.

For example, this AliasProg command uses the 'replace' command from the MySQL distribution and replaces the URL substring http://www.apache.org/ with file:/usr/local/apache/htdocs/:


AliasProg  "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:/usr/local/apache/htdocs/"

You may also write your own very complex program to process URLs.

3.4.2.6. ReverseAlias command

The ReverseAlias indexer.conf command allows URL mapping before a URL is inserted into the database. Unlike the Alias command, which triggers mapping right before a document is downloaded, the ReverseAlias command triggers mapping after the link is found.


ReverseAlias http://name2/   http://name2.yourname.com/
Server       http://name2.yourname.com/

All links with the short server name will be mapped to links with the full server name before they are inserted into the database.

One possible use is cutting out unnecessary strings such as PHPSESSION=XXXX.

E.g. cutting it from a URL like http://www/a.php?PHPSESSION=XXX, where PHPSESSION is the only parameter. The question mark is deleted as well:


ReverseAlias regex  (http://[^?]*)[?]PHPSESSION=[^&]*$          $1

Cutting it from a URL like http://www/a.php?PHPSESSION=xxx&.., i.e. when PHPSESSION is the first parameter but other parameters follow it. The '&' sign after PHPSESSION is deleted as well. The question mark is not deleted:


ReverseAlias regex  (http://[^?]*[?])PHPSESSION=[^&]*&(.*)      $1$2

Cutting from URL like http://www/a.php?a=b&PHPSESSION=xxx or http://www/a.php?a=b&PHPSESSION=xxx&c=d, where PHPSESSION is not the first parameter. The '&' sign before PHPSESSION is deleted:


ReverseAlias regex  (http://.*)&PHPSESSION=[^&]*(.*)         $1$2
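
The three PHPSESSION rules can be tried out in Python. This sketch assumes the rules are tested in order and that the first matching one rewrites the URL; it uses Python's regex dialect and only references groups that actually exist in each pattern. RULES and strip_session are hypothetical names.

```python
import re

# (pattern, replacement template) pairs in the order of the ReverseAlias examples.
RULES = [
    (r"(http://[^?]*)[?]PHPSESSION=[^&]*$", "$1"),        # only parameter
    (r"(http://[^?]*[?])PHPSESSION=[^&]*&(.*)", "$1$2"),  # first of several
    (r"(http://.*)&PHPSESSION=[^&]*(.*)", "$1$2"),        # not the first
]


def strip_session(url: str) -> str:
    """Return the URL with the PHPSESSION parameter removed, if present."""
    for pattern, template in RULES:
        m = re.match(pattern, url)
        if m:
            # Expand $n references with the captured groups of this match.
            return re.sub(r"\$(\d)", lambda ref: m.group(int(ref.group(1))), template)
    return url
```

Under these assumptions, http://www/a.php?a=b&PHPSESSION=xxx&c=d becomes http://www/a.php?a=b&c=d, and a URL without PHPSESSION is returned unchanged.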

3.4.2.7. Alias in search.htm search template

It is also possible to define aliases in search template (search.htm). The Alias command in search.htm is identical to the one in indexer.conf, however it is active during searching, not indexing.

The syntax of the search.htm Alias command is the same as in indexer.conf:


Alias <find-prefix> <replace-prefix>

For example, there is the following command in search.htm:


Alias http://localhost/ http://www.mnogo.ru/

Search returned a page with the following URL:


http://localhost/news/article10.html

As a result, the $(DU) variable will be replaced NOT with this URL:


http://localhost/news/article10.html

but with the following URL (the result of processing with Alias):


http://www.mnogo.ru/news/article10.html

3.4.3. ServerTable

DataparkSearch has a ServerTable indexer.conf command. It allows loading the servers and filters configuration from an SQL table.

3.4.3.1. Loading servers table

When ServerTable mysql://user:pass@host/dbname/tablename[?srvinfo=infotablename] is specified, indexer will load server information from the given tablename SQL table and server parameters from the given infotablename SQL table. If the srvinfo parameter is not specified, parameters will be loaded from the srvinfo table. Check the structure of the server and srvinfo tables in the create/mysql/create.txt file. If there is no structure example for your database, use the MySQL one as an example.

You may use several ServerTable commands to load server information from different tables.

3.4.3.2. Server table structure

The servers table consists of all the fields necessary to describe server parameters. Field names have corresponding indexer.conf commands. For example, the "period" field corresponds to the "Period" indexer.conf command. Default field values are the same as the default indexer.conf parameters.

The "gindex" field corresponds to the "Index" command. The name is slightly changed to avoid using an SQL reserved word.

See Section 9.3 for a description of several fields.

Note: Only those servers are read from the table where the "active" field has the value 1 and the "parent" field has the value 0. This is useful for letting users submit new URLs into the servers table while giving the administrator the possibility to approve added URLs.

3.4.4. FlushServerTable

Sets server.active to inactive for all server table records. Use this command to deactivate all commands in the server table before loading new ones from indexer.conf or from another server table.

3.4.5. External parsers

DataparkSearch indexer can use external parsers to index various file types (mime types).

A parser is an executable program that converts one of the mime types to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII to stdout.

3.4.5.1. Supported parser types

Indexer supports four types of parsers that can:

  • read data from stdin and send result to stdout

  • read data from file and send result to stdout

  • read data from file and send result to file

  • read data from stdin and send result to file

3.4.5.2. Setting up parsers

  1. Configure mime types

    Configure your web server to send the appropriate "Content-Type" header. For Apache, have a look at the mime.types file; most mime types are already defined there.

    If you want to index local files or files served via FTP, use the "AddType" command in indexer.conf to associate file name extensions with their mime types. For example:

    
    AddType text/html *.html
    

  2. Add parsers

    Add lines with parser definitions. Lines have the following format with three arguments:

    
    Mime <from_mime> <to_mime> <command line>
    

    For example, the following line defines a parser for man pages:

    
    # Use deroff for parsing man pages ( *.man )
    Mime  application/x-troff-man   text/plain   deroff
    

    This parser will take data from stdin and output result to stdout.

    Many parsers cannot operate on stdin and require a file to read from. In this case indexer creates a temporary file in /tmp and removes it when the parser exits. Use the $1 macro in the parser command line to substitute the file name. For example, the Mime command for the "catdoc" MS Word to ASCII converter may look like this:

    
    Mime application/msword text/plain "/usr/bin/catdoc -a $1"
    

    If your parser writes its result into an output file, use the $2 macro. indexer will replace $2 with a temporary file name, start the parser, read the result from this temporary file and then remove it. For example:

    
    Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
    

    The parser above will read data from the first temporary file and write the result to the second one. Both temporary files will be removed when the parser exits. Note that the result of using this parser will be exactly the same as with the previous one, but they use different execution modes: file->stdout and file->file respectively.
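
How the $1 and $2 macros select the execution mode can be sketched as follows. The logic is inferred from the description above rather than taken from the indexer source, and parser_mode and expand_macros are hypothetical helpers.

```python
def parser_mode(command: str) -> str:
    """Classify a Mime command line by the macros it contains."""
    source = "file" if "$1" in command else "stdin"    # $1: input temporary file
    target = "file" if "$2" in command else "stdout"   # $2: output temporary file
    return f"{source}->{target}"


def expand_macros(command: str, infile: str, outfile: str) -> str:
    """Substitute temporary file names for the $1 and $2 macros."""
    return command.replace("$1", infile).replace("$2", outfile)
```

For the examples in this section, "deroff" is stdin->stdout, "/usr/bin/catdoc -a $1" is file->stdout and "/usr/bin/catdoc -a $1 >$2" is file->file.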

3.4.5.3. Avoid indexer hang on parser execution

To avoid an indexer hang on parser execution, you may specify in your indexer.conf the amount of time in seconds allowed for parser execution with the ParserTimeOut command. For example:


ParserTimeOut 600

Default value is 300 seconds, i.e. 5 minutes.

3.4.5.4. Pipes in parser's command line

You can use pipes in parser's command line. For example, these lines will be useful to index gzipped man pages from local disk:


AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"

3.4.5.5. Charsets and parsers

Some parsers can produce output in a charset other than the one given in the LocalCharset command. Specify the charset to make indexer convert the parser's output to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:


Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"
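
Conceptually, the conversion amounts to decoding the parser output from its declared charset and re-encoding it in LocalCharset. The Python sketch below only illustrates that idea with the two charsets from the example; recode is a hypothetical name and says nothing about how indexer implements the conversion.

```python
def recode(parser_output: bytes, parser_charset: str, local_charset: str) -> bytes:
    """Convert parser output from the parser's charset to LocalCharset."""
    return parser_output.decode(parser_charset).encode(local_charset)


# A Russian word ("privet") as catdoc would emit it in windows-1251.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"
raw = word.encode("windows-1251")
converted = recode(raw, "windows-1251", "koi8-r")
```

The byte sequences differ between the two charsets, which is why the charset must be declared in the Mime command when it is not the same as LocalCharset.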

3.4.5.6. DPS_URL environment variable

When executing a parser, indexer creates the DPS_URL environment variable with the URL being processed as its value. You can use this variable in parser scripts.
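
A parser script written in Python could read the variable like this. The sketch assumes only what the text states, namely that DPS_URL is present in the parser's environment; the function name current_url and the empty-string fallback are illustrative choices.

```python
import os


def current_url(environ=os.environ) -> str:
    """Return the URL being processed, or an empty string when run outside indexer."""
    return environ.get("DPS_URL", "")
```

A parser might, for example, include the value of current_url() in the title of the HTML it produces.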

3.4.5.7. Some third-party parsers

  • RPM parser by Mario Lang

    /usr/local/bin/rpminfo:

    
    #!/bin/bash
    /usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE}
    (%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body>
    %{DESCRIPTION}\n</body></html>" -p "$1"
    

    indexer.conf:

    
Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"
    

    It renders RPM information like this:

    
    3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4]
           Mysql is a SQL (Structured Query Language) database server.
           Mysql was written by Michael (monty) Widenius. See the CREDITS
           file in the distribution for more credits for mysql and related
           things....
           (application/x-rpm) 2088855 bytes
    

  • catdoc MS Word to text converter

    indexer.conf:

    
    Mime application/msword         text/plain      "catdoc $1"
    

  • xls2csv MS Excel to text converter

    It is supplied with catdoc.

    indexer.conf:

    
    Mime application/vnd.ms-excel   text/plain      "xls2csv $1"
    

  • pdftotext Adobe PDF converter

    Supplied with xpdf project.

    indexer.conf:

    
    Mime application/pdf            text/plain      "pdftotext $1 -"
    

  • unrtf RTF to html converter

    indexer.conf:

    
    Mime text/rtf*        text/html  "/usr/local/dpsearch/sbin/unrtf --html $1"
    Mime application/rtf  text/html  "/usr/local/dpsearch/sbin/unrtf --html $1"
    

  • xlhtml XLS to html converter

    indexer.conf:

    
    Mime	application/vnd.ms-excel  text/html  "/usr/local/dpsearch/sbin/xlhtml $1"
    

  • ppthtml PowerPoint (PPT) to html converter. Part of xlhtml 0.5.

    indexer.conf:

    
    Mime	application/vnd.ms-powerpoint  text/html  "/usr/local/dpsearch/sbin/ppthtml $1"
    

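  • Using wvHtml (DOC to html).

    /usr/local/dpsearch/sbin/0wvHtml.pl:

    
    #!/usr/bin/perl -w
    
    # $ARGV[0] is the input file ($1), $ARGV[1] is the output file ($2).
    $p = $ARGV[1];
    $f = $ARGV[1];
    
    # Split the output file name into its directory ($p) and base name ($f).
    $p =~ s/(.*)\/([^\/]*)/$1\//;
    $f =~ s/(.*)\/([^\/]*)/$2/;
    
    system("/usr/local/bin/wvHtml --targetdir=$p $ARGV[0] $f");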
    

    indexer.conf:

    
    Mime  application/msword       text/html  "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"
    Mime  application/vnd.ms-word  text/html  "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"
    

  • swf2html from Flash Search Engine SDK

    indexer.conf:

    
    Mime  application/x-shockwave-flash  text/html  "/usr/local/dpsearch/sbin/swf2html $1"
    

  • djvutxt from djvuLibre

    indexer.conf:

    
    Mime  image/djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
    Mime  image/x.djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
    Mime  image/x-djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
    Mime  image/vnd.djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
    

3.4.6. Other commands used in indexer.conf

3.4.6.1. Include command

You may include another configuration file at any place in indexer.conf using the Include <filename> command. If <filename> starts with "/", it is treated as an absolute path:


Include /usr/local/dpsearch/etc/inc1.conf

Otherwise the path is relative:


Include inc1.conf

3.4.6.2. DBAddr command

The DBAddr command is a URL-style database description. It specifies the options (type, host, database name, port, user and password) for connecting to an SQL database. It should be used before any other commands. You may specify several DBAddr commands; in this case DataparkSearch will merge the results from every database specified. The command has global effect for the whole config file. Format:


DBAddr <DBType>:[//[DBUser[:DBPass]@]DBHost[:DBPort]]/DBName/[?[dbmode=mode]{&<parameter name>=<parameter value>}]

Note: ODBC related. Use DBName to specify the ODBC data source name (DSN). DBHost does not matter; use "localhost".

Note: Solid related. Use DBHost to specify the Solid server. DBName does not matter for Solid.

You may use CGI-like encoding for DBUser and DBPass if you need to use special characters in the user name or password. For example, if you have ABC@DEF as the password, you should write it as ABC%40DEF.

Currently supported DBType values are mysql, pgsql, msql, solid, mssql, oracle, ibase and sqlite. The value does not actually matter when native libraries are used, but ODBC users should specify one of the supported values. If your database type is not supported, you may use "unknown" instead.

MySQL and PostgreSQL users can specify the path to a Unix socket when connecting to localhost: mysql://foo:bar@localhost/dpsearch/?socket=/tmp/mysql.sock

If you are using PostgreSQL and do not specify hostname, e.g. pgsql://user:password@/dbname/ then PostgreSQL will not work via TCP, but will use default Unix socket.

dbmode parameter. You may also select the database mode of word storage. When "single" is specified, all words are stored in the same table (file). If "multi" is selected, words are located in different tables (files) depending on their lengths. The "multi" mode is usually faster but requires more tables (files). If the "crc" mode is selected, DataparkSearch stores 32-bit integer word IDs calculated by the HASH32 algorithm instead of the words themselves. This mode requires less disk space and is faster than the "single" and "multi" modes, but it does not support substring searches. "crc-multi" uses the same storage structure as the "crc" mode, but also stores words in different tables (files) depending on word length, like the "multi" mode. The default mode is "single".

stored parameter. Format: stored=StoredHost[:StoredPort]. This parameter specifies the host, and the port if given, where the stored daemon is running, if you plan to use document excerpts and cached copies.

cached parameter. Format: cached=CachedHost[:CachedPort]. Use cached at given host and port if specified. It is required for cache storage mode only (see Section 5.2). Each indexer will connect to cached on given address at startup.

charset parameter. Format: charset=DBCharacterSet. This parameter can be used to specify the database connection charset. The charset specified by DBCharacterSet should be the same as the charset specified by the LocalCharset command.

Example:


DBAddr          mysql://foo:bar@localhost/dpsearch/?dbmode=single
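
The CGI-like encoding of the user name and password can be produced with Python's urllib.parse.quote. The helper below is a hypothetical convenience for generating a DBAddr line and only covers the user, password, host, database name and query parameters from the format above.

```python
from urllib.parse import quote


def dbaddr(dbtype, user, password, host, dbname, **params):
    """Build a DBAddr line, percent-encoding special characters in user and password."""
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    query = "&".join(f"{key}={value}" for key, value in params.items())
    line = f"DBAddr {dbtype}://{auth}{host}/{dbname}/"
    return line + (f"?{query}" if query else "")


print(dbaddr("mysql", "foo", "ABC@DEF", "localhost", "dpsearch", dbmode="single"))
# DBAddr mysql://foo:ABC%40DEF@localhost/dpsearch/?dbmode=single
```

Encoding the "@" matters because an unescaped "@" in the password would be read as the separator between the credentials and the host.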

3.4.6.3. VarDir command

You may choose alternative working directory for cache mode:


VarDir /usr/local/dpsearch/var

3.4.6.4. NewsExtensions command

Whether to enable news extensions. Default value is no.


NewsExtensions yes

3.4.6.5. SyslogFacility command

This is used if DataparkSearch was compiled with syslog support and you do not like the default value. The argument is the same as used in the syslog.conf file. For the list of possible facilities see syslog.conf(5).


SyslogFacility local7

3.4.6.6. LocalCharset command

Defines the charset used to store data in the database. All other character sets will be recoded into the given charset. See Section 7.1 for a detailed explanation of how to choose a LocalCharset depending on the languages used on your site(s). This command should be used once and takes global effect for the config file. See the documentation for the whole list of supported charsets. The default LocalCharset is iso-8859-1 (latin1).


LocalCharset koi8-r

3.4.6.7. ForceIISCharset1251 command

This option is useful for users who deal with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend not to report the charset correctly. This is a really dirty hack: if this option is turned on, it is assumed that all servers that report themselves as 'Microsoft' or 'IIS' have content in the Windows-1251 charset. This command should be used only once in the configuration file and takes global effect. Default: no


ForceIISCharset1251 yes

3.4.6.8. StopwordFile command

Load stop words from the given text file. You may specify either absolute file name or a name relative to DataparkSearch /etc directory. You may use several StopwordFile commands.


StopwordFile stopwords/en.sl

3.4.6.9. LangMapFile command

Load language map for charset and language guesser from the given file. You may specify either absolute file name or a name relative to DataparkSearch /etc directory. You may use several LangMapFile commands.


LangMapFile langmap/en.ascii.lm

3.4.6.10. Word length commands

Word lengths. You may change default length range of words stored in database. By default, words with the length in the range from 1 to 32 are stored.


MinWordLength 1
MaxWordLength 32

3.4.6.11. MaxDocSize command

This command is used to specify the maximal document size. The default value is 1048576 (1 megabyte). It takes global effect for the whole config file.


MaxDocSize 1048576

3.4.6.12. MinDocSize command

Use this command to apply CheckOnly handling to URLs whose content size is less than the specified value. The default value is 0. It takes global effect for the whole config file.


MinDocSize 1024

3.4.6.13. IndexDocSizeLimit command

Use this command to specify the maximal amount of data stored in the index per document. The default value is 0, which means no limit. It takes effect until the next IndexDocSizeLimit command.


IndexDocSizeLimit 65536

3.4.6.14. URLSelectCacheSize command

Select number of targets to index at once. Default value is 1024.


URLSelectCacheSize 10240

3.4.6.15. URLDumpCacheSize command

Select at once this number of urls to write cache mode indexes, to preload url data or to calculate the Popularity Rank. Default value is 100000.


URLDumpCacheSize 10240

3.4.6.16. HTTPHeader command

You may add your own headers to the indexer HTTP request. You should not use the "If-Modified-Since" and "Accept-Charset" headers; these are composed by indexer itself. "User-Agent: DataparkSearch/version" is sent too, but you may override it. The command has global effect for the whole configuration file.


HTTPHeader "User-Agent: My_Own_Agent"
HTTPHeader "Accept-Language: ru, en"
HTTPHeader "From: webmaster@mysite.com"

3.4.6.17. Allow command


Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to allow URLs that match (or do not match) the given arguments. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String. Use NoCase or Case to choose case insensitive or case sensitive comparison. Use Regex to choose regular expression comparison. Use String to choose string comparison with wildcards. Wildcards are '*' for any number of characters and '?' for one character. Because '?' and '*' have a special meaning in the String match type, use Regex to describe documents with '?' and '*' signs in the URL. String match is much faster than Regex, so use String where possible.

You may use several arguments for one Allow command, and you may use this command any number of times. Takes global effect for the config file. Note that DataparkSearch automatically adds one "Allow regex .*" command after reading the config file. This means that everything that is not disallowed is allowed.

Examples


#  Allow everything:
Allow *
#  Allow everything but .php .cgi .pl extensions case insensitively using regex:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$
#  Allow .HTM extension case sensitively:
Allow Case *.HTM

3.4.6.18. Disallow command


Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to disallow URLs that match (or do not match) the given arguments. The meaning of the first three optional parameters is exactly the same as for the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file. Examples:


# Disallow URLs that are not in udm.net domains using "string" match:
Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with '?' sign in URL. Note that '?' sign has a 
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex  \?

3.4.6.19. CheckOnly command


CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Indexer will use the HTTP HEAD method instead of GET for URLs that match (or do not match) the given patterns. This means that the file will only be checked for existence and will not be downloaded. This is useful for zip, exe, arj and other binary files. Note that you can also exclude those files with the Disallow command. You may use several arguments for one CheckOnly command. It is also useful, for example, for searching through URL names rather than contents (a la FTP-search). Takes global effect for the config file. Examples:


# Check some known non-text extensions using "string" match:
CheckOnly *.b	  *.sh   *.md5
# or check ANY except known text extensions using "regex" match:
CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$

3.4.6.20. HrefOnly command


HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Use this command to scan HTML pages for the "href" attributes of tags, without indexing the page contents, for URLs that match (or do not match) the given arguments. The command has global effect for the whole configuration file. For example, when indexing large mailing list archives, the index and thread index pages (like mail.10.html, thread.21.html, etc.) should be scanned for links but should not be indexed:


HrefOnly */mail*.html */thread*.html

3.4.6.21. CheckMp3 command


CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will download only a small part of the document and try to find MP3 tags in it. On success, indexer will parse the MP3 tags; otherwise it will download the whole document and parse it as usual. Note: this works only with servers that support the HTTP/1.1 protocol, because the "Range: bytes" header is used to download the MP3 tag.


CheckMp3 *.bin *.mp3

3.4.6.22. CheckMp3Only command


CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will, as with the CheckMP3 command, download only a small part of the document and try to find MP3 tags. On success, indexer will parse the MP3 tags; otherwise it will NOT download the whole document.


CheckMP3Only *.bin *.mp3

3.4.6.23. IndexIf command


IndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to allow indexing if the value of section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.4.6.17).

Example


IndexIf regex Title Manual
IndexIf body "*important detail*"

3.4.6.24. NoIndexIf command


NoIndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to disallow indexing if the value of section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.4.6.17).

Example


NoIndexIf regex Title Sex
NoIndexIf body *xxx*

3.4.6.25. HoldBadHrefs command


HoldBadHrefs <time>

Specifies how long to hold URLs with an erroneous status before deleting them from the database. For example, if a host is down, indexer will not delete pages from this site immediately, and search will use the previous content of these pages. However, if the site does not respond for a month, it is probably time to remove these pages from the database. For the <time> format, see the description of the Period command in Section 3.4.6.30.


HoldBadHrefs 30d

3.4.6.26. DeleteOlder command


DeleteOlder <time>

Specifies how long to hold URLs before deleting them from the database. For example, when indexing news sites, you may automatically delete old news articles after a specified period. For the <time> format, see the description of the Period command in Section 3.4.6.30. Default value is 0, which means "do not check". You may specify several DeleteOlder commands, for example, one for every Server command.


DeleteOlder 7d
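
As a sketch of the "one for every Server command" usage mentioned above, the fragment below gives two servers different retention settings. The host names are hypothetical, and it is an assumption that each DeleteOlder value applies to the Server commands that follow it:

```
# Drop news articles older than a week:
DeleteOlder 7d
Server http://news.example.com/
# Do not check document age for the main site:
DeleteOlder 0
Server http://www.example.com/
```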

3.4.6.27. UseRemoteContentType command


UseRemoteContentType yes/no

This command specifies whether indexer should get the content type from HTTP server headers (yes) or from its AddType settings (no). If set to 'no' and indexer cannot determine the content type using its AddType settings, it will use the HTTP header. Default: yes


UseRemoteContentType yes

3.4.6.28. AddType command


AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>...]

This command associates filename extensions (for services that don't automatically include them) with their MIME types. Currently the "file:" protocol uses these commands. Use the optional first two parameters to choose the comparison type. Default type is "String" "Case" (case insensitive string match with '?' and '*' wildcards for one and several characters respectively).


AddType image/x-xpixmap	*.xpm

3.4.6.29. ParserTimeOut command

Use the ParserTimeOut command to limit the amount of time allowed for parser execution, to avoid a possible indexer hang.


ParserTimeOut 300

3.4.6.30. Period command


Period <time>

Set the reindex period. <time> is in the form 'xxxA[yyyB[zzzC]]' (spaces are allowed between xxx and A, between A and yyy, and so on), where xxx, yyy, zzz are numbers (which can be negative) and A, B, C can be one of the following:

 s - second
 M - minute
 h - hour
 d - day
 m - month
 y - year

These letters are the same as in the strptime/strftime functions. Examples:


 15s - 15 seconds
 4h30M - 4 hours and 30 minutes
 1y6m-15d - 1 year and six months minus 15 days
 1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, the time is assumed to be given in seconds. Can be set many times before a Server command and takes effect until the end of the config file or until the next Period command.


Period 7d
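
To illustrate the scoping rule, the following sketch sets different reindex periods for two servers. The host names are hypothetical and not taken from this manual:

```
# Reindex the news site every 12 hours:
Period 12h
Server http://news.example.com/
# Reindex everything declared below this point once a month:
Period 1m
Server http://archive.example.com/
```

Note that the lowercase m stands for month; minutes are written with an uppercase M.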

3.4.6.31. PeriodByHops command


PeriodByHops <hops> [ <time> ]

Set the reindex period on a per-<hops> basis. The format for <time> is the same as for Period.

Can be set many times before a Server command and takes effect until the end of the config file or until the next PeriodByHops command with the same <hops> value. If the <time> parameter is omitted, the previously defined value is undefined.

If no PeriodByHops command is specified for a given <hops> value, the value defined by the Period command is used.
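
A minimal sketch of how PeriodByHops might be combined with Period. The host names are hypothetical, and it is an assumption that hop 0 refers to the start URL itself:

```
# General reindex period:
Period 30d
# Documents at 0 and 1 hops are reindexed more often:
PeriodByHops 0 1d
PeriodByHops 1 7d
Server http://www.example.com/
# Undefine the value for hop 1; from here on Period applies to it again:
PeriodByHops 1
Server http://other.example.com/
```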

3.4.6.32. ExpireAt command


ExpireAt [ A [ B [ C [ D [ E ]]]]]

This command allows you to specify the exact expiration time for documents. It may be specified on a per Server/Realm basis and takes effect until the end of the config file or until the next ExpireAt command. ExpireAt specified without any arguments disables the previously specified value. The arguments are:

 A - minute, may be * or 0-59
 B - hour, may be * or 0-23
 C - day of month, may be * or 1-31
 D - month, may be * or 1-12
 E - day of week, may be * or 0-6, where 0 is Sunday

The ExpireAt command has higher priority than the Period or PeriodByHops commands.
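
For illustration, a sketch that uses all five fields. The host names are hypothetical, and the comments give a reading of each line based only on the field description above:

```
# Expire documents at 02:30 every day:
ExpireAt 30 2 * * *
Server http://www.example.com/
# Expire documents at 00:00 on Sundays:
ExpireAt 0 0 * * 0
Server http://weekly.example.com/
# Disable the previously specified value:
ExpireAt
```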

3.4.6.33. Tag command


Tag <string>

Use this field for your own purposes, for example, for grouping several servers into one group. During search you will be able to limit the URLs searched by their tags. Can be set multiple times before a Server command and takes effect until the end of the config file or until the next Tag command. Default value is an empty string.
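
A short sketch of grouping servers by tag. The tag values and host names are hypothetical:

```
# Both servers below receive the tag "docs":
Tag docs
Server http://docs.example.com/
Server http://manual.example.com/
# Servers from here on receive the tag "shop":
Tag shop
Server http://shop.example.com/
```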

3.4.6.34. TagIf command


TagIf <tag> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Mark the document with the <tag> tag if the value of section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.4.6.17).

Example


TagIf Docs regex Title Manual

3.4.6.35. Category command


Category <string>

You may distribute documents between nested categories. A category is a string in hex number notation. You may have up to 6 levels with 256 members per level. An empty category means the root of the category tree. See Section 6.2 for more information.


# This command means a category on first level:
Category AA
# This command means a category on 5th level:
Category FFAABBCCDD

3.4.6.36. CategoryIf command


CategoryIf <category> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Mark the document with the <category> category if the value of section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.4.6.17).

Example


CategoryIf 010F regex Title "JOB ID"

3.4.6.37. DefaultLang command


DefaultLang <string>

Default language for the server. Can be used if you need language restriction while searching.


DefaultLang en

3.4.6.38. MaxHops command


MaxHops <number>

Maximum distance in "mouse clicks" from the start URL. Default value is 256. Can be set multiple times before a Server command and takes effect until the end of the config file or until the next MaxHops command.


MaxHops 256

3.4.6.39. TrackHops command


TrackHops yes|no

This command enables or disables hops tracking during reindexing. Default value is no. If enabled, the hops value of a URL is recalculated when reindexing. Otherwise, the hops value is calculated only once, when the URL is inserted into the database.


TrackHops yes

3.4.6.40. MaxDocsPerServer command


MaxDocsPerServer <number>

Limits the number of documents retrieved from a server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of pages will be indexed from one server during this run of indexer. Can be set multiple times before a Server command and takes effect until the end of the config file or until the next MaxDocsPerServer command.


MaxDocsPerServer 100

3.4.6.41. MaxNetErrors command


MaxNetErrors <number>

Maximum number of network errors for each server. Default value is 16. Use 0 for an unlimited number of errors. If there are too many network errors on some server (server is down, host unreachable, etc.), indexer will make no more than 'number' attempts to connect to this server. Takes effect until the end of the config file or until the next MaxNetErrors command.


MaxNetErrors 16

3.4.6.42. ReadTimeOut command


ReadTimeOut <time>

Connect timeout and stalled connections timeout. For the <time> format, see Section 3.4.6.30. Default value is 30 seconds. Can be set any number of times before a Server command and takes effect until the end of the config file or until the next ReadTimeOut command.


ReadTimeOut 30s

3.4.6.43. DocTimeOut command


DocTimeOut <time>

Maximum amount of time indexer spends downloading one document. For the <time> format, see Section 3.4.6.30. Default value is 90 seconds. Can be set any number of times before a Server command and takes effect until the end of the config file or until the next DocTimeOut command.


DocTimeOut 1M30s

3.4.6.44. NetErrorDelayTime command


NetErrorDelayTime <time>

Specifies the document processing delay time if a network error has occurred. For the <time> format, see Section 3.4.6.30. Default value is one day.


NetErrorDelayTime 1d

3.4.6.45. Cookies command


Cookies yes/no

Enables or disables support for HTTP cookies. The command may be used several times before a Server command and takes effect until the end of the config file or until the next Cookies command. Default value is "no".


Cookies yes

3.4.6.46. Robots command


Robots yes/no

Allows or disallows the use of robots.txt and <META NAME="robots" ...> exclusions. Use no, for example, for link validation of your own server(s). The command may be used several times before a Server command and takes effect until the end of the config file or until the next Robots command. Default value is "yes".


Robots yes

3.4.6.47. RobotsPeriod command

By default, robots.txt data is held in the SQL database for one week. You may change this period using the RobotsPeriod command:


RobotsPeriod <time>

For the <time> format, see the description of the Period command in Section 3.4.6.30.

RobotsPeriod 30d

3.4.6.48. DetectClones command


DetectClones yes/no

Allows or disallows clone detection and elimination. If allowed, indexer will detect identical documents at different locations, such as mirrors, and will index only one document from each group of identical documents. "DetectClones yes" also reduces space usage. Default value is "yes".


DetectClones no

3.4.6.49. Section command


Section <string> <number> <maxlen> [ <pattern> <replacement> ]

Here <string> is a section name and <number> is a section ID between 0 and 255. Use 0 if you do not want to index a section. It is better to use different IDs for different sections; in that case, at search time you will be able to give a different weight to each section or even exclude some sections. The <maxlen> argument is the maximum length of the section that will be stored in the database. Use 0 for <maxlen> if you do not want to store this section. <pattern> and <replacement> are a regex-like pattern and replacement used to extract the section value from the document content.


# Standard HTML sections: body, title
Section	body			1	256
Section title			2	128
Section GoodName                3       128 "<h1>([^<]*)</h1>" "<b>GoodName:</b> $1"

3.4.6.50. HrefSection command


HrefSection <string> [ <pattern> <replacement> ]

Here <string> is a section name, and <pattern> and <replacement> are a regex-like pattern and replacement used to extract the section value from the document content. Use this command to extract links from the document content.


# Standard HTML sections: body, title
HrefSection	link
HrefSection     NewLink "<newlink>([^<]*)</newlink>" "$1"

3.4.6.51. Index command


Index yes/no

Use "Index no" to prevent indexer from storing words in the database. This is useful, for example, for link validation. Can be set multiple times before a Server command and takes effect until the end of the config file or until the next Index command. Default value is "yes".


Index no
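
Since both this command and the Robots command are described as useful for link validation, a link-checking configuration might be sketched as follows. The host name is hypothetical, and combining the two commands this way is an assumption rather than a documented recipe:

```
# Link validation only: ignore robots exclusions and store no words
Robots no
Index no
Server http://www.example.com/
```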

3.4.6.52. RemoteCharset command


RemoteCharset <charset>

<charset> is the default character set for the server in the following Server, Realm or Subnet command(s). This is required only for "bad" servers that do not send charset information in the header ("Content-type: text/html; charset=some_charset") and do not have a <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=some_charset"> tag. Can be set before every Server, Realm or Subnet command and takes effect until the end of the config file or until the next RemoteCharset command. Default value is iso-8859-1 (latin1).


RemoteCharset iso-8859-5

3.4.6.53. URLCharset command


URLCharset <charset>

<charset> is the character set for the URL argument in the following Server, Realm or URL command(s). This command specifies the character set only for the arguments of the commands that follow and has no effect on charset detection for indexed pages. It has lower priority than RemoteCharset. Can be set before every Server, Realm or URL command and takes effect until the end of the config file or until the next URLCharset command. Default value is ISO-8859-1 (latin1).


URLCharset KOI8-R

3.4.6.54. ProxyAuthBasic command


ProxyAuthBasic login:passwd

Use HTTP proxy basic authorization. Can be used before every Server command and takes effect only for the next Server command. It should also be placed before the Proxy command. Example:


ProxyAuthBasic somebody:something  
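
The ordering described above (ProxyAuthBasic first, then Proxy, then the Server command it applies to) can be sketched as follows. The target host name is hypothetical; the proxy host is the placeholder used in the Proxy command syntax:

```
ProxyAuthBasic somebody:something
Proxy your.proxy.host:3128
Server http://www.example.com/
```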

3.4.6.55. Proxy command


Proxy your.proxy.host[:port]

Use a proxy rather than connecting directly. One can index FTP servers when using a proxy. The default port, if not specified, is 3128 (Squid). If the proxy host is not specified, a direct connection will be used. Can be set before every Server command and takes effect until the end of the config file or until the next Proxy command. If no Proxy command is specified, indexer will use a direct connection. Examples:


#           Proxy on atoll.anywhere.com, port 3128:
Proxy atoll.anywhere.com
#           Proxy on lota.anywhere.com, port 8090:
Proxy lota.anywhere.com:8090
#           Disable proxy (direct connect):
Proxy

3.4.6.56. AuthBasic command


AuthBasic login:passwd

Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command. Examples:


AuthBasic somebody:something  

# If you have password protected directories, but the whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/

3.4.6.57. ServerWeight command


ServerWeight <number>

Server weight for Popularity Rank calculation (see Section 8.5.3). Default value is 1.


ServerWeight 1

3.4.6.58. OptimizeAtUpdate command


OptimizeAtUpdate yes

Specifies the word index optimization strategy. Default value: no. If enabled, this saves disk space but slows down indexing. May be placed in indexer.conf and cached.conf.

3.4.6.59. SkipUnreferred command


SkipUnreferred yes

Default value: no. Use this command to skip reindexing of unreferred documents. An unreferred document is a document with no links to it. This command requires links collection to be enabled (see Section 8.5.3).

3.4.6.60. Bind command


Bind 127.0.0.1

You may use this command to specify the local IP address if your system has several network interfaces.

3.4.6.61. URL command


URL http://localhost/path/to/page.html

This command inserts the given URL into the database. This is useful for adding several entry points to one server. Has no effect if the URL is already in the database.
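
A sketch of adding several entry points to one server. The host name and paths are hypothetical, and placing the URL commands after the matching Server command is an assumption:

```
Server http://www.example.com/
# Additional entry points on the same server:
URL http://www.example.com/archive/index.html
URL http://www.example.com/products/index.html
```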

3.4.6.62. ServerDB, RealmDB, SubnetDB and URLDB commands


URLDB pgsql://foo:bar@localhost/portal/links?field=url

These commands are equivalent to the Server, Realm, Subnet and URL commands respectively, but take their arguments from the specified field of an SQL table. In the example above, URLs are taken from the database portal, the SQL table links and the field url.