This document describes how to use the API of Hyper Estraier. If you have not read the user's guide yet, please do so first.
The API makes it possible to meet many requirements that cannot be met with estcmd and estseek.cgi alone. While estcmd handles documents as files, the library lets an application treat records in a relational database as documents. While estseek.cgi is accessed with a web browser, the library lets an application provide a GUI based on the native OS.
The core API of Hyper Estraier provides functions to manage the inverted index only. That is, retrieving documents and processing them is left to the application, as is displaying the search result. Consequently, Hyper Estraier does not depend on any document repository, file format, or user interface. The author of the application can choose them.
Hyper Estraier handles text as Unicode (UCS-2) and presents it as UTF-8, so most of the world's languages are supported. Moreover, because search keys are extracted from the body text by the N-gram method, Hyper Estraier does not depend on any vocabulary.
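To make the N-gram idea concrete, the following is a minimal sketch that splits UTF-8 text into overlapping two-character keys (bigrams). It is only an illustration of the general technique, not the tokenizer that Hyper Estraier actually uses; the names `utf8_len', `make_bigrams', and `free_grams' are hypothetical helpers and are not part of the API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte length of the UTF-8 character starting at s.
   Malformed sequences are treated as single bytes so the scan always advances. */
static size_t utf8_len(const unsigned char *s){
  if(s[0] < 0x80) return 1;
  if((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) return 2;
  if((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
     (s[2] & 0xC0) == 0x80) return 3;
  if((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
     (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) return 4;
  return 1;
}

/* Split text into overlapping character bigrams (hypothetical helper).
   Returns a malloc'ed array of malloc'ed strings and stores the count in *nump.
   Returns NULL if the array itself cannot be allocated. */
char **make_bigrams(const char *text, int *nump){
  const unsigned char *p = (const unsigned char *)text;
  size_t len = strlen(text);
  char **grams = malloc((len + 1) * sizeof(*grams));
  int num = 0;
  *nump = 0;
  if(!grams) return NULL;
  while(*p != '\0'){
    size_t a = utf8_len(p);
    size_t b;
    if(p[a] == '\0') break;            /* no following character left */
    b = utf8_len(p + a);
    grams[num] = malloc(a + b + 1);
    if(!grams[num]) break;
    memcpy(grams[num], p, a + b);      /* copy two adjacent characters */
    grams[num][a + b] = '\0';
    num++;
    p += a;                            /* slide the window by one character */
  }
  *nump = num;
  return grams;
}

/* Release the array returned by make_bigrams. */
void free_grams(char **grams, int num){
  int i;
  for(i = 0; i < num; i++) free(grams[i]);
  free(grams);
}
```

For example, calling make_bigrams("abcd", &n) yields the three keys "ab", "bc", and "cd". Because the keys come from adjacent characters rather than from a dictionary, the same procedure works for any language.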
One of the characteristics of Hyper Estraier is high scalability, so the author of an application does not need to worry about scalability as long as the application uses the API of Hyper Estraier.
While this document describes the core API, Hyper Estraier also provides the node API, which is based on a P2P architecture. Refer to the P2P Guide for the node API.
This section describes the architecture of the core API of Hyper Estraier.
The term gatherer means the functions that register documents in the index. A gatherer is implemented in the application. For example, estcmd has functions to collect documents by scanning the file system. A gatherer performs the following procedures.
The term filter means the functions that extract attributes and body text from a file. A filter is implemented in the application. It can be the application's own implementation, it can be built on an existing library, or it can be realized by calling an external command.
The term searcher means the functions that search the index for documents matching conditions specified by users. A searcher is implemented in the application. For example, estseek.cgi is called as a CGI script by the web server and has functions to display the search result as HTML. A searcher performs the following procedures.
A snippet of the body text makes the search result easier to grasp. While the API provides a function to create a snippet, the application can implement its own function.
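As an illustration of what an application's own snippet function might look like, the following is a minimal sketch that cuts out the text around the first occurrence of a search word. It is independent of `est_doc_make_snippet' and does not reproduce its output format; the name `make_simple_snippet' and its parameters are hypothetical, and matching is a plain case-sensitive byte comparison.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed snippet of `text' around the first
   occurrence of `word', keeping up to `width' bytes of context on each side.
   If the word does not occur, the head of the text is returned instead.
   Truncated sides are marked with "...". Returns NULL on allocation failure. */
char *make_simple_snippet(const char *text, const char *word, size_t width){
  size_t tlen = strlen(text);
  size_t wlen = strlen(word);
  const char *hit = wlen > 0 ? strstr(text, word) : NULL;
  size_t start, end, len;
  char *snip, *wp;
  if(hit){
    size_t pos = (size_t)(hit - text);
    start = pos > width ? pos - width : 0;
    end = pos + wlen + width;
  } else {
    start = 0;
    end = 2 * width;
  }
  if(end > tlen) end = tlen;
  /* do not cut a UTF-8 character in half: skip continuation bytes (10xxxxxx) */
  while(start < end && ((unsigned char)text[start] & 0xC0) == 0x80) start++;
  while(end < tlen && ((unsigned char)text[end] & 0xC0) == 0x80) end++;
  len = end - start;
  snip = malloc(len + 7);   /* room for "..." on both sides and the terminator */
  if(!snip) return NULL;
  wp = snip;
  if(start > 0){ memcpy(wp, "...", 3); wp += 3; }
  memcpy(wp, text + start, len);
  wp += len;
  if(end < tlen){ memcpy(wp, "...", 3); wp += 3; }
  *wp = '\0';
  return snip;
}
```

A searcher could pass the concatenated body text of each result document to such a function and print the returned string, freeing it with free afterwards. A real implementation would probably also handle several words and highlight the matches.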
The following is an illustration of a typical architecture of an application of Hyper Estraier. As it is no more than a concept, you can design your own architecture.
As Hyper Estraier provides the API for the C language, an application is implemented in C or C++. This section describes how to build the application with the library of Hyper Estraier.
In each source file of an application of the core API, include `estraier.h', `cabin.h', and `stdlib.h'. `estraier.h' is a header file of Hyper Estraier. `cabin.h' is a header file of QDBM. See the document of QDBM for the functions provided by `cabin.h'.
#include <estraier.h>
#include <cabin.h>
#include <stdlib.h>
By default, headers of Hyper Estraier are installed in "/usr/local/include", and libraries are installed in "/usr/local/lib". The underlying libraries required besides `-lestraier' are `-lresolv', `-lnsl', `-lpthread', `-lqdbm', `-lz', `-liconv', `-lm', and `-lc'. That is, run the following command to build an application.
gcc -I/usr/local/include -o foobar foobar.c \
  -L/usr/local/lib -lestraier -lresolv -lnsl -lpthread -lqdbm -lz -liconv -lm -lc
However, the above does not work if Hyper Estraier is installed in a different location. To improve maintainability, it is recommended to use estconfig as follows.
gcc `estconfig --cflags` -o foobar foobar.c `estconfig --ldflags` `estconfig --libs`
estconfig is useful for integrating an application or a system with Hyper Estraier. It outputs the settings and configuration of Hyper Estraier.
estconfig always returns 0 as the exit status.
The API for documents aims to handle documents which were registered into the index.
The structure type `ESTDOC' is an abstraction of a document. A document is composed of some attributes and some text sentences. No entity of `ESTDOC' is accessed directly; it is accessed through a pointer. The term document object means the pointer and its referent. A document object is created by the function `est_doc_new' and destroyed by `est_doc_delete'. Every created document object should be destroyed.
Documents to be searched must be registered in the database beforehand. An ID is assigned to each registered document. When searching, documents can be retrieved from the database by their ID. The encoding of attributes and text sentences should be UTF-8.
The following is a typical use case of a document object.
ESTDOC *doc;

/* create a document object */
doc = est_doc_new();

/* add the URI and the title as attributes */
est_doc_add_attr(doc, "@uri", "http://foo.bar/baz.txt");
est_doc_add_attr(doc, "@title", "Now Scream");

/* add text sentences */
est_doc_add_text(doc, "Give it up, Yo! Give it up, Yo!");
est_doc_add_text(doc, "Check it out, come on!");

/* register the object or display it here */

/* destroy the object */
est_doc_delete(doc);
The function `est_doc_new' is used in order to create a document object.
The function `est_doc_new_from_draft' is used in order to create a document object made from draft data.
The function `est_doc_delete' is used in order to destroy a document object.
The function `est_doc_add_attr' is used in order to add an attribute to a document object.
The function `est_doc_add_text' is used in order to add a sentence of text to a document object.
The function `est_doc_add_hidden_text' is used in order to add a hidden sentence to a document object.
The function `est_doc_id' is used in order to get the ID number of a document object.
The function `est_doc_attr_names' is used in order to get a list of attribute names of a document object.
The function `est_doc_attr' is used in order to get the value of an attribute of a document object.
The function `est_doc_texts' is used in order to get a list of sentences of the text of a document object.
The function `est_doc_cat_texts' is used in order to concatenate sentences of the text of a document object.
The function `est_doc_dump_draft' is used in order to dump draft data of a document object.
The function `est_doc_make_snippet' is used in order to make a snippet of the body text of a document object.
The API for search conditions aims to specify search conditions given to the index.
The structure type `ESTCOND' is an abstraction of search conditions. A unit of search conditions is composed of one search phrase, some attribute expressions, and one order expression. No entity of `ESTCOND' is accessed directly; it is accessed through a pointer. The term condition object means the pointer and its referent. A condition object is created by the function `est_cond_new' and destroyed by `est_cond_delete'. Every created condition object should be destroyed.
A condition object is passed as a parameter when searching for documents registered in the database, and a list of the IDs of the corresponding documents is returned. See the manual for the formats of expressions. The encoding of conditional expressions should be UTF-8.
The following is a typical use case of a condition object.
ESTCOND *cond;

/* create a condition object */
cond = est_cond_new();

/* set the search phrase */
est_cond_set_phrase(cond, "check AND out");

/* set the attribute expression */
est_cond_add_attr(cond, "@uri ISTREW .txt");

/* search the database here */

/* destroy the object */
est_cond_delete(cond);
The function `est_cond_new' is used in order to create a condition object.
The function `est_cond_delete' is used in order to destroy a condition object.
The function `est_cond_set_phrase' is used in order to set the search phrase to a condition object.
The function `est_cond_add_attr' is used in order to add an expression for an attribute to a condition object.
The function `est_cond_set_order' is used in order to set the order of a condition object.
The function `est_cond_set_max' is used in order to set the maximum number of retrieval of a condition object.
The function `est_cond_set_options' is used in order to set options of retrieval of a condition object.
The API for database aims to handle the database of the index.
The structure type `ESTDB' is an abstraction of access methods to a database. A database has an inverted index, document data, and meta data. Either writer mode or reader mode is selected when the connection is established. No entity of `ESTDB' is accessed directly; it is accessed through a pointer. The term database object means the pointer and its referent. A database object is created by the function `est_db_open' and destroyed by `est_db_close'. Every created database object should be destroyed.
Errors in operations are reported by the function `est_db_error'. The meaning of each error code can be obtained as a string with the function `est_err_msg'.
The following is a typical use case of a database object.
ESTDB *db;
int ecode;

/* create a database object as a writer */
if(!(db = est_db_open("casket", ESTDBWRITER | ESTDBCREAT, &ecode))){
  /* if failure, return after displaying the error message */
  fprintf(stderr, "error: %s\n", est_err_msg(ecode));
  return -1;
}

/* register documents or search for documents here */

/* destroy the object */
if(!est_db_close(db, &ecode)){
  /* if failure, return after displaying the error message */
  fprintf(stderr, "error: %s\n", est_err_msg(ecode));
  return -1;
}
The following constants are defined for error codes.
The function `est_err_msg' is used in order to get the string of an error code.
The function `est_db_open' is used in order to open a database.
The function `est_db_close' is used in order to close a database.
The function `est_db_error' is used in order to get the code of the last error that occurred in a database.
The function `est_db_fatal' is used in order to check whether a database has a fatal error.
The function `est_db_flush' is used in order to flush index words in the cache of a database.
The function `est_db_sync' is used in order to synchronize updating contents of a database.
The function `est_db_optimize' is used in order to optimize a database.
The function `est_db_put_doc' is used in order to add a document to a database.
The function `est_db_out_doc' is used in order to remove a document from a database.
The function `est_db_edit_doc' is used in order to edit attributes of a document in a database.
The function `est_db_get_doc' is used in order to retrieve a document in a database.
The function `est_db_get_doc_attr' is used in order to retrieve the value of an attribute of a document in a database.
The function `est_db_uri_to_id' is used in order to get the ID of a document specified by URI.
The function `est_db_name' is used in order to get the name of a database.
The function `est_db_doc_num' is used in order to get the number of documents in a database.
The function `est_db_word_num' is used in order to get the number of unique words in a database.
The function `est_db_size' is used in order to get the size of a database.
The function `est_db_search' is used in order to search a database for documents corresponding to a condition.
The function `est_db_scan_doc' is used in order to check whether a document object matches the phrase of a search condition object definitely.
The function `est_db_set_cache_size' is used in order to set the maximum size of the cache memory of a database.
The following is the simplest implementation of a gatherer.
#include <estraier.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv){
  ESTDB *db;
  ESTDOC *doc;
  int ecode;

  /* open the database */
  if(!(db = est_db_open("casket", ESTDBWRITER | ESTDBCREAT, &ecode))){
    fprintf(stderr, "error: %s\n", est_err_msg(ecode));
    return 1;
  }

  /* create a document object */
  doc = est_doc_new();

  /* add attributes to the document object */
  est_doc_add_attr(doc, "@uri", "http://estraier.gov/example.txt");
  est_doc_add_attr(doc, "@title", "Over the Rainbow");

  /* add the body text to the document object */
  est_doc_add_text(doc, "Somewhere over the rainbow. Way up high.");
  est_doc_add_text(doc, "There's a land that I heard of once in a lullaby.");

  /* register the document object to the database */
  if(!est_db_put_doc(db, doc, ESTPDCLEAN))
    fprintf(stderr, "error: %s\n", est_err_msg(est_db_error(db)));

  /* destroy the document object */
  est_doc_delete(doc);

  /* close the database */
  if(!est_db_close(db, &ecode)){
    fprintf(stderr, "error: %s\n", est_err_msg(ecode));
    return 1;
  }

  return 0;
}
The following is the simplest implementation of a searcher.
#include <estraier.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv){
  ESTDB *db;
  ESTCOND *cond;
  ESTDOC *doc;
  const CBLIST *texts;
  int ecode, *result, resnum, i, j;
  const char *value;

  /* open the database */
  if(!(db = est_db_open("casket", ESTDBREADER, &ecode))){
    fprintf(stderr, "error: %s\n", est_err_msg(ecode));
    return 1;
  }

  /* create a search condition object */
  cond = est_cond_new();

  /* set the search phrase to the search condition object */
  est_cond_set_phrase(cond, "rainbow AND lullaby");

  /* get the result of search */
  result = est_db_search(db, cond, &resnum, NULL);

  /* for each document in the result */
  for(i = 0; i < resnum; i++){

    /* retrieve the document object */
    if(!(doc = est_db_get_doc(db, result[i], 0))) continue;

    /* display attributes */
    if((value = est_doc_attr(doc, "@uri")) != NULL)
      printf("URI: %s\n", value);
    if((value = est_doc_attr(doc, "@title")) != NULL)
      printf("Title: %s\n", value);

    /* display the body text */
    texts = est_doc_texts(doc);
    for(j = 0; j < cblistnum(texts); j++){
      value = cblistval(texts, j, NULL);
      printf("%s\n", value);
    }

    /* destroy the document object */
    est_doc_delete(doc);

  }

  /* free the result of search */
  free(result);

  /* destroy the search condition object */
  est_cond_delete(cond);

  /* close the database */
  if(!est_db_close(db, &ecode)){
    fprintf(stderr, "error: %s\n", est_err_msg(ecode));
    return 1;
  }

  return 0;
}
Databases of Hyper Estraier are protected by file locking. While a writer is connected to a database, neither readers nor writers can connect. While a reader is connected to a database, other readers can connect, but writers cannot.
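The connection rule can be summarized as a small compatibility check. The following sketch is only a model of the rule as stated here: it performs no file locking itself, and the names `conn_mode', `db_state', and `can_connect' are hypothetical and are not part of the Hyper Estraier API.

```c
/* Hypothetical model of the connection rule; not part of the Hyper Estraier API. */
enum conn_mode { CONN_READER, CONN_WRITER };

struct db_state {
  int readers;   /* number of readers currently connected */
  int writers;   /* number of writers currently connected (0 or 1) */
};

/* Return 1 if a new connection of the given mode would be admitted, else 0. */
int can_connect(const struct db_state *st, enum conn_mode mode){
  if(st->writers > 0) return 0;                      /* a writer excludes everyone */
  if(mode == CONN_WRITER) return st->readers == 0;   /* a writer needs exclusive access */
  return 1;                                          /* readers share with other readers */
}
```

In practice this means that an application which updates the index while serving searches has to plan for the case where the database cannot be opened because it is held by another process.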
If you use multiple threads, it is recommended to use the MT-safe API of Hyper Estraier, which is a wrapper that makes the core API thread-safe. The MT-safe API provides the same functions as the core API, with the following differences.
If QDBM was built with `--enable-pthread', mutex protection is performed per connection rather than globally, so building QDBM that way is recommended when you use the MT-safe API.