Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Free Software Foundation.

Webbase
*******

   `webbase' is a web crawler written in C and based on a MySQL
database.

   This is `Webbase Manual', last updated 8 August 2000 for `webbase'
Version 5.12.0.

   This manual describes `webbase'.

Introduction
************

What is `webbase'?
==================

   `webbase' is a crawler for the Internet. It has two main functions:
crawl the WEB to get documents and build a full text database with
these documents.

   The crawler part visits the documents and stores interesting
information about them locally. It revisits each document on a regular
basis to make sure that it is still there and updates it if it changes.

   The full text database uses the local copies of the documents to
build a searchable index. The full text indexing functions are *not*
included in `webbase'.

What `webbase' is not
=====================

   `webbase' is not a full text database. It uses a full text database
to search the content of the URLs retrieved.

Getting webbase
***************

   The home site of `webbase' is Senga (http://www.senga.org/). It
contains the software, online documentation, formatted documentation
and related software for various platforms.

   This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

   This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

   You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Overview of the concepts used by `webbase'
******************************************

   The crawler (or robot or spider) works according to specifications
found in the Home Pages database.

   The Home Pages database is a selection of starting points for the
crawler, including specifications that drive its actions for each
starting point.

   The crawler is in charge of maintaining an up-to-date image of the
WEB on the local disk. The set of URLs concerned is defined by the Home
Pages database.

   Using this local copy of the WEB, the full text database will build
a searchable index to allow full text retrieval by the user.

Home Pages Database
===================

   The Home Pages database is a list of starting points. The `webbase'
crawler is not designed to explore the entire WEB. It is best suited to
build specialized search engines on a particular topic.  The best way
to start a `webbase' crawler is to put a bookmark file in the Home
Pages database.

The crawler that fetches WEB documents
======================================

   The `crawler' works on a set of URLs defined by the Home Pages
database. It loads each page listed in the Home Pages database and
extracts hypertext links from them.  Then it explores these links
recursively until there are no more pages or the maximum number of
documents has been loaded.

Using a full text database
==========================

   The full text databases are designed to work with local files, not
with URLs. This is why the crawler has to keep a copy of the documents
found at the URLs on the local disk. In fact, a full text database able
to handle URLs is called an Internet search engine :-)

   The *hooks* library in *webbase* is designed to provide a bridge
between the crawler and a full text indexing library. It contains a
stub that does nothing (the `hooks.c' file) and an interface to the
*mifluz* full text indexing library (see
http://www.senga.org/mifluz/html/ to download it).

Using language recognition functionality
========================================

   When crawling a document, it is possible to detect the language of
the document and to store this information in the url table along with
the other url information.  To do this, you must use the *langrec*
module.  The language recognition module recognizes five languages:
French, English, Italian, Spanish and German.

File equivalent of a URL
========================

   The job of the crawler is to maintain a file system copy of the WEB.
It is, therefore, necessary to compute a unique file name for each URL.
A FURL is simply a transformation of a URL into a file system PATH
(hence FURL = File URL).

   The crawler uses an MD5 key calculated from the URL as a path name.
For example http://www.foo.com/ is transformed into the MD5 key
33024cec6160eafbd2717e394b5bc201 and the corresponding FURL is
33/02/4c/ec6160eafbd2717e394b5bc201. This makes it possible to store a
large number of files even on file systems that do not support many
entries in the same directory. The drawback is that it's hard to guess
which file contains which URL.
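
   The path derivation itself is easy to reproduce. Here is a minimal
sketch in C, assuming the MD5 digest of the URL is already available as
a 32 character hex string (`webbase' computes it internally; the
`md5_hex' variable below simply holds the example digest above):

     #include <stdio.h>

     int main(void)
     {
       /* MD5 hex digest of http://www.foo.com/, from the example above */
       const char* md5_hex = "33024cec6160eafbd2717e394b5bc201";
       char furl[64];

       /* split the digest as 2/2/2/26 to spread the files over
          three directory levels */
       snprintf(furl, sizeof(furl), "%.2s/%.2s/%.2s/%s",
                md5_hex, md5_hex + 2, md5_hex + 4, md5_hex + 6);
       printf("%s\n", furl); /* 33/02/4c/ec6160eafbd2717e394b5bc201 */
       return 0;
     }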

   An alternative encoding of the FURL is available through the *uri*
library. It is much more readable and can conveniently be used if the
number of URLs crawled is low (less than 50,000). The following figure
shows how the URL is mapped to a PATH.

     # An URL is converted to a FURL in the following way
     #
     #             http://www.ina.fr:700/imagina/index.html#queau
     #               |    \____________/ \________________/\____/
     #               |          |                   |       lost
     #               |          |                   |
     #               |         copied and           |
     #               |         default port         |
     #               |         added if not         |
     #               |         present              |
     #               |          |                   |
     #               |          |                   |
     #               |  /^^^^^^^^^^^^^\/^^^^^^^^^^^^^^^^\
     #            http:/www.ina.fr:700/imagina/index.html
     #

   A problem with URLs is that two URLs can lead to the same document
and not be the same string. A well-known example of that is
`something%2Felse', which is strictly identical to `something/else'. To
cope with this problem a canonical form has been defined; it obeys
complicated rules that lead to intuitive results.

   The mapping is designed to
   * be easy to parse by a program.

   * allow bijective transformation. i.e. a URL can be translated into a
     FURL and a FURL can be translated into a URL, without losing
     information.

   * be readable by humans, which is definitely not the case when you
     use `md5' encoding.

What is part of a WEB and what is not.
======================================

   When starting the exploration of a WEB, the crawler must answer this
question: is that link part of the server I'm exploring or not? It is
not enough to state that the URL is absolute or relative. The method
used in `webbase' is simple: if the URL is in the same directory on the
same server, then it belongs to the same WEB.
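
   For instance, with the starting point
`http://www.foo.com/bar/index.html' (a made-up URL), the link
`http://www.foo.com/bar/doc/intro.html' belongs to the same WEB because
it lives under the same `/bar/' directory, while
`http://www.foo.com/other/index.html' and `http://mirror.foo.com/bar/'
do not.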

Web Bounds
==========

   When surfing the WEB one can reach a large number of documents but
not all the documents available on the Internet.

   A very common example of this situation arises when someone adds new
documents to an existing WEB server. The author will gradually write and
polish the new documents, testing them and showing them to friends for
suggestions. For a few weeks the documents will exist on the Internet
but will not be public: if you don't know the precise URL of the
document you can surf forever without reaching it. In other words, the
document is served but no links point to it.

   This situation is even more common when someone moves a server from
one location to another. For one or two months the old location of the
server will still answer requests but links pointing to it will
gradually point to the new one.

Good planning is crucial
========================

   `webbase' is constantly working to keep the base up-to-date. It
follows a schedule that you can describe in a crontab.

   The administrator of `webbase' must spread the actions so that the
Internet link and the machine are not overloaded one day and idle the
next day. For instance, it is a good idea to rebuild the database
during the weekend and to crawl every weekday.
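
   Here is a minimal crontab sketch following that advice. It assumes a
base named `mybase', a `crawler' command installed in `/usr/local/bin'
and the `-home_pages' and `-rebuild' options described later in this
manual; adjust the paths and times to your installation.

     # crawl all the Home Pages every weekday at 1am
     0 1 * * 1-5 /usr/local/bin/crawler -base mybase -home_pages
     # rebuild the full text database on Sunday at 3am
     0 3 * * 0 /usr/local/bin/crawler -base mybase -rebuild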

Traces and debug help the manager
=================================

   Various flags enable verbose levels in the crawler (see the manual
page).  They are usually quite verbose and only useful to know if the
process is running or not. Error messages are very explicit, at least
for someone who knows the internals of `webbase'.

Memory and disk space considerations
====================================

   The WEB meta information database and the full text database are the
main space eaters. The documents stored in the WLROOT cache can be very
big if they are not expired.

   `webbase' tends to be very memory hungry. An average crawler takes
4Mb of memory. For instance, running five simultaneous mirroring
operations will need 20Mb of memory.

The crawler that mirrors the web
********************************

   The WEB can be viewed as a huge disk with a lot of bad blocks and
really slow access time. The WEB is structured in a hierarchical way,
very similar to file systems found on traditional disks.

   Crawling part of this huge disk (the WEB) for specific purposes
implies the need for specific tools to deal with these particularities.
Most of the tools that we already have to analyse and process data,
work with traditional file systems. In particular the `mifluz' database
is able to efficiently build a full text database from a given set of
files. The `webbase' crawler's main job is to map the WEB on the disk
of a machine so that ordinary tools can work on it.

Running the crawler
===================

   To run the crawler you should use the crawler(1) command.

     crawler -base mybase http://www.ecila.fr/

   will make a local copy of the `http://www.ecila.fr/' URL.

The first round
===============

   When given a set of URLs, the crawler tries to load them all. It
registers them as starting points for later recursion. These URLs will,
therefore, be treated specially. The directory part of the URL will be
used as a filter to prevent loading documents that do not belong to the
same tree during recursion.

Crawler recursion
=================

   For each starting point the crawler will consider all the links
contained in the document. Relative links will be converted to absolute
links. Only the links that start with the same string as the starting
point will be retained. All these documents will be loaded if they
satisfy the recursion conditions (see below).

   Any URLs contained in the pages that cannot be put in a canonical
form will be silently rejected.

   When all the documents found in the starting points are explored,
they go through the same process. The recursion keeps exploring the
servers until either it reaches the recursion limit or there are no
more documents.

Robot Exclusion
===============

   Exploration of a web site can be stopped using the robot exclusion
protocol. When the crawler finds a new host, it tries to load the
`robots.txt' file. If it does not find one it assumes that it is
allowed to explore the entire site. If it does find one, the content of
the document is parsed to find directives that may restrict the set of
searchable directories.

   A typical `robots.txt' file looks like

     User-agent: *
     Disallow: /cgi-bin
     Disallow: /secrets

   In addition to the `robots.txt' file, the robot exclusion protocol
forces the robot not to load a file from the same server more than once
per minute.

Heuristics
==========

   In order to keep a large number of URLs up-to-date locally,
`webbase' has to apply some heuristics to prevent overloading the
Internet link. Those heuristics are based on the interpretation of
error messages and on delay definitions.

   Error conditions are divided into two classes: transient errors and
fatal errors. Transient errors are supposed to go away after a while
and fatal errors mean that the document is lost forever.  One error
condition (`Not Found') is between the two: it is defined to be
transient but is most often fatal.

   When transient errors have been detected, a few days are required
before the crawler will try to load the document again (typically, 3
days).

   When fatal errors have been detected, the crawler will *never* try
to load the document again. The entry will go away after a while,
however. It is important to keep a document associated with a fatal
error for a few weeks, mainly because there is a good chance that many
pages and catalogs still have a link to it. After a month or two,
however, we can assume that every catalog and every search engine has
been updated and does not contain any reference to the bad document.

   If the document was loaded successfully, the crawler will not try to
reload it before a week or two (two weeks typically). Unless someone is
working very hard, updates more frequent than that are quite rare.
When the crawler tries to load the document again, two weeks after the
first try, it is often informed that the document has not changed
(`Not Modified' condition). In this case the crawler will wait even
longer before the next try (four weeks typically).

   The `Not Found' error condition is supposed to be transient. But
since it is so often fatal, it will be reloaded only after four weeks.
The fatal error condition that is supposed to match the transient `Not
Found' condition is `Gone', but it is almost never used.

Filtering the exploration of a WEB
==================================

   When starting to explore a starting point, the crawler uses a simple
recursive algorithm as described above. It is possible, however, to
control this exploration.

   A filter can be specified to select eligible URLs. The filter is an
`emacs' regular expression. If the expression returns true, the URL is
explored; if it returns false, the URL is skipped.

   Here is an example of a filter:
     !/ecila\.com/ && ;\/french\/;

   It matches the URLs not contained in the `ecila.com' host that have
the `/french/' directory somewhere in their path.

MIME type filtering
===================

   By default the crawler only accepts documents whose MIME type is
`text/*,magnus-internal/*'. You can change this behaviour by setting
the value of the `-accept' option of the crawler.  The values listed
are comma separated and can be either a fully qualified MIME type or
the beginning of a MIME type followed by a star as in `text/*'.  For
instance to crawl PostScript documents in addition to HTML, the
following option can be used:

     -accept 'application/postscript,text/html'

   An attempt is made to detect MIME types at an early stage. A table
mapping the common file extensions to their MIME types allows the
crawler to select the file names that are likely to contain such MIME
types. This is not a 100% reliable method since only the server that
provides the document is able to tell the crawler what type of document
it is. But file name extension conventions are so widespread and this
method saves so many connections that it is worth the risk.
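
   The idea behind that table is easy to illustrate. Here is a minimal
sketch in C of an extension to MIME type lookup; the table below is a
hypothetical excerpt, not the actual `mime2ext' content shipped with
`webbase':

     #include <stdio.h>
     #include <string.h>

     static const struct { const char* ext; const char* mime; } ext2mime[] = {
       { ".html", "text/html" },
       { ".txt",  "text/plain" },
       { ".ps",   "application/postscript" },
     };

     /* return the likely MIME type of a file name, or 0 if unknown */
     static const char* guess_mime(const char* name)
     {
       size_t i, len = strlen(name);
       for(i = 0; i < sizeof(ext2mime) / sizeof(ext2mime[0]); i++) {
         size_t extlen = strlen(ext2mime[i].ext);
         if(len >= extlen && !strcmp(name + len - extlen, ext2mime[i].ext))
           return ext2mime[i].mime;
       }
       return 0;
     }

     int main(void)
     {
       printf("%s\n", guess_mime("report.ps")); /* application/postscript */
       return 0;
     }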

Panic prevention
================

   Many tests are performed to prevent the crawler from crashing in the
middle of a mirror operation. Exceptions are trapped, malformed URLs
are rejected with a message, etc. Two tests are configurable because
they are sometimes inadequate for some servers.

   If a URL is too long, it is rejected. Sometimes `cgi-bin' programs
behave in such a way that they produce a call to themselves, adding a
new parameter to make it slightly different.  If the crawler dives into
this page it will call the `cgi-bin' again and again, and the URL will
keep growing.  When the URL grows over a given size (typically, 160
bytes), it is rejected and a message is issued. It is possible to call
the crawler with a parameter that changes the maximum size of a URL.

   When mirroring large amounts of sites, you sometimes find really
huge files, for instance 50 Mb log files from WEB servers.  By default
the crawler limits the size of data loaded from a document to 100K. If
the document is larger, the data will be truncated and a message will
be issued. This threshold can be changed if necessary.

Cookies handling
================

   A cookie is a name/value pair assigned to the visitor of a WEB by
the server. This pair is sent to the WEB client when it connects for
the first time. The client is expected to keep track of this pair and
resend it with further requests to the server.

   If a robot fails to handle this protocol, the WEB server usually
builds a special URL containing a string that identifies the client.
Since all the WEB browsers build relative links from the current URL
seen, the forged URL is used throughout the session and the server is
still able to recognize the user.

   `webbase' honors the cookie protocol transparently. This behaviour
can be deactivated if it produces undesirable results.  This may happen
in the case of servers configured to deal with a restricted set of
clients (only Netscape Navigator and MSIE for instance).

Proxy handling
==============

   A proxy is a gateway to the external world. In the most common case
a single proxy handles all the requests that imply a connection to a
site outside the local network.

   To specify a proxy for a given protocol you should set an
environment variable. This variable will be read by the crawler and the
specified proxy will be used.

     export http_proxy=http://proxy.provider.com:8080/
     export ftp_proxy=http://proxy.provider.com:8080/

   If you don't want to use the proxy for a particular domain, for
instance a server located on your local network, use the `no_proxy'
variable.  It can contain a list of domains separated by commas.

     export no_proxy="mydom.com,otherdom.com"

   To specify a proxy that will be used by all the commands that call
the crawler, add the `http_proxy', `ftp_proxy' and `no_proxy' variables
in the <user>_env file located in /usr/local/bin or in the home
directory of <user>. To change the values for a specific domain you
just have to locally set the corresponding variables to a different
value.

     export http_proxy=http://proxy.iway.fr:8080/ ; crawler http://www.ecila.fr/

   Using a proxy may have adverse effects on the accuracy of the crawl.
Since the crawler implements heuristics to minimize the documents
loaded, its functionality is partially redundant with that of the
proxy. If the proxy returns a `Not modified' condition for a given
document, it is probably because the proxy cache still considers it
`Not modified' even though it may have changed on the reference server.

Indexing the documents
**********************

   When the crawler successfully retrieves a URL, it submits it
immediately to the full text database, if any. If you've downloaded the
*mifluz* library, you should compile and link it with *webbase*.

   The crawler calls full text indexing hooks whenever the status of a
document changes. If the document is not found, the delete hook is
called and the document is removed from the full text index. If a new
document is found the insert hook is called to add it to the full text
index. These hooks can be disabled so that the crawler is only used for
mirroring (see the `-no_hook' option in the `crawler' manual page).

Meta database
*************

   The Meta database is a `MySQL' database that contains all the
information needed by the crawler to crawl the WEB. Each exploration
starting point is described in the `start' table.

   The following command retrieves all the URLs known in the `test'
Meta database.

     $ mysql -e "select url from url" test
     url
     http://www.senat.fr/
     http://www.senat.fr/robots.txt
     http://www.senat.fr/somm.html
     http://www.senat.fr/senju98/
     http://www.senat.fr/nouveau.html
     http://www.senat.fr/sidt.html
     http://www.senat.fr/plan.html
     http://www.senat.fr/requete2.html
     http://www.senat.fr/_vti_bin/shtml.dll/index.html/map
     ...

url table
=========

`url'
     Absolute url

`url_md5'
     MD5 encoding of the url field.

`code'
     HTTP code of the last crawl

`mtime'
     Date of the last successful crawl

`mtime_error'
     Date of the last crawl with error (either fatal or transient)

`tags'
     List of active tags for this URL

`content_type'
     MIME type of the document contained in the URL

`content_length'
     Total length of the document contained in the URL, not including
     headers.

`complete_rowid'
     Value of the `rowid' field of an entry in the `url_complete'
     table. The `url_complete' is filled with information that can be
     very big such as the hypertext links contained in an HTML document.

`crawl'
     Date of the next crawl, calculated according to the heuristics
     defined in the corresponding Home page.

`hookid'
     Internal identifier used by the full text hook. When `mifluz' is
     used it is the value of the `rowid' field. If 0 it means that the
     document was not indexed.

`extract'
     The first characters of the HTML document contained in the URL.

`title'
     The first 80 characters of the title of the HTML document
     contained in the URL.
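
   Since these fields live in a plain `MySQL' table, they can be
inspected directly. For instance, the following hypothetical query
lists the URLs whose last crawl failed, together with their HTTP code
(assuming a Meta database named `mybase'):

     $ mysql -e "select url, code from url where code >= 400" mybase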

url_complete table
==================

   This table complements the url table. It contains information that
may be very big; keeping it apart prevents the url table from growing
too much. An entry is created in the url_complete table for a
corresponding url only if there is a need to store some data in its
fields.

   The URLs stored in the `relative' and `absolute' fields have been
canonicalized. That means that they are syntactically valid URLs that
can be string compared.

`keywords'
     content of the `meta keywords' HTML tag, if any.

`description'
     content of the `meta description' HTML tag, if any.

`cookie'
     original cookie associated with the URL by the server, if any.

`base_url'
     URL contained in the `<base>' HTML tag, if any.

`relative'
     A white space separated list of relative URLs contained in the
     HTML document.

`absolute'
     A white space separated list of absolute URLs contained in the
     document.

`location'
     The URL of the redirection if the URL is redirected.

mime2ext table
==============

   The mime2ext table associates all known mime types to file name
extensions. Adding an entry in this table such as

     insert into mime2ext values ('application/dsptype', 'tsp,');

   will effectively prevent loading the URLs that end with the `.tsp'
extension. Note that if you want to add a new MIME type so that it is
recognized by the crawler and loaded, you should also update the list
of MIME types listed in the set associated with the `content_type'
field of the url table.
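
   For instance, to check which extensions are currently associated
with a given MIME type, a query such as the following can be used (the
MIME type is just an example):

     $ mysql -e "select ext from mime2ext where mime = 'application/postscript'" <base>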

`mime'
     Fully qualified MIME type.

`ext'
     Comma separated list of file name extensions for this MIME type.
     The list `MUST' be terminated by a comma.

mime_restrict table
===================

   The mime_restrict table is a cache for the crawler. If the mime2ext
table is modified, the mime_restrict table should be cleaned with the
following command:

     mysql -e 'delete from mime_restrict' <base>

   This deletion may be safely performed even if the crawler is running.

indexes
=======

   The url table is indexed on the `rowid' and on the `url_md5'.

   The url_complete table is indexed on the `rowid' only.

   The mime2ext table is indexed on the `mime' and `ext' fields.

Home Pages database
*******************

   The Home Pages database is a collection of URLs that will be used by
the crawler as starting points for exploration.  Universal robots like
`AltaVista' do not need such lists because they explore everything. But
specialized robots like `webbase' have to define a set of URLs to work
with.

   Since the first goal of `webbase' was to build a search engine
gathering French resources, the definition and the choice of the
attributes associated with each Home Page is somewhat directed toward
this goal. In particular there is no special provision to help build a
catalog: no categories, no easy way to submit a lot of URLs at once, no
support to quickly test a catalog page.

Home Pages database storage
===========================

   The Home Pages database is stored in the `start' table of the meta
information database. The fields of the table are divided into three
main classes.

   * Information used for crawling. These are the fields used by the
     crawler to control the exploration of a particular Home Page.
     Depth for instance is the maximum number of documents retrieved by
     the crawler for this Home Page.

   * Internal data. These fields must `not' be changed and are for
     internal use by the crawler.

   * User defined data. The user may add fields in this part or remove
     some. They are only used by the cgi-bin that register a new
     starting point.


   Here is a short list of the fields and their meaning. Some of them
are explained in greater detail in the following sections.

`url'
     The URL of the Home Page, starting point of exploration.

`url_md5'
     MD5 encoding of the url field.

`info'
     State information about the home page.

`depth'
     Maximum number of documents for this Home Page.

`level'
     Maximum recursion level for this Home Page.

`timeout'
     Reader timeout value for this Home Page. Some WEBs are slow to
     respond and the default value of 60 seconds may not be enough.

`loaded_delay'
     Number of days to wait before next crawl if the URL is loaded
     successfully for the first time.

`modified_delay'
     Number of days to wait before next crawl if the URL was found
     `Not Modified' at last crawl.

`not_found_delay'
     Number of days to wait before next crawl if the URL was found
     `Not Found' at last crawl.

`timeout_delay'
     Number of days to wait before next crawl if the URL could not be
     reached because the server was down, a timeout occurred during the
     load, or any transient error happened.

`robot_delay'
     Number of seconds to wait between two crawls on this server. The
     default of 60 seconds may be reduced to 0 when crawling an Intranet
     or the WEB of a friend.

`auth'
     Authentication string.

`accept'
     Accepted MIME types specification for this Home Page. For more
     information about the values of this field refer to *Note MIME
     type filtering: MIME Filtering.

`filter'
     Filter applied on all hypertext links that will be crawled for this
     home page. Only the hypertext links that match the filter will be
     crawled. For more information about the values of this field refer
     to *Note Filtering the exploration of a WEB: Filtering Exploration.

`created'
     Date of creation of the record.

State information
=================

   The `info' field contains information set or read by the crawler
when exploring the Home Page. This field may contain a combination of
these values, separated by a comma. The meaning of the values is as
follows:

`unescape'
     If it is set, the crawler will restore the URL to its `unescaped'
     form before querying the server. Some servers do not handle the
     `%xx' notation to specify a character.

`sticky'
     If it is set, the crawler will not give up on a WEB that times
     out.  In normal operation, if a WEB has trouble responding within
     a reasonable time, the crawler will skip it and continue with
     others.

`sleepy'
     If it is set, the crawler will sleep as soon as required. In
     normal operation the crawler will try to find some server that
     does not require sleeping. It may save some CPU when crawling a
     single WEB.

`nocookie'
     If it is set, cookie handling is disabled.

`virgin'
     This flag is automatically set when a new entry is inserted in the
     Home Page database. The crawler will then know that this entry has
     never been explored and take the appropriate action.

`exploring'
     When the crawler is exploring a Home Page, it sets this flag.  If
     the process is interrupted, it will then know that the crawl
     should be resumed.

`explored'
     The crawler sets this flag when it has finished to explore the
     Home Page. When it will consider this entry again, it will know
     that an update may be necessary.

`updating'
     When the crawler tries to update an already successfully explored
     Home Page, it sets this flag. When it found what needs to be
     crawled and what does not, it sets the exploring state and
     continue as if the Home Page was explored for the first time.

`in_core'
     This flag is automatically set on entries that are explored using
     the UNIX command line for the `crawler' command.
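
   The user settable flags are plain values in the `info' column of the
`start' table, so they can be changed by hand. Here is a hypothetical
example that makes the crawler persistent and disables cookies for one
Home Page; it assumes a base named `mybase' and that no other flags
need to be preserved on this entry:

     $ mysql -e "update start set info = 'sticky,nocookie' where url = 'http://www.thisweb.com/'" mybase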

C language interface to the crawler
***********************************

   `webbase' provides a C interface to the crawler. The following
sections are a guide to the usage of this interface.

Initialize the crawler
======================

   The initialization of the crawler is done with an argument vector,
in the same way the `main()' function of a program is initialized.
Here is a short example:

           {
             crawl_params_t* crawl;
     
             char* argv[] = {
               "myprog",
               "-base", "mybase",
               "-no_hook",
               0
             };
             int argc = 4;
     
             crawl = crawl_init(argc, argv);
           }

   If the `crawl_init' function fails, it returns a null pointer.  The
`crawl' variable now holds an object that uses `mybase' to store the
crawl information. The `-base' option is mandatory.

Using the crawler
=================

   If you want to crawl a Home Page, use the `hp_load_in_core' function.
Here is an example:

     hp_load_in_core(crawl, "http://www.thisweb.com/");

   This function recursively explores the Home Page given as argument.
If the URL of the Home Page is not found in the `start' table, it will
be added. The `hp_load_in_core' function does not return anything.
Error conditions, if any, are stored in the entries describing each URL
in the `url' table.

   If you want to retrieve a specific URL that has already been crawled
use the `crawl_touch' function (this function must NOT be used to crawl
a new URL).  It will return a `webbase_url_t' object describing the
URL. In addition, if the content of the document is not in the `WLROOT'
cache, it will be crawled.  Here is an example:

     webbase_url_t* webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");

   If you want to access the document found at this URL, you can get
the full pathname of the temporary file that contains it in the
`WLROOT' cache using the `url_furl_string' function.  Here is an
example:

     char* path = url_furl_string(webbase_url->w_url, strlen(webbase_url->w_url), URL_FURL_REAL_PATH);

   The function returns a null pointer if an error occurs.

   When you are finished with the crawler, you should free the `crawl'
object with the `crawl_free' function.  Here is an example:

     crawl_free(crawl);

   When an error occurs, all these functions issue error messages on
the `stderr' channel.
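
   Putting the pieces together, here is a minimal sketch of a complete
program built from the functions documented above. The URL is just an
example; see the next sections for the structure fields and for how to
compile such a program:

     #include <stdio.h>
     #include <string.h>
     #include <crawl.h>

     int main(void)
     {
       crawl_params_t* crawl;
       webbase_url_t* webbase_url;

       char* argv[] = {
         "myapp",
         "-base", "mybase",
         "-no_hook",
         0
       };
       int argc = 4;

       /* connect to the mybase Meta database */
       crawl = crawl_init(argc, argv);
       if(!crawl) return 1;

       /* retrieve an already crawled URL, loading it in the WLROOT
          cache if necessary */
       webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");
       if(webbase_url) {
         /* full pathname of the local copy of the document */
         char* path = url_furl_string(webbase_url->w_url,
                                      strlen(webbase_url->w_url),
                                      URL_FURL_REAL_PATH);
         if(path)
           printf("local copy: %s\n", path);
       }

       crawl_free(crawl);
       return 0;
     }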

The webbase_url_t structure
===========================

   You will find the `webbase_url_t' structure in the `webbase_url.h'
header file, which is automatically included by the `crawl.h' header
file.

   The real structure of `webbase_url_t' is made of included
structures; however, macros hide these details.  You should access all
the fields using the `w_<field>' macros such as:

     char* location = webbase_url->w_location;

`int w_rowid'
     The unique identifier allocated by `mysql'.

`char w_url[]'
     The URL of the Home Page, starting point of exploration.

`char w_url_md5[]'
     MD5 encoding of the url field.

`unsigned short w_code'
     HTTP code of the last crawl

`time_t w_mtime'
     Date of the last successful crawl

`time_t w_mtime_error'
     Date of the last crawl with error (either fatal or transient)

`unsigned short w_tags'
     List of active tags for this URL

`char w_content_type[]'
     MIME type of the document contained in the URL

`unsigned int w_content_length'
     Total length of the document contained in the URL, not including
     headers.

`int w_complete_rowid'
     Value of the `rowid' field of an entry in the `url_complete'
     table. The `url_complete' is filled with information that can be
     very big such as the hypertext links contained in an HTML document.

`time_t w_crawl'
     Date of the next crawl, calculated according to the heuristics
     defined in the corresponding Home page.

`int w_hookid'
     Internal identifier used by the full text hook. When `mifluz' is
     used it is the value of the `rowid' field. If 0 it means that the
     document was not indexed.

`char w_extract[]'
     The first characters of the HTML document contained in the URL.

`char w_title[]'
     The first 80 characters of the title of the HTML document
     contained in the URL.

`char w_keywords[]'
     content of the `meta keywords' HTML tag, if any.

`char w_description[]'
     content of the `meta description' HTML tag, if any.

`char w_cookie[]'
     original cookie associated with the URL by the server, if any.

`char w_base_url[]'
     URL contained in the `<base>' HTML tag, if any.

`char w_relative[]'
     A white space separated list of relative URLs contained in the
     HTML document.

`char w_absolute[]'
     A white space separated list of absolute URLs contained in the
     document.

`char w_location[]'
     The URL of the redirection if the URL is redirected.

   The `webbase_url_t' structure holds all the information describing
a URL, including hypertext references. However, it does not contain the
content of the document.

   The `w_info' field is a bit field. The allowed values are listed in
the `WEBBASE_URL_INFO_*' defines. It is especially important to
understand that flags must be tested prior to accessing some fields
(w_cookie, w_base, w_home, w_relative, w_absolute, w_location). Here is
an example:

     if(webbase_url->w_info & WEBBASE_URL_INFO_LOCATION) {
       char* location = webbase_url->w_location;
       ...
     }

   If the corresponding flag is not set, the value of the field is
undefined.  All the strings are null terminated. You must assume that
all the strings can be of arbitrary length.

`FRAME'
     Set if URL contains a frameset.

`COMPLETE'
     Set if `w_complete_rowid' is not null.

`COOKIE'
     Set if `w_cookie' contains a value.

`BASE'
     Set if `w_base' contains a value.

`RELATIVE'
     Set if `w_relative' contains a value.

`ABSOLUTE'
     Set if `w_absolute' contains a value.

`LOCATION'
     Set if `w_location' contains a value.

`TIMEOUT'
     Set if the last crawl ended with a timeout condition.  Remember
     that a `timeout' condition may be a server refusing connection as
     well as a slow server.

`NOT_MODIFIED'
     Set if the last crawl returned a `Not Modified' code. The true
     value of the code can be found in the `w_code' field.

`NOT_FOUND'
     Set if the last crawl returned a `Not Found' code.

`OK'
     Set if the document contains valid data and is not associated with
     a fatal error. It may be set, for instance, for a document
     associated with a timeout.

`ERROR'
     Set if a fatal error occurred. It may be `Not Found' or any other
     fatal error. The real code of the error can be found in the
     `w_code' field.

`HTTP'
     Set if the scheme of the URL is `HTTP'.

`FTP'
     Set if the scheme of the URL is `FTP'.

`NEWS'
     Set if the scheme of the URL is `NEWS'.

`EXTRACT'
     Set if the `w_extract' field contains a value.

`TITLE'
     Set if the `w_title' field contains a value.

`KEYWORDS'
     Set if the `w_keywords' field contains a value.

`DESCRIPTION'
     Set if the `w_description' field contains a value.

`READING'
     Temporarily set when in the read loop.

`TRUNCATED'
     Set if the data contained in the document was truncated because of
     a read timeout, for instance.

`FTP_DIR'
     Set if the document contains an `ftp' directory listing.

   The values of the `w_code' field are defined by macros that start
with `WEBBASE_URL_CODE_*'. Some artificial error conditions have been
built and are not part of the standard. Their values are between 600
and 610.

Compiling an application with the webbase library
=================================================

   A simple program that uses the crawler functions should include the
`crawl.h' header file. When compiling it should search for includes in
/usr/local/include and /usr/local/include/mysql. The libraries will be
found in /usr/local/lib and /usr/local/lib/mysql.  Here is an example:

     $ cc -c -I/usr/local/include \
       -I/usr/local/include/mysql myapp.c
     $ cc -o myapp myapp.o -L/usr/local/lib -lwebbase -lhooks -lctools \
       -L/usr/local/lib/mysql -lmysql

   The libraries `webbase', `hooks', `ctools' and `mysql' are all
mandatory. If using *mifluz* you'll have to add other libraries, as
specified in the *mifluz* documentation.

A more in-depth view of the crawler
***********************************

Conventions used in figures
===========================

   The algorithms presented in this chapter are centered on functions
and functionalities. Arrows between rectangle boxes represent function
calls. When a function successively calls different functions or
integrates different functionalities, arrows are numbered to show the
order of execution. When a function is called, the name of the function
is written in the rectangle box. Alternatively, a short comment explains
what is being done:
     save cookie in DB
   A rectangle box with rounded corners represents calls inside the
module.  A "normal" rectangle box represents calls outside the module.

How the crawler works
=====================

   The following figures show the algorithms implemented by the
crawler. In all cases, the crawler module is the central part of the
application, which is why the following figures are centered on it.

Crawl a virgin URL
------------------

   The following figure presents what is done when a new URL is crawled.

Crawl a list of URLs
--------------------

   The following figure presents a crawl from a list of URLs.

Rebuild URLs
------------

   The following figure presents the rebuilding of an existing URL.

Where the work is done
======================

crawler
-------

description
...........

   It is the main module of the application. It centralizes all the
processing. It is first used for crawl parameter initialization, then
it manages the crawling. Some of the algorithms used in this module are
presented in the previous section.

structure
.........

cookies
-------

description
...........

   This module handles cookie related tasks:

   * retrieve cookies from the DB into memory structures

   * load/store cookies in the database

   * parse HTTP cookie strings

   * find the cookies, if any, to be used for URLs

algorithms
..........

   The `cookie_match' function is called by the crawler when building
outgoing requests to send to HTTP servers. If it finds one,
`cookie_match' returns the cookie that must be used.

dirsel
------

description
...........

   This module handles robots.txt and user defined Allow/Disallow
clauses.

algorithms
..........

   The `dirsel' module has two main functions:

   * build a list of strings to add to user Allow/Disallow clauses
     (`dirsel_allow' function)

   * during the exploration of HTTP servers, verify that URLs are
     allowed, according to the robots.txt file and user defined clauses
     (`dirsel_allowed' function)

http
----

description
...........

   The `http' module is used by `webtools'. When it reads an HTTP
header or body, the `webtools' module calls the `http_header' or
`http_body' functions (as callback functions) to manage the information
contained in them.  Information extracted from the header or body of
pages is stored in a `webbase_url_t' struct.  The `html_content_begin'
function initializes an `html_content_t' structure, then calls the
`html_content_parse' function to parse the body.

algorithms
..........

robots
------

description
...........

   The robots module is in charge of the Robot Exclusion Protocol.

structures
..........

algorithms
..........

   The `robots_load' function is used to create Allow/Disallow strings
from information contained in robots.txt files. This function first
checks whether the information is contained in the current uri object.
If not, it tries to find the information in the database. Finally, if
it has not found it, it crawls the robots.txt file located at the
current URL (and if this file does not exist, no Allow/Disallow clauses
are created).

webbase
-------

description
...........

   The `webbase' module is an interface to manipulate three of the most
important tables of the webbase database:
   * start

   * url

   * start2url
   These tables are described in *Note Meta database:: and *Note Home
Pages::.

structures
..........

   The `webbase_t' structure mainly contains parameters used to connect
to the database. The other modules that need access to the database
use a `webbase_t' to store database parameters (*robots*, *cookies*,
...).

webbase_url
-----------

description
...........

   The *webbase_url* module mainly manages the `webbase_url_*_t'
structures. These structures are described in the next paragraph and in
*Note webbase_url_t::.

structures
..........

webtools
--------

description
...........

   The `webtools' module manages the connection between the local
machine and the remote HTTP server. That is to say:
   * Open and close the connection between the two machines

   * Read and write data on the socket
   When webtools reads data from the socket, it first looks at its
HTTP type (header or body) and calls the appropriate parse function in
the *http* module.

structures
..........

algorithms
..........

Relationships between structures
================================

Examples about how to use the crawler
=====================================

   Here are some examples of how to use the crawler on the command
line: how to use the options, which options need other options, which
options are not compatible, and so on.

   Crawl a single URL:
     crawler -base nom_base -no_hook -- http://url.to.crawl

   rebuild option: rebuild URLs (remove all the records from the full
text database and resubmit all the URLs for indexing, *mifluz* module
mandatory).
     crawler -base nom_base -rebuild

   rebuild a set of URLs (the `where_url' option is only used when
associated with the `rebuild' option):
     crawler -base nom_base -rebuild -where_url "url regex 'regex_clause'"(1)

   unload option: remove the starting point and all the URLs linked to
it.
     crawler -base nom_base -no_hook -unload -- http://url.to.unload

   unload_keep_start option: same as unload except that the starting
point is left in the DB.
     crawler -base nom_base -no_hook -unload_keep_start -- http://url.to.unload

   home_pages option: load all the URLs listed in the start table.
     crawler -base nom_base -no_hook -home_pages

   schema option: obtain a schema of the database.
     crawler -base nom_base -schema

   ---------- Footnotes ----------

   (1) don't forget to specify the entire where clause

Index of Concepts
*****************

Absolute URLs:
          See ``What is part of a WEB and what is not.''.
Accuracy and proxies:
          See ``Proxy handling''.
C structures:
          See ``Relationships between structures''.
Canonical form of a URL:
          See ``File equivalent of a URL''.
Canonicalization and rejection:
          See ``Crawler recursion''.
cookies:
          See ``cookies''.
crawl : rebuild URLs:
          See ``Rebuild URLs''.
crawl a list of URLs:
          See ``Crawl a list of URLs''.
crawl a virgin URL:
          See ``Crawl a virgin URL''.
crawler:
          See ``crawler''.
Crawler <1>:
          See ``The crawler that mirrors the web''.
Crawler <2>:
          See ``The crawler that fetches WEB documents''.
Crawler:
          See ``Overview of the concepts used by `webbase'''.
crawler algorithms:
          See ``How the crawler works''.
Crawler and full text:
          See ``Using a full text database''.
Crawler boundaries:
          See ``What is part of a WEB and what is not.''.
Crawler definition:
          See ``Using a full text database''.
Crawler heuristics:
          See ``Heuristics''.
Crawler implemented in crawler command:
          See ``Running the crawler''.
Crawler recursion:
          See ``Crawler recursion''.
Crawler starting points:
          See ``Home Pages Database''.
Crawling Home Pages:
          See ``The first round''.
Crawling starting points:
          See ``The first round''.
Database stability:
          See ``Using a full text database''.
dirsel:
          See ``dirsel''.
Editorial capabilities:
          See ``Home Pages database''.
Escape server boundaries:
          See ``Filtering the exploration of a WEB''.
examples:
          See ``Examples about how to use the crawler''.
Extraction of links in an HTML document:
          See ``Crawler recursion''.
Fatal errors:
          See ``Heuristics''.
Fatal errors delay:
          See ``Heuristics''.
File extensions and MIME types:
          See ``MIME type filtering''.
File system copy of URLs:
          See ``File equivalent of a URL''.
Filter crawled URLs:
          See ``Filtering the exploration of a WEB''.
Filter example:
          See ``Filtering the exploration of a WEB''.
Firewall:
          See ``Proxy handling''.
First round:
          See ``The first round''.
ftp_proxy:
          See ``Proxy handling''.
Full Text database:
          See ``Overview of the concepts used by `webbase'''.
FURL definition:
          See ``File equivalent of a URL''.
FURL structure:
          See ``File equivalent of a URL''.
Global proxy:
          See ``Proxy handling''.
Home Pages database <1>:
          See ``Home Pages Database''.
Home Pages database:
          See ``Overview of the concepts used by `webbase'''.
Home Pages database definition:
          See ``Home Pages database''.
Home Pages database structure:
          See ``Home Pages database storage''.
http:
          See ``http''.
Http proxy:
          See ``Proxy handling''.
http_proxy:
          See ``Proxy handling''.
Increase the maximum size of an URL:
          See ``Panic prevention''.
Indexing hooks:
          See ``Indexing the documents''.
Indexing retrieved URLs:
          See ``Indexing the documents''.
Language Recognition:
          See ``Using language recognition functionality''.
Long URLs:
          See ``Panic prevention''.
Maximum document size:
          See ``Panic prevention''.
Memory and disk requirements:
          See ``Memory and disk space considerations''.
Meta database:
          See ``Meta database''.
MIME type filtering:
          See ``MIME type filtering''.
Monitoring catalogs:
          See ``Good planning is crucial''.
mysql:
          See ``Meta database''.
Navigable domain:
          See ``Web Bounds''.
no_proxy:
          See ``Proxy handling''.
Not Found error is transient:
          See ``Heuristics''.
Not modified heuristics:
          See ``Heuristics''.
Panic prevention:
          See ``Panic prevention''.
Planning design hints:
          See ``Good planning is crucial''.
Proxy:
          See ``Proxy handling''.
Proxy accuracy:
          See ``Proxy handling''.
Proxy global:
          See ``Proxy handling''.
RAM requirements:
          See ``Memory and disk space considerations''.
Recursive crawling:
          See ``The crawler that fetches WEB documents''.
Rejecting non canonical URLs:
          See ``Crawler recursion''.
Relative URLs:
          See ``What is part of a WEB and what is not.''.
Releasing crawler boundaries:
          See ``Filtering the exploration of a WEB''.
Robot exclusion:
          See ``Robot Exclusion''.
robots:
          See ``robots''.
robots.txt file:
          See ``Robot Exclusion''.
Socks:
          See ``Proxy handling''.
Successfully loaded heuristics:
          See ``Heuristics''.
Trace verbosity:
          See ``Traces and debug help the manager''.
Tracing commands:
          See ``Traces and debug help the manager''.
Transient errors:
          See ``Heuristics''.
Transient errors delay:
          See ``Heuristics''.
Unreachable documents:
          See ``Web Bounds''.
Update frequency:
          See ``Using a full text database''.
Updating known URLs:
          See ``Good planning is crucial''.
URL parts and crawl boundaries:
          See ``The first round''.
url table:
          See ``url table''.
URLs graph:
          See ``Web Bounds''.
webbase:
          See ``webbase''.
webbase_url:
          See ``webbase_url''.
webtools:
          See ``webtools''.

