dc.bot - The ADAM Web Resource Harvester

Background

dc.bot arose from an ADAM project requirement to harvest known sites and extract specific metadata to be used for site searching. After examining a number of possible tools, and finding that no single package met our full requirements, we decided to write our own harvesting tool.

The tool is based on a well-known mirroring package called lwp-rget.

dc.bot allows you to recursively grab the Dublin Core metadata from a site and store it in your required file format; it produces a record for each page encountered.

Using what we call a 'map file', it is possible to configure your own file format, within specific constraints.

Currently, dc.bot writes either ROADS format or Index+ format records; ideally this would be expanded in a generic fashion to cope with many different formats.

dc.bot conforms to the standard for web robot exclusion (actually an expired Internet draft).
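
To illustrate what honouring robot exclusion involves, here is a minimal sketch in Python using the standard library's urllib.robotparser. It is not taken from dc.bot (which is a Perl program); the robots.txt content and the 'dc.bot' user-agent name are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# 'dc.bot' as a user-agent name is an assumption, not confirmed by the source.
print(allowed(ROBOTS_TXT, "dc.bot", "http://adam.ac.uk/adam/"))
print(allowed(ROBOTS_TXT, "dc.bot", "http://adam.ac.uk/private/page.html"))
```

A harvester would typically make this check once per URL before fetching, and skip any page for which it returns False.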

Usage

Once you have dc.bot installed, simply typing 'dc.bot' will show you the parameters allowed when calling the robot :-

Usage: dc.bot [options] 
Allowed options are:
  --verbose         Produces output
  --check           Perform link check only 
                    (turns on verbose)
  --sleep=SECS      Sleep between gets, 
                    ie. go slowly
  --raw             Write raw records to 
                    STDOUT (turns off roads, 
                    turns off verbose)
  --map=string      Use the defined map file
                    (defaults to default.map)
  --dir=string      Set the directory where 
                    the record files will be 
                    stored (defaults to ./files)
  --help            Produces this usage summary

Here are some examples of how to use dc.bot :-

    # creates records based on 
    # default.map, in ./files directory
    dc.bot http://adam.ac.uk/adam/

    # performs a link check, does not 
    # write files
    dc.bot --check http://adam.ac.uk/nominate/

    # uses map file 'fred.map' to write records 
    # to ./new directory, displaying results;
    # pauses 1 second between fetches
    dc.bot --map=fred.map --dir=./new --verbose \
           --sleep=1 http://adam.ac.uk/nominate/

    # sends a list of raw data to STDOUT, 
    # for later processing
    dc.bot --raw http://adam.ac.uk/nominate/

Requirements

dc.bot should run on any UNIX machine that has Perl and the LWP module installed. The LWP module can be found at your local Perl archive (CPAN).

Constraints

dc.bot only uses the NAME and CONTENT attributes of META tags for DC metadata.
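
As a rough illustration of this constraint, the following Python sketch collects NAME/CONTENT pairs from META tags whose name begins with 'DC.'. It is an assumption about the general approach, not dc.bot's actual Perl code; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect (name, content) pairs from <meta> tags whose NAME starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> matches.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None and name.lower().startswith("dc."):
            self.entries.append((name, content))


def extract_dc(html):
    """Return the DC metadata entries of an HTML document, in document order."""
    parser = DCMetaParser()
    parser.feed(html)
    return parser.entries


# Hypothetical page fragment, for illustration only.
SAMPLE = (
    '<html><head><title>Example</title>'
    '<META NAME="DC.creator" CONTENT="Mark Burrell">'
    '<meta name="keywords" content="art, design">'
    '</head><body></body></html>'
)
print(extract_dc(SAMPLE))
```

Any other attributes on the META tag (SCHEME or LANG, for example) are simply ignored in this sketch, which matches the constraint stated above.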

Map file format

The first non-comment line must contain either 'index+' or 'roads'. The following lines consist of 5 fields separated by ':' (ROADS map files only make use of the first 3 fields). Any whitespace is ignored. The fields are as follows :-

  • Field
  • Type - valid values are 'R', 'S', 'M' or 'K'
  • Keywords
  • Itype - contains a valid type for the field (used with index+ only)
  • Ivalue - contains a valid type value for the field (used with index+ only)

The types are as follows :-

  • R - raw field; prints field and keywords with no parsing
  • S - parses keywords; displays only the first value found
  • M - parses keywords; shows all values found
  • K - strips (using ';' and ',') all values and displays them independently (index+ only)

If the Keywords field contains a comma-separated list, the metadata from the harvested resource is examined to find the first matching keyword; any other keywords are ignored.
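
The first-match rule can be sketched in Python as follows. This is only an illustration: it assumes that keywords are tried in the order they are listed and that matching is case-insensitive, neither of which the description above confirms. The function name and the metadata dictionary are hypothetical.

```python
def first_match(keywords, metadata):
    """Return the value of the first listed keyword present in `metadata`.

    `keywords` is a comma-separated string such as "DC.title, title";
    `metadata` maps lowercase keyword names to harvested values.
    Returns None when no keyword matches.
    """
    for keyword in (k.strip().lower() for k in keywords.split(",")):
        if keyword in metadata:
            return metadata[keyword]
    return None


# Hypothetical harvested values, for illustration only.
harvested = {"title": "ADAM home page", "dc.creator": "Mark Burrell"}
print(first_match("DC.title, title", harvested))
```

Here the page has no DC.title entry, so the lookup falls back to the content of the HTML title tag.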

Valid keywords are :-

  • doc - source (stripped of HTML) of harvested resource
  • url|uri - url of harvested resource
  • title - content of title HTML tag
  • description - content of Alta Vista description meta tag
  • keywords - content of Alta Vista keywords meta tag
  • content-type - content type of harvested resource
  • filename - filename for record
  • fullfilename - complete path and filename for record
  • indextimestamp - harvested date, in INDEX+ format
  • roadstimestamp - harvested date, in ROADS format
  • id - ID tag to use in all harvested records
  • DC.{value} - Dublin Core metadata

Anything after a '#' is treated as a comment.

Example Map File


        -------- Example begins ------------
        # Example map file to produce ROADS records
        
        roads # create ROADS records
        
        # each record will contain the following fields
        
        Template-Type   : R : DOCUMENT
        Handle          : S : filename
        Category        : S : dc.type
        Title           : S : DC.title, title
        URI-v1          : S : uri
        Author-name     : M : dc.creator
        Author-email    : M : dc.creator.email
        Source          : S : dc.source
        Description     : S : dc.description, description
        Publisher-name  : M : dc.publisher
        Publisher-email : M : dc.publisher.email
        Creation-Date   : S : dc.date
        Keywords        : S : dc.subject, keywords
        Format          : M : content_type, dc.format
        Language        : M : dc.language
        ISBN            : M : dc.identifier.isbn
        ISSN            : M : dc.identifier.issn
        Comments        : R : Created by automatic harvesting 
        

        -------- Example ends ------------
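
A map file in this format could be read along the following lines. This Python sketch is not dc.bot's own parser; it assumes that only whitespace around each field is discarded, that the keywords of an 'R' field are kept as one raw string, and that missing Itype and Ivalue fields are treated as empty. The names MapEntry and parse_map are hypothetical.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    field: str
    kind: str        # one of 'R', 'S', 'M', 'K'
    keywords: list   # raw text for 'R', otherwise the comma-separated keywords
    itype: str       # index+ only; empty for ROADS
    ivalue: str      # index+ only; empty for ROADS


def parse_map(text):
    """Parse map file text into (format, list of MapEntry)."""
    fmt = None
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if fmt is None:  # first non-comment line names the record format
            fmt = line.lower()
            if fmt not in ("roads", "index+"):
                raise ValueError(f"unknown record format: {line!r}")
            continue
        parts = [part.strip() for part in line.split(":", 4)]
        if len(parts) < 3:
            raise ValueError(f"expected at least 3 fields: {raw!r}")
        parts += [""] * (5 - len(parts))
        field, kind, keywords, itype, ivalue = parts
        if kind not in ("R", "S", "M", "K"):
            raise ValueError(f"invalid type {kind!r} in: {raw!r}")
        if kind == "R":
            keyword_list = [keywords]
        else:
            keyword_list = [k.strip() for k in keywords.split(",")]
        entries.append(MapEntry(field, kind, keyword_list, itype, ivalue))
    return fmt, entries


SAMPLE_MAP = """\
# Example map file to produce ROADS records
roads # create ROADS records
Template-Type   : R : DOCUMENT
Title           : S : DC.title, title
"""
fmt, entries = parse_map(SAMPLE_MAP)
print(fmt, entries)
```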

Producing raw data

The raw data produced by dc.bot also has a specific format. Each line has the format :-

    metadata-name<SEPARATOR>metadata-value

where the separator is a '+' (plus) sign. The metadata-name itself consists of two parts separated by a '-' (minus) sign: the first part is the position at which the metadata tag appeared in the resource file, and the second part is the actual metadata tag name.

One line of raw data might look like this :-

    3-dc.creator+Mark Burrell

This means that the 3rd metadata entry in the resource file was a DC.creator tag, and its value was Mark Burrell. An order value of 0 (zero) means that the entry is an extra keyword added from information other than the DC metadata available in the resource.
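
A raw line can be taken apart with two splits, as in this Python sketch. It assumes that the metadata name never contains a '+' and that only the first '-' separates the order from the name (so that a name such as content-type survives intact); the function name is hypothetical.

```python
def parse_raw_line(line):
    """Split one raw dc.bot line into (order, name, value)."""
    name_part, value = line.rstrip("\n").split("+", 1)
    order, name = name_part.split("-", 1)
    return int(order), name, value


print(parse_raw_line("3-dc.creator+Mark Burrell"))
# → (3, 'dc.creator', 'Mark Burrell')
```

Splitting on the first '+' only means that any plus signs inside the value itself are preserved.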

Files available for downloading

As normal, no liability accepted. Use at your own risk... blah, blah. Happy harvesting.

  • dc.bot - the web harvesting program itself
  • dc.bot.data - an example Perl program that grabs dc.bot raw data
  • dc.bot.wrapper - a Perl program that uses a file containing multiple URLs and feeds them to dc.bot one at a time
  • roads.cfg - a sample ROADS configuration map file
  • index+.cfg - a sample INDEX+ configuration map file
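
For readers who want to see the idea behind a wrapper of this kind, here is a rough Python sketch. It is not the dc.bot.wrapper program listed above: the one-URL-per-line file layout, the skipping of '#' comment lines, and all function names are assumptions. Only options from the usage summary are used.

```python
import subprocess


def read_urls(text):
    """Return the URLs in `text`, one per line, skipping blanks and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_command(url, map_file=None, out_dir=None, sleep=None):
    """Build the dc.bot argument list for one URL."""
    command = ["dc.bot"]
    if map_file:
        command.append(f"--map={map_file}")
    if out_dir:
        command.append(f"--dir={out_dir}")
    if sleep is not None:
        command.append(f"--sleep={sleep}")
    command.append(url)
    return command


def harvest_all(urls, **options):
    """Run dc.bot once per URL; requires dc.bot to be on the PATH."""
    for url in urls:
        subprocess.run(build_command(url, **options), check=True)


# Hypothetical URL list; the commands are printed here, not executed.
URL_LIST = """\
# sites to harvest
http://adam.ac.uk/adam/
http://adam.ac.uk/nominate/
"""
for url in read_urls(URL_LIST):
    print(" ".join(build_command(url, map_file="fred.map", sleep=1)))
```

Calling harvest_all(read_urls(URL_LIST), map_file="fred.map") would run the harvester for each site in turn, stopping at the first failure.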

Further Information

If you require further information, think you have found a bug, or merely want to send your comments on dc.bot, then please send email to Mark Burrell, ADAM and VADS Technical Officer.


© Surrey Institute of Art & Design on behalf of the ADAM Consortium.
Conditions of Use are available.