Background
dc.bot arose from an ADAM project requirement to harvest known sites and extract specific metadata for use in site searching. After examining a number of possible tools and finding that no single package met our full requirements, we decided to write our own harvesting tool.
The tool is based on a well-known mirroring package called lwp-rget. dc.bot allows you to recursively grab the Dublin Core metadata from a site and store it in your required file format; it produces a record for each page encountered. Using what we call a 'map file', it is possible to configure your own file format, within specific constraints. Currently, dc.bot writes either ROADS format or Index+ format records; ideally this would be expanded in a generic fashion to cope with many different formats. dc.bot conforms to the standard for web robot exclusion (actually an expired Internet draft).

Usage
Once you have dc.bot installed, simply typing 'dc.bot' will notify you of the parameters allowed in calling the robot :-
Usage: dc.bot [options]
Here are some examples of how to use dc.bot :-
# creates records based on
# default.map, in ./files directory
dc.bot http://adam.ac.uk/adam/
# performs a link check, does not
# write files
dc.bot --check http://adam.ac.uk/nominate/
# uses map file 'fred.map' to write records
# to ./new directory, displaying results
# each record is created 1 second apart
dc.bot --map=fred.map --dir=./new --verbose \
  --sleep=1 http://adam.ac.uk/nominate/
# sends a list of raw data to STDOUT,
# for later processing
dc.bot --raw http://adam.ac.uk/nominate/
Requirements
dc.bot should run on any UNIX machine that has Perl and the LWP module installed. The LWP module can be found at your local Perl archive (CPAN).

Constraints
dc.bot only uses NAME and CONTENT entries for DC metadata.

Map file format
The first non-comment line has to contain either 'index+' or 'roads'. Each following line consists of 5 fields separated by a ':' (ROADS map files only make use of the first 3 fields). Any whitespace is ignored. The fields are as follows :-
The types are as follows :-
If the Keywords field contains a comma-separated list, the metadata from the harvested resource is examined to find the first matching keyword; any other keywords are ignored. Valid keywords are :-
Anything after a '#' is treated as a comment.

Example Map File
-------- Example begins ------------
# Example map file to produce ROADS records
roads # create ROADS records
# each record will contain the following fields
Template-Type : R : DOCUMENT
Handle : S : filename
Category : S : dc.type
Title : S : DC.title, title
URI-v1 : S : uri
Author-name : M : dc.creator
Author-email : M : dc.creator.email
Source : S : dc.source
Description : S : dc.description, description
Publisher-name : M : dc.publisher
Publisher-email : M : dc.publisher.email
Creation-Date : S : dc.date
Keywords : S : dc.subject, keywords
Format : M : content_type, dc.format
Language : M : dc.language
ISBN : M : dc.identifier.isbn
ISSN : M : dc.identifier.issn
Comments : R : Created by automatic harvesting
-------- Example ends ------------

Producing raw data
The raw data produced by dc.bot has a specific format too. Each line has the format :-
metadata-name<SEPARATOR>metadata-value
where the separator is a '+' (plus) sign. Additionally, the metadata-name consists of two parts separated by a '-' (minus) sign: the first part is the order in which the metadata tag appeared in the resource file, and the second part is the actual metadata name tag. One line of raw data might look like this :-
3-dc.creator+Mark Burrell
This means that the 3rd metadata entry in the resource file was a DC.creator tag, and its value was Mark Burrell. If the order value is 0 (zero), the entry is an extra keyword added from information other than the DC metadata available in the resource.
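
As a rough illustration of the map file and raw data formats described above, here is a minimal Python sketch that reads lines in each format. It is not part of dc.bot (which is a Perl tool), and the function names parse_map_line, first_match and parse_raw_line are hypothetical. Several behaviours are assumptions rather than documented facts: whitespace is stripped only around the separators, keywords are compared case-insensitively, and a raw line is split at its first '+' and then at the first '-' of the name part.

```python
def parse_map_line(line):
    """Split one map-file line into its ':'-separated fields.

    Returns None for blank or comment-only lines. Assumption: whitespace
    is stripped around each field, not inside it.
    """
    line = line.split("#", 1)[0].strip()  # anything after '#' is a comment
    if not line:
        return None
    return [field.strip() for field in line.split(":")]


def first_match(keywords, metadata):
    """Return the value of the first listed keyword found in metadata.

    keywords is the comma-separated third field of a map line; metadata
    maps lower-cased metadata names to values. Returns None if no keyword
    matches. Case-insensitive matching is an assumption.
    """
    for keyword in keywords.split(","):
        key = keyword.strip().lower()
        if key in metadata:
            return metadata[key]
    return None


def parse_raw_line(line):
    """Split a raw 'order-name+value' line into (order, name, value).

    Assumption: the name contains no '+', so the first '+' is the separator.
    """
    name_part, value = line.rstrip("\n").split("+", 1)
    order, name = name_part.split("-", 1)
    return int(order), name, value


if __name__ == "__main__":
    print(parse_map_line("Title : S : DC.title, title"))
    print(first_match("DC.title, title", {"title": "Home"}))
    print(parse_raw_line("3-dc.creator+Mark Burrell"))
```

For example, parse_raw_line("3-dc.creator+Mark Burrell") returns (3, "dc.creator", "Mark Burrell"), and first_match("DC.title, title", {"title": "Home"}) returns "Home" because no dc.title entry is present.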
Files available for downloading
As normal, no liability accepted. Use at your own risk... blah, blah. Happy harvesting.
Further Information
If you require further information, or think you have found a bug, or merely want to send your comments on dc.bot, then please send email to Mark Burrell, ADAM and VADS Technical Officer.
© Surrey Institute of Art & Design on behalf of the ADAM Consortium. Conditions of Use are available.