COINS Spider Installation
HARDWARE REQUIREMENTS
The COINS Spider runs well on any reasonably fast computer (e.g. a 1.5 GHz or faster processor). Memory requirements depend on how many parallel threads you run. A minimum of 256MB of RAM is required for a normal crawl, but 512MB is recommended if you want to perform wide crawls.
SYSTEM REQUIREMENTS
The COINS Spider crawler was built and tested primarily on Linux using a Debian/Ubuntu Server distribution. It has seen some informal use on Macintosh, but it is not tested, packaged, or supported on platforms other than Linux at this time.
- (Required) Sun JDK 1.4 or higher. You can download the Java SE Development Kit (JDK) from the Sun Website.
- (Required) Apache Webserver 2.0 or higher. You can download and install it directly from the official Apache Website. On Debian/Ubuntu systems you can simply install it using the apt utility:
apt-get install apache2
- (Required) ImageMagick, a set of graphics libraries used by the image filtering operations to provide a coherent set of pictures to the COINS image recognition tools. The libraries can be downloaded from the official ImageMagick Website or installed using the apt utility of Debian-based Linux distributions:
apt-get install imagemagick
- (Optional) Lynx, a text-mode web browser used to create custom seeds starting from a set of keywords. If you prefer to set your seeds manually, you do not need to install it. Version 2.8.6 is available at this link. On Debian/Ubuntu systems simply type:
apt-get install lynx
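Once the packages are installed, a quick check like the one below can confirm that the tools are reachable from your shell. It is only a sketch: the command names (java, apache2, convert, lynx) are assumptions based on the usual Debian/Ubuntu packages, and check_tool is a hypothetical helper, not part of the COINS Spider distribution.

```shell
#!/bin/sh
# check_tool NAME NOTE: report whether NAME is on the PATH.
# Returns 0 if the command is found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1 ($2)"
        return 1
    fi
}

# Assumed command names; adjust them if your system uses different ones.
# apache2 usually lives in /usr/sbin, which may not be on a non-root PATH.
check_tool java    "required, JDK"         || true
check_tool apache2 "required, Apache"      || true
check_tool convert "required, ImageMagick" || true
check_tool lynx    "optional, Lynx"        || true
```

Finding java on the PATH does not by itself prove that a full JDK is installed, so treat the output as a first hint rather than a guarantee.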
GETTING STARTED
To start using the application, first download the latest release from the COINS Website and unpack it into a local folder on your hard drive. Unpacking creates a COINS_Spider folder that contains all the scripts and code needed to use the tool. No installation is required: the application is ready to run once a few configuration files are set.
SPIDER CONFIGURATION
In order to properly configure the COINS Spider to crawl the web looking for images of coins, you should enter the COINS_Spider/spider/ folder and set the following configuration parameters:
01 Edit the bin/runbot.sh file and fill in all the parameters of the "USER CONFIGURATION SECTION" (at the top of the file). In particular:
- Set the JAVA_HOME variable to the root of your JDK installation;
- Set the APACHE_HOME variable to the public folder of your Apache Webserver installation;
- Set the NUTCH_HOME variable to the absolute path of the COINS_Spider/spider folder;
- Set the images_folder variable. This is the folder in which you will find the images of coins collected once the crawling operations have finished;
- Set the threads variable. It determines the number of threads that will fetch in parallel. If your system has 512MB RAM you should keep this value between 10 and 20 to avoid Java out-of-memory errors. You can increase this value if more memory is available;
- Set the depth variable. It indicates the link depth from the root page that should be crawled. For large crawls you can consider a value between 50 and 100;
- Set the topN variable to determine the maximum number of pages that will be retrieved at each level up to the depth. You can leave this value blank to place no limit on the number of pages fetched and make the crawl dependent only on the link depth;
02 Edit the params/seeds.txt file. It contains a list of URLs that will be used by the spider as "seeds", the starting point for every crawl operation.
03 Edit the params/keys.txt file (optional). You can use this file to add a list of keyword-based, dynamically generated seeds to the flat list contained in the seeds.txt file. Just insert all the keywords you need and the tool will automatically generate the new seeds (usually 100) at runtime.
That's all. Everything is now ready for crawling.
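As an illustration of the settings described above, the sketch below shows what a filled-in set of values might look like. Every path and number is an assumption chosen for the example, the exact assignment syntax used inside bin/runbot.sh may differ, and the seeds file is written to a scratch folder (demo_dir, a hypothetical name) with one URL per line, which is also an assumption.

```shell
#!/bin/sh
# Illustrative values only: replace every path with the one on your system.
JAVA_HOME=/usr/lib/jvm/java-6-sun        # root of the JDK installation (assumed path)
APACHE_HOME=/var/www                     # public folder of Apache (assumed path)
NUTCH_HOME=/opt/COINS_Spider/spider      # absolute path of COINS_Spider/spider (assumed)
images_folder=/opt/COINS_Spider/images   # where collected coin images end up (assumed)
threads=10                               # 10 to 20 is suggested for 512MB of RAM
depth=50                                 # link depth from the root page
topN=                                    # blank: the crawl is limited by depth only

# Example seeds file, written to a scratch folder so nothing real is overwritten.
# The example.org / example.com URLs are placeholders, not real coin sites.
demo_dir=$(mktemp -d)
cat > "$demo_dir/seeds.txt" <<'EOF'
http://www.example.org/coins/
http://www.example.com/ancient-coins/
EOF

echo "threads=$threads depth=$depth topN=${topN:-unset}"
echo "example seeds written to $demo_dir/seeds.txt"
```

Starting with a small depth and a low thread count for a first trial run makes it easier to spot configuration mistakes before committing to a wide crawl.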
RUNNING THE SPIDER
To start the crawl simply invoke the main script from the COINS_Spider/spider folder:
bin/runbot.sh
If everything was set properly, you should see the tool running and the results of every fetching/indexing operation on your screen. You can find more detailed information about what is going on in the COINS_Spider/spider/logs/hadoop.log file.
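To get a quick impression of how a crawl is going, the log file can be summarized from the command line. The sketch below assumes that the log uses the usual log4j layout, where the level (INFO, WARN, ERROR) appears as a separate word on each line; count_level is a hypothetical helper, not part of the distribution.

```shell
#!/bin/sh
# count_level LEVEL FILE: print the number of lines in FILE that contain LEVEL
# as a space-delimited word (assumed log4j-style layout).
count_level() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    grep -c " $1 " "$2" || true
}

# Run from the COINS_Spider/spider folder; skipped quietly if no log exists yet.
log=logs/hadoop.log
if [ -f "$log" ]; then
    echo "errors:   $(count_level ERROR "$log")"
    echo "warnings: $(count_level WARN "$log")"
fi
```

While a crawl is in progress, tail -f logs/hadoop.log in a second terminal shows new log lines as they are written.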
THE SEARCH INTERFACE
At the end of the spidering operations, you will find the crawled images in the folder that you specified in the images_folder variable. You can also retrieve specific images by keywords using the COINS Spider search interface at http://localhost:8087/nutch/en/.
Just type some keywords in the search field and hit enter. A list of images and links to their original websites should appear as the result of your query.
FILTERING IMAGES
The COINS_Spider/spider/bin/ folder also contains a set of filtering scripts written to improve interoperability between the spider tool and the image recognition tools developed as part of the COINS project. The filtering scripts provide a coherent set of ancient coin images to be used in the recognition process. You can create a filtered set of coin images just by typing:
bin/filter_images.sh
This script will create a COINS_Spider/spider/filtered_images/ folder filled with the relevant images of coins. You can then use this folder with the COINS image recognition tools (for instance by making it available to Microsoft Windows applications through Samba).
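A simple way to see how many images survived the filtering is to count the files in the resulting folder. The sketch below assumes that the images carry common extensions (jpg, jpeg, png, gif), which may not match what the filtering script actually produces; count_images is a hypothetical helper.

```shell
#!/bin/sh
# count_images DIR: print the number of regular files under DIR whose name
# ends in a common image extension (case-insensitive; the list is an assumption).
count_images() {
    find "$1" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \
        -o -iname '*.png' -o -iname '*.gif' \) | wc -l | tr -d ' '
}

# Run from the COINS_Spider/spider folder after bin/filter_images.sh has finished.
dir=filtered_images
if [ -d "$dir" ]; then
    echo "$dir contains $(count_images "$dir") images"
fi
```

Running the same function on the folder set in images_folder gives a rough idea of how many crawled images the filter discarded.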