COINS Spider Installation

HARDWARE REQUIREMENTS

    The COINS Spider runs well on any reasonably fast computer (e.g. a 1.5 GHz or faster processor). Memory requirements depend on how many parallel threads you are going to run. A minimum of 256MB of RAM is required to run a normal crawl, but 512MB of RAM is recommended if you want to perform wide crawls.
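
A minimal sketch for comparing the installed memory against the figures above. The ram_advice function is a hypothetical helper, not part of the COINS Spider, and reading /proc/meminfo assumes a Linux system. The kernel usually reports slightly less than the physical RAM, so treat the result as approximate.

```shell
#!/bin/sh
# ram_advice is a hypothetical helper (not shipped with the COINS Spider).
# It maps a RAM size in MB onto the thresholds quoted in this guide.
ram_advice() {
    mb=$1
    if [ "$mb" -lt 256 ]; then
        echo "below the 256MB minimum"
    elif [ "$mb" -lt 512 ]; then
        echo "enough for a normal crawl"
    else
        echo "enough for wide crawls"
    fi
}

# On Linux, MemTotal in /proc/meminfo is given in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    ram_advice $((total_kb / 1024))
fi
```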

SYSTEM REQUIREMENTS

    The COINS Spider crawler was built and tested primarily on Linux, using a Debian/Ubuntu Server distribution. It has seen some informal use on Macintosh, but it is not tested, packaged, or supported on platforms other than Linux at this time.
  1. (Required) Sun JDK 1.4 or higher. You can download the Java SE Development Kit (JDK) from the Sun Website.
  2. (Required) Apache Webserver 2.0 or higher. You can download and install it directly from the official Apache Website. On Debian/Ubuntu systems you can simply install it using the apt utility:
     apt-get install apache2
  3. (Required) ImageMagick, a set of graphic libraries that we used for the image filtering operations to provide a coherent set of pictures to the COINS image recognition tools. The libraries can be downloaded from the official ImageMagick Website or using the apt utility of Debian-based Linux distributions:
    apt-get install imagemagick
  4. (Optional) Lynx, a text-mode web browser that we used to create custom seeds starting from a set of keywords. If you prefer to set your seeds manually, you do not need to install it. Version 2.8.6 is available at this link. On Debian/Ubuntu systems simply type:
    apt-get install lynx
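
Before going further, it can help to confirm that the tools listed above are actually on the PATH. The following is a sketch only: check_tool is a hypothetical helper, and the command names are assumptions (java for the JDK, apache2 for Apache on Debian/Ubuntu, convert for ImageMagick, lynx for Lynx). On Debian/Ubuntu the apache2 binary usually lives in /usr/sbin, which may not be on an ordinary user's PATH, so a "missing" result for it is not conclusive.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of the COINS Spider).
# $1 = command name to look up, $2 = "required" or "optional".
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing ($2): $1"
    fi
}

# Command names below are assumptions about a typical Debian/Ubuntu setup.
check_tool java required
check_tool apache2 required
check_tool convert required
check_tool lynx optional
```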

GETTING STARTED

    To start using the application, first download the latest release from the COINS Website and unpack it in a local folder on your hard drive. Unpacking creates a COINS_Spider folder that contains all the scripts and code needed to use the tool. No installation is required. The application will be up and running once a few configuration files are set.

SPIDER CONFIGURATION

    In order to properly configure the COINS Spider to crawl the web looking for images of coins, you should enter the COINS_Spider/spider/ folder and set the following configuration parameters:

01 Edit the bin/runbot.sh file and fill in all the parameters of the "USER CONFIGURATION SECTION" (at the top of the file). In particular:

  1. Set the JAVA_HOME variable to the root of your JDK installation;
  2. Set the APACHE_HOME variable to the public folder of your Apache Webserver installation;
  3. Set the NUTCH_HOME variable to the absolute path of the COINS_Spider/spider folder;
  4. Set the images_folder variable. This is the folder in which you will find the images of coins collected once the crawling operations have finished;
  5. Set the threads variable. It determines the number of threads that will fetch in parallel. If your system has 512MB of RAM you should keep this value between 10 and 20 to avoid excessive memory use or Java out-of-memory errors. You can increase this value if more memory is available;
  6. Set the depth variable. It indicates the link depth from the root page that should be crawled. For large crawls you can consider a value between 50 and 100;
  7. Set the topN variable to determine the maximum number of pages that will be retrieved at each level up to the depth. You can leave this value blank to place no limit on the number of pages fetched and to make the crawl dependent only on the link depth.
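
As an illustration of the seven settings above, a filled-in configuration section might look like the sketch below. The variable names come from this guide, but every path and value is a hypothetical example, and the exact syntax used in the shipped bin/runbot.sh (for instance whether variables are exported) may differ.

```shell
# USER CONFIGURATION SECTION (illustrative values only, all paths hypothetical)
JAVA_HOME=/usr/lib/jvm/java            # root of the JDK installation
APACHE_HOME=/var/www                   # public folder of the Apache Webserver
NUTCH_HOME=/opt/COINS_Spider/spider    # absolute path of COINS_Spider/spider
images_folder=/opt/COINS_Spider/images # where the collected coin images end up
threads=10                             # 10 to 20 is suggested for 512MB of RAM
depth=50                               # 50 to 100 is suggested for large crawls
topN=                                  # blank: no per-level page limit
```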

02 Edit the params/seeds.txt file. It contains a list of URLs that will be used by the spider as "seeds", the starting point for every crawl operation.

03 Edit the params/keys.txt (optional). You can use this file to add a list of keyword-based, dynamically generated seeds to the flat list contained in the seeds.txt file. Just insert all the keywords you need and the tool will automatically generate the new seeds (usually 100) at runtime.
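
The two parameter files can also be written from the command line, as in this sketch. It assumes one entry per line in both files, which the guide does not state explicitly; the URLs and keywords are placeholders. Run it from the COINS_Spider/spider folder, and note that it overwrites any existing content of both files.

```shell
#!/bin/sh
# Assumption: both files hold one entry per line.
mkdir -p params

# Placeholder seed URLs; replace them with real starting points.
cat > params/seeds.txt <<'EOF'
http://www.example.org/coins/
http://www.example.com/ancient-coins/index.html
EOF

# Placeholder keywords for the optional, dynamically generated seeds.
cat > params/keys.txt <<'EOF'
ancient coins
roman denarius
EOF
```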

That's all. Everything is now ready for crawling.

RUNNING THE SPIDER

To start the crawl simply invoke the main script from the COINS_Spider/spider folder:
bin/runbot.sh
    If everything was set properly, you should see the tool running and the results of every fetching/indexing operation on your screen. For more detailed information about what is going on, look at the COINS_Spider/spider/logs/hadoop.log file.
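
When the log grows large, a small filter can make problems easier to spot. The sketch below assumes that log lines carry a level keyword such as WARN or ERROR, which is typical of log4j-style logs but is not confirmed by this guide; recent_problems is a hypothetical helper.

```shell
#!/bin/sh
# recent_problems is a hypothetical helper (not part of the COINS Spider).
# $1 = log file, $2 = how many matching lines to show (default 20).
# Assumption: warning and error lines contain the word WARN or ERROR.
recent_problems() {
    log_file=$1
    count=${2:-20}
    grep -E 'WARN|ERROR' "$log_file" | tail -n "$count"
}

# Run from the COINS_Spider/spider folder once a crawl has started.
if [ -f logs/hadoop.log ]; then
    recent_problems logs/hadoop.log 20
fi
```

To follow the log live while the crawl runs, tail -f logs/hadoop.log in a second terminal works as well.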

THE SEARCH INTERFACE

    At the end of the spidering operations, you will find the crawled images in the folder that you specified in the images_folder variable. You can also retrieve specific images by keywords using the COINS Spider search interface at http://localhost:8087/nutch/en/. Just type some keywords in the search field and hit enter. A list of images and links to their original websites should appear as the result of your query.

FILTERING IMAGES

    The COINS_Spider/spider/bin/ folder also contains a set of filtering scripts written to improve interoperability between the spider tool and the image recognition tools developed as part of the COINS project. The filtering scripts provide a coherent set of ancient coin images for use in the recognition process. You can create a filtered set of coin images just by typing:
bin/filter_images.sh
    This script will create a COINS_Spider/spider/filtered_images/ folder filled with the relevant images of coins. You can then use this folder with the COINS image recognition tools (for instance by making it available to Microsoft Windows applications through Samba).
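
A quick way to see how many images survived the filtering is to count the files in that folder. This is a sketch: count_images is a hypothetical helper, and the list of file extensions is an assumption about what the spider collects.

```shell
#!/bin/sh
# count_images is a hypothetical helper (not part of the COINS Spider).
# It counts regular files under $1 whose names end in a common image
# extension; the extension list is an assumption.
count_images() {
    find "$1" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \
        -o -iname '*.png' -o -iname '*.gif' \) | grep -c ''
}

# Run from the COINS_Spider/spider folder after bin/filter_images.sh.
if [ -d filtered_images ]; then
    echo "filtered images: $(count_images filtered_images)"
fi
```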
 

THE COINS PROJECT
COINS is funded by the European Commission under the Community's Sixth Framework Programme, contract no. 044450. However, this site reflects only the authors' views and the European Community is not liable for any use that may be made of the information contained herein.