Documentation

Document Version 1.0.1.0
Copyright © 2013 Narrowteq

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior written permission of Narrowteq.

Narrowteq MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.

Darcy Ripper IS A REGISTERED TRADEMARK OF Narrowteq. OTHER PRODUCT NAMES AND SERVICE NAMES ARE THE TRADEMARKS OR REGISTERED TRADEMARKS OF THEIR RESPECTIVE OWNERS AND ARE USED FOR IDENTIFICATION ONLY.


The Darcy Ripper user manual is organized as follows:


About


Darcy Ripper is a powerful pure Java multi-platform web crawler (web spider) with great workload and speed capabilities. It is a standalone, multi-platform Graphical User Interface application that can be used by casual users as well as programmers to download web resources on the fly.

Based on proven Java technology, the intuitive Darcy GUI is easy-to-use and provides robust functionality for creating and running simple or complex download jobs.


Features and Benefits


Darcy Ripper offers a long list of features that improve the efficiency of the download process in terms of processing time, network time, memory use and accuracy.

Graphical User Interface

  1. Multi-platform;
  2. Real-time view of the download job progress;
  3. Pause/Resume/Stop download job any time;
  4. Save and Load download job template files;
  5. Regular Expression Tester;
  6. Check for Updates support;
  7. Online Help and support.

General Download Features

  1. Multi-threaded – configurable number of download jobs that run in parallel;
  2. Memory control options – user can control what happens to download jobs after they finish;
  3. Multiple starting points (URLs) for download job – user can specify multiple hosts on which a download job can run.

HTTP Connection Features

  1. HTTP/HTTPS support;
  2. GZip compression support;
  3. HTTP Proxy support;
  4. WWW Authentication support;
  5. Cookies support;
  6. Request customization support: referral behaviour, configurable agent name;
  7. HTTP response code analysis and configurable behaviour;
  8. Connection limits support – maximum number of connections per server, control of the number of retries, bandwidth limitation, limitation depending on the HTTP response code.

Download Control Features

  1. Maximum search depth support;
  2. Maximum number of followed links support;
  3. Maximum time limit support;
  4. Downloaded file size support;
  5. Followed URL prefix support;
  6. Hostname limitation support;
  7. Save to Disk limitation support;
  8. Response behaviour limitation matching response header with regular expressions;
  9. Response behaviour limitation matching response content with regular expressions;
  10. Downloaded file content limitation support.

GUI Overview


Darcy Ripper offers an intuitive and robust interface that makes it easier to create, load and run download jobs (Job Packages) in a transparent and secure manner.

The following sections are available:


Menu/Tool bar


File Menu

The File menu includes the following commands:

  • New – Creates a new Job Package and launches the Job Package Configuration dialog;
  • Open… – Opens an existing Job Package file and loads its configuration;
  • Edit – Opens the Job Package Configuration dialog for the currently selected Job Package;
  • Save – Saves the currently selected Job Package to a file;
  • Save As… – Saves the currently selected Job Package to a different file;
  • Save All – Saves all the opened Job Packages to files;
  • Close – Closes the currently selected Job Package;
  • Close All – Closes all the opened Job Packages;
  • Exit – Exits the application.

Job Menu

The Job menu includes the following commands:

  • Start – Starts processing the download of the currently selected Job Package;
  • Pause – Pauses the currently running download process;
  • Stop – Stops the currently running download process;
  • Clear – Clears the data associated with the currently selected Job Package;
  • Last Statistics – Retrieves the last statistics available for the currently selected Job Package.

Utilities Menu

The Utilities menu includes the following commands:

  • Regular Expressions Editor – Starts the regular expressions editor dialog.

Help Menu

The Help menu includes the following commands:

  • Help – Launches the Darcy Ripper Help dialog;
  • Send Feedback – Opens the default system browser and launches the Darcy Ripper Feedback URL;
  • Check for Updates… – Checks if there are any Darcy Ripper updates available for download;
  • About – Provides a few details regarding the current Darcy Ripper application.

Tool bar

The application’s tool bar contains the following commands:

  • New – Creates a new Job Package and launches the Job Package Configuration dialog;
  • Open… – Opens an existing Job Package file and loads its configuration;
  • Edit – Opens the Job Package Configuration dialog for the currently selected Job Package;
  • Save – Saves the currently selected Job Package to a file;
  • Save As… – Saves the currently selected Job Package to a different file;
  • Save All – Saves all the opened Job Packages to files;
  • Start – Starts processing the download of the currently selected Job Package;
  • Pause – Pauses the currently running download process;
  • Stop – Stops the currently running download process;
  • Clear – Clears the data associated with the currently selected Job Package;
  • Last Statistics – Retrieves the last statistics available for the currently selected Job Package.

Job Package Overview


This main window section gives the user an overview of the Job Package configuration as well as the entire Job Package download process.
The following sections are available:

  • Configuration – Displays a summary of the Job Package Settings;
  • In Progress – Displays all the Job Package download connections that are being processed at a given moment. A connection that is being processed may be a page that is being downloaded, or a page that has already been downloaded and is being processed so that its links can be extracted;
  • Opened – Displays all the Job Package download connections that are open at a given moment. An open connection refers to a page that is being downloaded at that particular moment;
  • Finished – Displays all the Job Package processed pages (downloads);
  • All – Displays all the Job Package download connections.

Utilities



History

This facility makes it easier for the user to examine past statistics obtained by running Job Packages. This section contains the full history of Job Package runs, and each of these runs can be analyzed in detail by double-clicking it.


Regular Expressions Editor

This internal tool makes it easier for the user to check the regular expressions used in the Job Package configuration process.

Regular expressions syntax

Examples

.*sometext.*

Fully matches every line containing the text “sometext”.

com

Matches every line containing the text “com”.
Matches: “http://www.darcyripper.com”.
Does not match: “http://www.darcyripper.org/download.html”.

.*\.com$

Fully matches every line ending with “.com”.
Matches: “http://www.darcyripper.com”.
Does not match: “http://www.darcyripper.com/download.html”.

.*\.com$|.*\.org$

Fully matches every line ending with “.com” or “.org”.
Matches: “http://www.darcyripper.com”.
Does not match: “http://www.darcyripper.com/download.html”.

Special characters:

\ Indicates that the next character is not special and should be interpreted literally;
. Any character except newline;
\. A period (and so on for *, (, \, etc.);
? Zero or one of the preceding element;
* Zero or more of the preceding element;
+ One or more of the preceding element;
^ The start of the string;
$ The end of the string;
\d,\w,\s A digit, word character [A-Za-z0-9_], or whitespace;
\D,\W,\S Anything except a digit, word character, or whitespace.
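
The manual does not state which regular expression engine Darcy Ripper uses. Since the application is written in Java, the sketch below assumes the standard java.util.regex package and replays the examples above. The class and method names are hypothetical. It also shows the difference between a full match (the whole line must match) and a partial match (the pattern may occur anywhere in the line), which matters for a short pattern such as "com".

```java
import java.util.regex.Pattern;

class RegexExamples {
    // Full match: the whole line must match the pattern.
    static boolean fullyMatches(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    // Partial match: the pattern may match anywhere in the line.
    static boolean containsMatch(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String site = "http://www.darcyripper.com";
        String page = "http://www.darcyripper.com/download.html";

        // A line ending with ".com" is fully matched, a longer URL is not.
        System.out.println(fullyMatches(".*\\.com$", site));
        System.out.println(fullyMatches(".*\\.com$", page));

        // The alternation accepts lines ending with ".com" or ".org".
        System.out.println(fullyMatches(".*\\.com$|.*\\.org$", "http://www.darcyripper.org"));

        // "com" alone is found inside the line but does not fully match it.
        System.out.println(containsMatch("com", site));
        System.out.println(fullyMatches("com", site));
    }
}
```

If a pattern behaves differently in the Regular Expressions Editor than in this sketch, trust the editor, because it reflects what the application actually does.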

Check For Updates


Darcy Ripper offers the possibility of checking if any newer versions are available for download.

We encourage you to check for new Darcy Ripper versions from time to time. This ensures that your application has the latest features and capabilities.

If a newer version is available, you can choose to download (install) the update.

Note: On some operating systems, special write permissions may be needed for the update to work as expected. Before downloading an update, please ensure that you have sufficient access rights to install it. A password, usually the administrator’s or root password, may be required.


Job Package Configuration


Darcy Ripper supports many settings by means of which the Job Package download process can be controlled and limited.

The following sections are available:

  1. Basic Settings
  2. Connection
  3. Custom Rules

Basic Settings


Specify here basic Job Package properties to be used for managing the download job.
The available primary settings are:

  • Name – Defines the Job Package name. Multiple Job Packages can have the same name, but we do not recommend this approach because it leads to confusion when organizing Job Packages. This is a mandatory field;
  • URL(s) – One or more URLs from which the Job Package will start its processing. The URL(s) specified here must be valid (according to RFC 3986), otherwise Darcy will signal the invalidity with an error. Multiple URLs can be added by pressing the “Add…” button. This is a mandatory field;
  • Save Path – The absolute path of the directory where downloaded resources must be saved. This is a mandatory field.


Concurrency Settings

  • Parallel Downloads – The maximum number of parallel downloads that can run at the same time. This is a mandatory field.
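
The manual does not describe how the limit is enforced internally. As an illustration only, a common way to cap parallel work in Java is a fixed-size thread pool. In the sketch below the class name, the method name and the sleeping placeholder task are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelDownloads {
    // Runs `jobs` placeholder tasks on at most `parallelDownloads` threads and
    // returns the highest number of tasks that were observed running at once.
    static int runAndMeasurePeak(int parallelDownloads, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelDownloads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // stands in for a real download
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // With a limit of 3, the observed peak never exceeds 3.
        System.out.println(runAndMeasurePeak(3, 12));
    }
}
```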

Memory Settings

These settings help save a lot of memory, as download information will not be kept in the application’s memory.

  • Drop Ignored Links – If checked, all the links that have been ignored (rules not satisfied, limits imposed, etc.) will be removed from memory and will not be present in the overall results;
  • Drop Finished Links – If checked, all the links that have been downloaded and processed completely will be removed from memory and will not be present in the overall results.

Edit URLs List


This section offers the possibility of setting up multiple URLs from which the Job Package will start its processing.

All the other Job Package settings apply to every URL defined here, meaning that different rules cannot be set for individual URLs. To use different rules for different URLs, multiple Job Packages must be defined.

Processing multiple Job Packages at the same time is not currently supported, but we are working on it.

Each URL must be defined on a single line.

Note: The URLs list must not be empty.

Note: Each URL must be valid (according to RFC 3986), otherwise Darcy will signal the invalidity with an error.
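
To pre-check a list before pasting it in, something like the following sketch can be used. It is not Darcy Ripper's own validation. It relies on java.net.URI, which follows the older RFC 2396 grammar, so it only approximates an RFC 3986 check. The class name and the extra requirement of an http or https scheme with a host are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

class UrlListCheck {
    // Returns the non-empty lines that do not parse as http/https URLs with a host.
    static List<String> invalidLines(String urlList) {
        List<String> invalid = new ArrayList<>();
        for (String line : urlList.split("\\R")) { // one URL per line
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            try {
                URI uri = new URI(candidate);
                String scheme = uri.getScheme();
                boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
                if (!http || uri.getHost() == null) {
                    invalid.add(candidate);
                }
            } catch (URISyntaxException e) {
                invalid.add(candidate);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String list = "http://www.darcyripper.com\nhttp:www.example.com/section1/\nnot a url";
        // Prints the two lines that fail the check.
        System.out.println(invalidLines(list));
    }
}
```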


Connection


This section lets you configure settings, from simple to complex, for all connections created during the Job Package download process.

The following sub-sections are available:

  1. Basic Connections Settings

  2. Authentication Settings

  3. Cookie Settings

  4. HTTP Response Settings


Basic Connections Settings


Specify here the main settings with regard to the network connections established during a Job Package download process. The available primary settings are:

  • Connections Limit – Defines the maximum number of connections that this Job Package is allowed to create. Once this limit is reached, the Job Package download process will stop. This limitation applies to a single server, that is, to a single host. For an unlimited number of connections, set this limit to “-1”. The default value of this field is “-1”;
  • Retries – Defines the number of times Darcy must try to retrieve a certain web resource when the first try resulted in an error. If this value is reached and the server has not yet provided the resource, that resource will be considered unreachable. The default value is “3”.

Proxy Settings

  • Address
    The address of the proxy server;

  • Port
    The port on which the proxy server is listening;

  • User
    The user name authentication detail to be used for proxy server connection;

  • Password
    The user password authentication detail to be used for proxy server connection.

Request Settings

  • Send Referral – Signals that the “Referer” request header must be added to the sent requests;
  • User Agent – Defines the “User-Agent” request header that must be added to the sent requests.

Bandwidth Limit

  • Bandwidth Limitation – The bandwidth that Darcy must not exceed during the Job Package download process. This limitation applies to a single download thread.

Authentication Settings


Specify here the authentication details that must be used in the Job Package download process in order to authenticate to a certain website.

Note: These credentials will only be used for WWW Authentication mechanisms. The login through HTML forms will not use these details.

The available credential properties are:

  • Hostname
  • User Name
  • User Password

Cookie Settings


Specify here the cookie information that must be used in the Job Package download process.

These cookies can be used for authentication to certain web sites. In order to add cookies here, you must log in to that particular web site using your favorite browser, then analyze the obtained cookies and add them here.

Note: Even if most HTTP servers do not take the order of cookies into account, there have been reports that some HTTP servers do validate authentications based on the cookie order.

The available cookie details are:

  • Name – The name of the cookie to be used;
  • Value – The value of the cookie to be used;
  • Domain – The domain on which the cookie is valid;
  • Path – The path on which the cookie is valid.
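
The four details above are what is needed to decide whether a cookie belongs on a given request. The following sketch shows one way the “Cookie” request header could be assembled while preserving the configured order. It is not taken from Darcy Ripper. The class names are hypothetical, and the domain and path checks are deliberately simplified compared with the full cookie matching rules.

```java
import java.util.ArrayList;
import java.util.List;

class CookieHeader {
    static final class Cookie {
        final String name, value, domain, path;

        Cookie(String name, String value, String domain, String path) {
            this.name = name;
            this.value = value;
            this.domain = domain;
            this.path = path;
        }
    }

    // Simplified rule: the host equals the cookie domain or is a subdomain of it.
    static boolean domainMatches(String cookieDomain, String host) {
        String d = cookieDomain.startsWith(".") ? cookieDomain.substring(1) : cookieDomain;
        String h = host.toLowerCase();
        d = d.toLowerCase();
        return h.equals(d) || h.endsWith("." + d);
    }

    // Builds the Cookie header value for a host and request path,
    // keeping the cookies in the order in which they were configured.
    static String headerFor(List<Cookie> configured, String host, String requestPath) {
        List<String> pairs = new ArrayList<>();
        for (Cookie c : configured) {
            // Simplified path rule: plain prefix comparison.
            if (domainMatches(c.domain, host) && requestPath.startsWith(c.path)) {
                pairs.add(c.name + "=" + c.value);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        List<Cookie> cookies = new ArrayList<>();
        cookies.add(new Cookie("session", "abc", "example.com", "/"));
        cookies.add(new Cookie("lang", "en", "example.com", "/docs"));
        cookies.add(new Cookie("other", "x", "example.org", "/"));
        System.out.println(headerFor(cookies, "www.example.com", "/docs/index.html"));
    }
}
```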

HTTP Response Settings


Define the behaviour of Darcy Ripper for certain HTTP status codes. For instance, if a server sends 403 (Forbidden) after too many downloads, a retry and a waiting time can be defined here.
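
As an illustration of the idea, a per-status-code rule can be thought of as a pair of values: how many retries are allowed and how long to wait before each one. The sketch below is a hypothetical model, not the application's actual data structure. The names and the convention of returning -1 for "do not retry" are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class ResponseRules {
    static final class Rule {
        final int maxRetries;
        final long waitMillis;

        Rule(int maxRetries, long waitMillis) {
            this.maxRetries = maxRetries;
            this.waitMillis = waitMillis;
        }
    }

    private final Map<Integer, Rule> rules = new HashMap<>();

    // Registers a retry rule for one HTTP status code.
    void on(int statusCode, int maxRetries, long waitMillis) {
        rules.put(statusCode, new Rule(maxRetries, waitMillis));
    }

    // Returns how long to wait before the next retry,
    // or -1 when the request should not be retried.
    long waitBeforeRetry(int statusCode, int retriesSoFar) {
        Rule rule = rules.get(statusCode);
        if (rule == null || retriesSoFar >= rule.maxRetries) {
            return -1;
        }
        return rule.waitMillis;
    }

    public static void main(String[] args) {
        ResponseRules rules = new ResponseRules();
        rules.on(403, 3, 60_000L); // on 403: up to 3 retries, one minute apart
        System.out.println(rules.waitBeforeRetry(403, 0));
        System.out.println(rules.waitBeforeRetry(403, 3));
    }
}
```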


Custom Rules


This section enables you to set up different behaviour rules to be considered during the Job Package download process.

The following sub-sections are available:

  1. Basic Rules
  2. Request Filters
  3. Reply Content Filters

Basic Rules


Specify here special rules to customize the Job Package download process.
The available basic rules are:

  • Depth – Defines the maximum recursion depth that may be reached during a Job Package download job;
  • Links Limit – Defines the maximum number of links that may be followed during the Job Package download process. When this limit is reached, the download process will stop;
  • Time Limit – Defines the maximum time (in milliseconds) that a Job Package download process must not exceed. When this time limit is reached, the download process will stop.

File Size Filter

These settings allow decisions to be made depending on the file size of a web resource. For example, they can be used to avoid downloading large files.

The available file size filter properties are:

  • File Size (from)
    Defines the start value of the file size interval, from which files will be considered by this filter;

  • File Size (to)
    Defines the end value of the file size interval, up to which files will be considered by this filter;

  • Reply “Content-Length” not available action
    Defines the action that must be taken for a file whose size is between File Size (from) and File Size (to). At this moment the two possible values are:
    • Save To Disk: saves the file to disk;

    • Reject File: rejects that particular file and will not download it.
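
The decision described above can be modelled roughly as follows. This is a sketch under several assumptions that the manual does not confirm: sizes are taken to be in bytes, the interval is treated as inclusive at both ends, and a file whose size is outside the interval or unknown is simply left to the other filters. All names are hypothetical.

```java
import java.util.Optional;

class FileSizeFilter {
    enum Action { SAVE_TO_DISK, REJECT_FILE }

    private final long fromBytes;
    private final long toBytes;
    private final Action action;

    FileSizeFilter(long fromBytes, long toBytes, Action action) {
        this.fromBytes = fromBytes;
        this.toBytes = toBytes;
        this.action = action;
    }

    // Returns the configured action when the size lies in [fromBytes, toBytes];
    // returns empty when the size is outside the interval or unknown (negative).
    Optional<Action> decide(long contentLength) {
        if (contentLength < 0) {
            return Optional.empty();
        }
        boolean inRange = contentLength >= fromBytes && contentLength <= toBytes;
        return inRange ? Optional.of(action) : Optional.empty();
    }

    public static void main(String[] args) {
        // Reject everything of 10 MB or more.
        FileSizeFilter filter = new FileSizeFilter(10_000_000L, Long.MAX_VALUE, Action.REJECT_FILE);
        System.out.println(filter.decide(50_000_000L));
        System.out.println(filter.decide(1_024L));
    }
}
```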

Limits

Max. recursion depth:

The recursion depth defines how deep Darcy Ripper should crawl through linked web sites.

Time limit

Defines a time limit. If the time limit is reached, no more links are added to the “open” list. After all links in the “open” list are finished, the download ends.

Max. links to follow

Defines the maximum number of links (URLs) to follow. When the limit is reached, no more links are added to the “open” list. After all links in the “open” list are finished, the download ends.
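
The interaction between the three limits and the “open” list can be sketched as a breadth-first traversal. The example below works on an in-memory map of pages and their links instead of the real web, and it is not Darcy Ripper's implementation. It assumes that the starting page has depth 0 and counts towards the links limit. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CrawlLimits {
    // links: page -> pages it links to (a stand-in for downloaded content).
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int maxDepth, int maxLinks, long timeLimitMillis) {
        long deadline = System.currentTimeMillis() + timeLimitMillis;
        Deque<String> open = new ArrayDeque<>();
        Map<String, Integer> depthOf = new HashMap<>(); // every link seen so far
        List<String> finished = new ArrayList<>();
        open.add(start);
        depthOf.put(start, 0);
        while (!open.isEmpty()) {
            String page = open.poll();
            finished.add(page); // the page counts as downloaded here
            int depth = depthOf.get(page);
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                boolean limitReached = depth + 1 > maxDepth
                        || depthOf.size() >= maxLinks
                        || System.currentTimeMillis() >= deadline;
                if (limitReached) {
                    continue; // nothing new is added, the open list just drains
                }
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    open.add(next);
                }
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("c", Arrays.asList("d"));
        links.put("d", Arrays.asList("e"));
        // Depth limit 1: only the start page and its direct links are visited.
        System.out.println(crawl(links, "a", 1, 100, 10_000L));
    }
}
```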

URL prefix Filter

Defines a prefix for the URL. When set, only URLs that begin with the prefix are accepted. This can be handy if only a specific directory should be downloaded. Only a plain string is allowed, no regular expressions.

Example: http://www.example.com/section1/

Hostname Filter

A hostname filter can be set if Darcy Ripper should follow only links whose hostname matches a regular expression. To do so, remove the “.*” entry from the “Allowed Hostname” box and add something like “.*google.de”. In this case Darcy Ripper will only retrieve files from a host like “images.google.de”, “google.de” or “www.google.de”. Be careful not to remove all entries from the filter list, because then no hostname is allowed.

“Save to Disk” Filter

This filter controls which file types are saved on disk. Only files matching one of the regular expressions are saved. For example, to accept only JPEG files, remove the “.*” entry from the list and add “.*jpg$”. If all entries are removed from the filter list, no files will be saved on disk.
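
The three filters above (URL prefix, hostname and “Save to Disk”) can be tried out with a few lines of Java. The sketch assumes that the filter lists use java.util.regex with full-match semantics, which the manual does not state explicitly. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilters {
    // URL prefix filter: a plain string comparison, no regular expressions.
    static boolean hasPrefix(String url, String prefix) {
        return url.startsWith(prefix);
    }

    // List filter: accepted when at least one expression fully matches.
    // An empty list therefore accepts nothing.
    static boolean matchesAny(List<String> regexes, String text) {
        for (String regex : regexes) {
            if (Pattern.matches(regex, text)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(".*google.de");
        System.out.println(matchesAny(hosts, "images.google.de"));
        System.out.println(matchesAny(hosts, "www.example.com"));

        List<String> save = Arrays.asList(".*jpg$");
        System.out.println(matchesAny(save, "http://www.example.com/a/photo.jpg"));

        // With every entry removed, nothing passes the filter.
        System.out.println(matchesAny(Collections.<String>emptyList(), "google.de"));
    }
}
```

Note that the dot in “.*google.de” is not escaped, so it stands for any character. Under the assumptions above the expression would also accept a host such as “notgoogle.de”. Writing “.*google\.de” makes the dot literal.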


Request Filters


To download only files with a specific size, this rule can be used.


Reply Content Filters Rules


With a Regexp Chain Filter you can freely set a regular expression to match a URL, together with the actions that should be executed when a URL matches the regular expression. These actions can be:

  • Accept the URL

    The URL will be retrieved and parsed for URLs. This doesn’t mean it will be saved on disk! This depends on the settings made in the “Files to download” box on the first tab.

  • Reject the URL

    The URL will not be retrieved and therefore not saved on disk.

  • Change the priority

    This field takes a value which will be added to the original priority. This value can also be negative.

    Here is an example:
    Priority: 500, Change Value: 50, Final Priority: 550
    Priority: 500, Change Value: -20, Final Priority: 480

  • Advise Darcy Ripper not to resume a file when it’s already on disk.

    This is useful when resuming a large download.

Each of these rules can be applied both for the case when the regular expression matches and for the case when it does not match.
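
A single rule of this kind can be pictured as a regular expression with one set of actions for the matching case and another for the non-matching case. The sketch below is a hypothetical model that reproduces the priority arithmetic from the example above. It assumes full-match semantics and covers only the accept, reject and priority actions.

```java
import java.util.regex.Pattern;

class RegexpChainRule {
    enum Action { ACCEPT, REJECT, NONE }

    private final Pattern pattern;
    private final Action onMatch;
    private final int priorityChangeOnMatch;
    private final Action onNoMatch;
    private final int priorityChangeOnNoMatch;

    RegexpChainRule(String regex, Action onMatch, int priorityChangeOnMatch,
                    Action onNoMatch, int priorityChangeOnNoMatch) {
        this.pattern = Pattern.compile(regex);
        this.onMatch = onMatch;
        this.priorityChangeOnMatch = priorityChangeOnMatch;
        this.onNoMatch = onNoMatch;
        this.priorityChangeOnNoMatch = priorityChangeOnNoMatch;
    }

    private boolean matches(String url) {
        return pattern.matcher(url).matches();
    }

    // Picks the action configured for the matching or the non-matching case.
    Action actionFor(String url) {
        return matches(url) ? onMatch : onNoMatch;
    }

    // Adds the configured change value (possibly negative) to the priority.
    int priorityFor(String url, int originalPriority) {
        int change = matches(url) ? priorityChangeOnMatch : priorityChangeOnNoMatch;
        return originalPriority + change;
    }

    public static void main(String[] args) {
        RegexpChainRule rule = new RegexpChainRule(".*\\.jpg$", Action.ACCEPT, 50, Action.NONE, -20);
        System.out.println(rule.priorityFor("http://www.example.com/a.jpg", 500));  // 550
        System.out.println(rule.priorityFor("http://www.example.com/a.html", 500)); // 480
    }
}
```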