 
 

Project description

Distributed HarvestMan, alias D-HarvestMan [DHM], is a distributed Web crawler implemented in the Python [PYT] programming language. It is being developed to meet the requirements of the EIAO (European Internet Accessibility Observatory) project [EIAO], in which it is intended to be one of the central system components for large-scale assessment of website accessibility.

D-HarvestMan builds on the existing HarvestMan [HMAN] project and adds three new aspects to it: efficiency, parallelism and fault tolerance.

D-HarvestMan consists of a single Manager instance and a set of Slave instances. The Manager is responsible for coordinating the overall crawl. The Slave instances perform most of the actual crawling.
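The Manager/Slave split can be illustrated with a minimal sketch. It is not taken from the D-HarvestMan code base: the class names `Manager` and `Slave`, the `dispatch` method and the hash-by-host assignment policy are all assumptions chosen for illustration, and the real project may distribute work differently.

```python
import hashlib
from urllib.parse import urlparse


class Slave:
    """Hypothetical worker that records the URLs it has been asked to crawl."""

    def __init__(self, name):
        self.name = name
        self.queue = []

    def enqueue(self, url):
        self.queue.append(url)


class Manager:
    """Hypothetical coordinator that assigns each URL to exactly one Slave."""

    def __init__(self, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.slaves = list(slaves)

    def pick_slave(self, url):
        # Assumed policy: hash the host name so that all URLs of one
        # site are handled by the same Slave.
        host = urlparse(url).netloc.lower()
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.slaves)
        return self.slaves[index]

    def dispatch(self, urls):
        # The Manager only coordinates; the crawling itself would
        # happen inside the Slaves.
        for url in urls:
            self.pick_slave(url).enqueue(url)
```

Keeping one site on one Slave is a common choice because it makes per-site politeness limits easy to enforce, but it is only one of several possible assignment policies.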

Using a service discovery mechanism, the Manager continuously looks for machines in the network that can host new Slave instances and starts Slaves on them remotely.
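The discovery loop can be sketched as follows. The page does not describe the actual mechanism, so everything here is an assumption: machines announce themselves to an in-memory `DiscoveryRegistry`, announcements expire after a time-to-live, and a hypothetical `start_slave` callback stands in for launching a Slave on a remote machine.

```python
import time


class DiscoveryRegistry:
    """Hypothetical in-memory registry of machines that announced themselves."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}  # address -> time of the most recent announcement

    def announce(self, address):
        self.last_seen[address] = self.clock()

    def alive(self):
        # An address counts as alive while its last announcement is recent.
        now = self.clock()
        return {a for a, t in self.last_seen.items() if now - t <= self.ttl}


class DiscoveringManager:
    """Hypothetical Manager that starts a Slave on each newly seen machine."""

    def __init__(self, registry, start_slave):
        self.registry = registry
        self.start_slave = start_slave  # assumed callback: launches a remote Slave
        self.known = set()

    def poll(self):
        """Start a Slave on every live address not seen before; return those addresses."""
        alive = self.registry.alive()
        new = sorted(alive - self.known)
        for address in new:
            self.start_slave(address)
        # Expired addresses are forgotten, so a machine that comes
        # back later is treated as new again.
        self.known = alive
        return new
```

In a real deployment `poll` would run periodically, and the registry would be fed by network announcements rather than direct method calls.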

The D-HarvestMan architecture allows any running instance to operate as the Manager. If the Manager crashes, a Slave instance can take over its responsibilities, which makes the system fault tolerant as well as distributed.
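One simple way to realise this kind of takeover is a deterministic election among the instances that are still running. The sketch below assumes a "lowest id wins" rule and an `alive` flag that some failure detector would set. Neither is described on this page, and the names `Instance` and `elect_manager` are hypothetical.

```python
class Instance:
    """Hypothetical D-HarvestMan instance that can act in either role."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.alive = True   # assumed to be maintained by a failure detector
        self.role = "slave"


def elect_manager(instances):
    """Promote the live instance with the lowest id; all others become slaves."""
    live = [inst for inst in instances if inst.alive]
    if not live:
        raise RuntimeError("no live instances left")
    manager = min(live, key=lambda inst: inst.instance_id)
    for inst in instances:
        inst.role = "manager" if inst is manager else "slave"
    return manager
```

Because every instance applies the same rule to the same membership list, they all agree on the new Manager without further coordination. Keeping that membership list consistent is the hard part and is outside the scope of this sketch.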

The prototype implementation of D-HarvestMan has been tested on multiple separate computers. The tests provide a proof of concept for several of the architectural choices and show a significant efficiency increase. The prototype has also been used to identify potential bottlenecks for a future production implementation.

 

Project documents

Project plan
D-HarvestMan Arch. v0.8
Final report
First presentation
Second presentation
Third presentation
 

Links

HarvestMan
Mercator web crawler
Web mining and data analysis (IKT407)
Focused crawling
UbiCrawler
Labrador crawler
Distributed crawling using migrating crawlers
Minimizing the network distance in distributed crawling
Heritrix - Home Page
CyWeillance Internet Statistics
The Official BitTorrent Home Page
The Official Gnutella Home Page
PYRO - Python Remote Objects
PyLinda - Distributed Computing Made Easy
   
 

Contact

Arild Andås
Email: arild@andaas.net
Phone: (+47) 932 87206
www.andaas.net

Dinko Hadzic
Email: dinko.hadzic@gmail.com
Phone: (+47) 93 841 842
 
Anand Pillai
Email: abpillai@gmail.com
Phone: +91-80-57709341
randombytes.blogspot.com
 
 

 

Design by: Arild Andås

Copyright 2005 Harvestman. All Rights Reserved.