Distributed HarvestMan, alias D-HarvestMan [DHM], is a distributed Web crawler implemented in the Python [PYT] programming language. D-HarvestMan was developed to meet the requirements of the European Internet Accessibility Observatory (EIAO) project [EIAO], in which it is intended to be one of the central system components for large-scale assessment of website accessibility.
Based on the existing HarvestMan [HMAN] project, D-HarvestMan adds several new aspects to the solution: efficiency, parallelism and fault tolerance.
D-HarvestMan consists of a single Manager instance and a set of Slave instances. The Manager is responsible for coordinating the overall crawl. The Slave instances perform most of the actual crawling.
A service discovery mechanism, developed and implemented as part of this work, enables the Manager to continuously look for new Slave instances and to start them on remote machines in the network.
The D-HarvestMan architecture allows any running instance to operate as the Manager, so a Slave instance can take over the Manager's responsibilities if the Manager crashes. This makes for a truly fault-tolerant, distributed system.
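The Manager/Slave roles and the takeover on a Manager crash can be illustrated with a minimal sketch. This is not the D-HarvestMan implementation: the `Instance` and `Cluster` classes, the round-robin URL distribution, and the rule that the surviving instance with the lowest identifier becomes the new Manager are all assumptions made for illustration, and the instances here live in one process rather than on separate machines.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    """One crawler instance; every instance is able to act as Manager."""
    instance_id: int
    alive: bool = True
    assigned_urls: List[str] = field(default_factory=list)


class Cluster:
    """Hypothetical in-process stand-in for a set of networked instances."""

    def __init__(self, instance_ids):
        self.instances = {i: Instance(i) for i in instance_ids}
        # Assumption: the instance with the lowest id starts as Manager.
        self.manager_id = min(self.instances)

    def live_slaves(self):
        """Return all live instances except the current Manager, by id."""
        return [inst for i, inst in sorted(self.instances.items())
                if inst.alive and i != self.manager_id]

    def distribute(self, urls):
        """Manager duty: hand URLs to the live Slaves in round-robin order."""
        slaves = self.live_slaves()
        if not slaves:
            raise RuntimeError("no live Slave instances to crawl with")
        for n, url in enumerate(urls):
            slaves[n % len(slaves)].assigned_urls.append(url)

    def report_crash(self, instance_id):
        """Mark an instance as dead; promote a Slave if it was the Manager."""
        self.instances[instance_id].alive = False
        if instance_id == self.manager_id:
            survivors = [i for i, inst in self.instances.items() if inst.alive]
            if not survivors:
                raise RuntimeError("no live instances left")
            # Assumed election rule: lowest surviving id takes over.
            self.manager_id = min(survivors)
        return self.manager_id
```

With `cluster = Cluster([1, 2, 3])`, instance 1 starts as the Manager and `cluster.distribute(...)` spreads URLs over instances 2 and 3. Calling `cluster.report_crash(1)` then promotes instance 2, leaving instance 3 as the only Slave. A real deployment would additionally need failure detection and a way to hand over crawl state, neither of which is modelled here.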
The prototype implementation of D-HarvestMan has been tested on multiple separate computers, providing proof of concept for several of the architectural choices and showing a significant increase in efficiency. It has also been used to identify potential bottlenecks in a future production implementation.