Process Migration and Fault Tolerance of BSPlib Programs Running on Networks of Workstations

Jonathan M.D. Hill, Stephen R. Donaldson, and Tim Lanfear

Abstract
This paper describes a system that enables parallel programs written using the BSPlib communications library to migrate processes across a network of workstations (NOW). Not only does the system provide fault tolerance for BSPlib jobs, but by utilising a load manager that maintains an approximation of the global load of the system, it is possible to continually schedule the migration of BSP processes onto the least loaded machines in a network. Results are provided for an industrial electro-magnetics application that show that we can achieve similar throughput on a publicly available collection of workstations as on a dedicated NOW.
Contact
Jonathan Hill
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, OX1 3QD
Jonathan.Hill@comlab.ox.ac.uk