Abstract
An availability management service is responsible for automatically ensuring that all critical services of a distributed system remain continuously available to users despite node removals and restarts caused by failures, maintenance and growth. We present an availability management service for an asynchronous distributed system characterized by unbounded communication delays and by the availability at all nodes of local, nonsynchronized timers that measure the passage of real time with some known accuracy. Examples of such systems are Unix-, VMS-, VM-, or MVS-based distributed systems connected by local area networks such as Ethernet, token ring, FDDI, or channel-to-channel adapters. The presentation stresses the main ideas behind this new service and outlines a simple design that depends upon the existence of asynchronous membership and atomic broadcast group communication services.
