2.3 Reliability Issues

In a distributed system, a workstation may crash at any time (e.g. due to software or human error). If the crashed workstation acts as an NRD server, the client's disk blocks that it holds are lost. Clearly, it is not acceptable for applications running on the client workstation to lose their data files because of a remote server crash. Instead, we would like to be able to recover the lost disk blocks. Otherwise, a remote server crash can cause not only an application crash and the loss of important data, but also a crash of the client itself, especially when the Network RamDisk is used to store the swap space of applications.

If the crashed workstation acts only as a client, then after recovering from the crash it can reconnect to the server hosts and find the Network RamDisk in the same state in which a magnetic disk would be after the crash. Running fsck should then correct any file system inconsistencies.

There are many types of crashes. First of all, a machine may crash because of a loss of power. This situation is not addressed in this paper, since most computer buildings are equipped with uninterruptible power supplies (UPSs). The most frequent cause of a crash is a software failure. To avoid loss of data due to a server crash, some systems also write all network memory data to the disk [1, 11]. Instead, we design a reliable Network RamDisk that is able to reconstruct the lost disk blocks when one server fails.

To provide this level of reliability, some form of redundancy must be used. The main issues that must be taken into account when choosing the form of redundancy are:

The runtime overhead introduced must be minimal since it is a cost paid even when no server crashes.
The memory overhead introduced must be as low as possible because the memory reserved for reliability could be used locally at each workstation in order to store its memory pages.
The runtime overhead of recovery from failure should be low.

We consider and discuss two different reliability policies: mirroring and a novel policy called adaptive parity caching.
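
To make the trade-offs listed above concrete, the following Python sketch contrasts the memory overhead of mirroring with that of a plain XOR parity scheme, and shows how a single lost block can be rebuilt from the surviving blocks and the parity block. It is only an illustration under stated assumptions: it uses basic parity over a fixed group of servers, not the adaptive parity caching policy of this paper, and the names xor_blocks, memory_overhead and num_data_servers are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def memory_overhead(policy, num_data_servers):
    """Extra remote memory needed per byte of client data (hypothetical helper).

    Assumes one parity block per group of num_data_servers data blocks.
    """
    if policy == "mirroring":
        return 1.0  # every block is stored on two servers
    if policy == "parity":
        return 1.0 / num_data_servers  # one extra block per parity group
    raise ValueError("unknown policy: " + policy)


# Four data servers, each holding one 8-byte block of the same parity group.
data = [bytes([value] * 8) for value in (1, 2, 3, 4)]
parity = xor_blocks(data)  # kept on a separate, fifth server

# Simulate the crash of the server holding block 2.
lost = 2
survivors = [block for index, block in enumerate(data) if index != lost]

# XOR of the surviving data blocks and the parity block yields the lost block.
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost]

print(memory_overhead("mirroring", 4), memory_overhead("parity", 4))
```

In this sketch, mirroring doubles the remote memory used but recovers a block by reading a single copy, whereas parity needs only one extra block per group but must read every surviving block of the group to rebuild the lost one.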

Mike Flouris
Thu Sep 17 18:12:15 EET DST 1998