EXODUS and RVM



To illustrate our approach, we have modified a lightweight transaction-based system called RVM [25] and the EXODUS storage manager [6] to use remote memory (instead of disks) for synchronous write operations. After studying the performance of these systems, we concluded that they spend a significant amount of their time synchronously writing transaction data to their log file, which is used to implement a two-phase commit protocol. When a transaction commits, all the data it modified are synchronously written to the log (stored as a UNIX file on a magnetic disk). Only after these data have been successfully written to the log is the system allowed to proceed.

We have modified both EXODUS and RVM to keep a copy of their log file in remote main memory (as well as on the disk). The unmodified systems force all their sensitive data to the disk at transaction commit time using synchronous disk write operations. In our modified systems, we replace each synchronous write operation with the following two operations:

A synchronous write to the log ``file'' in the main memory of one remote workstation.
An asynchronous write to the log file on the magnetic disk. This operation is carried out in the background and preserves a local copy of the data in case the remote workstation crashes.
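The two operations above can be sketched as follows. This is a minimal illustration, not the code of RVM or EXODUS: the names `RemoteMemoryLog`, `DualLog`, and `demo` are hypothetical, and the remote workstation's memory is simulated by an in-process list rather than reached over a network.

```python
import os
import queue
import tempfile
import threading


class RemoteMemoryLog:
    """Stand-in for the log kept in a remote workstation's main memory.

    Hypothetical: a real system would ship the record over the network and
    wait for an acknowledgement; here an in-process list plays that role.
    """

    def __init__(self):
        self.records = []

    def write_sync(self, record: bytes) -> None:
        self.records.append(record)  # returns only once the record is stored


class DualLog:
    """Commit = synchronous remote write + asynchronous local disk write."""

    def __init__(self, path: str, remote: RemoteMemoryLog):
        self.remote = remote
        self._pending = queue.Queue()  # buffers records awaiting the disk
        self._file = open(path, "ab")
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def _drain(self) -> None:
        # Background thread: writes buffered records to the disk log in order.
        while True:
            record = self._pending.get()
            if record is None:  # shutdown sentinel
                return
            self._file.write(record)
            self._file.flush()
            os.fsync(self._file.fileno())  # force the record to the disk

    def commit(self, record: bytes) -> None:
        self.remote.write_sync(record)  # synchronous: the caller waits
        self._pending.put(record)       # asynchronous: handled in background

    def close(self) -> None:
        self._pending.put(None)
        self._writer.join()  # wait until every buffered record is on disk
        self._file.close()


def demo() -> tuple:
    remote = RemoteMemoryLog()
    fd, path = tempfile.mkstemp()
    os.close(fd)
    log = DualLog(path, remote)
    for i in range(3):
        log.commit(f"txn-{i}\n".encode())
    log.close()
    with open(path, "rb") as f:
        on_disk = f.read()
    os.remove(path)
    return remote.records, on_disk
```

In this sketch `commit` returns as soon as the remote copy exists, while the queue provides the data buffering that lets the disk write proceed in the background.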

Essentially, we replace a synchronous disk write operation with a synchronous network write operation plus an asynchronous disk write operation (which has no effect on completion time, since it proceeds in the background, as long as adequate data buffering is provided). At the same time, our systems do not compromise data reliability. The steps involved in writing data in our systems are:

1. At transaction commit time, the transaction's sensitive data are synchronously written to the log in remote main memory.
2. At the same time, these data are asynchronously written to the local magnetic disk.
3. Eventually, the data reach the magnetic disk.

The transaction is committed after step 2 completes, that is, once the data have been written to remote memory and the disk write has been scheduled. There appears to be a ``window of vulnerability'' between steps 2 and 3, that is, after the data have been safely written to remote memory (and scheduled to be written to the disk) but before they have been safely written to the magnetic disk. If the local system crashes during this interval, the data still in the local main-memory buffer cache will be lost. Fortunately, our system can still recover the seemingly lost data, since the same data reside in remote memory as a result of step 1. Data loss may happen only if both the local and the remote system crash during this interval. However, we have argued that the probability of both systems (which are equipped with UPSs) crashing within an interval of a few minutes is comparable to (or even lower than) the probability of a magnetic disk malfunction. Our system therefore provides a level of reliability comparable to that of a magnetic disk.
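The window of vulnerability can be illustrated with a toy model. This is a sketch under stated assumptions, not the recovery procedure of the modified systems: the function names are hypothetical, the logs are plain Python lists, and records are assumed to reach the disk in commit order, so that the disk log is always a prefix of the remote log.

```python
def recover(disk_log: list, remote_log: list) -> list:
    """Rebuild the log after a local crash.

    Assumption: the disk log is a prefix of the remote log, so the records
    lost in the crash are exactly the tail of the remote log.
    """
    missing = remote_log[len(disk_log):]  # committed but never reached disk
    return disk_log + missing


def simulate_local_crash() -> tuple:
    remote_log, disk_buffer, disk_log = [], [], []
    for record in ["t1", "t2", "t3"]:
        remote_log.append(record)   # step 1: synchronous remote write
        disk_buffer.append(record)  # step 2: asynchronous disk write scheduled
    disk_log.append(disk_buffer.pop(0))  # step 3 has completed for "t1" only
    disk_buffer.clear()  # local crash: buffered records are lost
    return disk_log, recover(disk_log, remote_log)
```

Here two committed records are still in the local buffer when the crash occurs, and both are restored from the remote copy. If the remote log were lost in the same interval, only the record already on disk would survive, which is the double-failure case discussed above.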


Evangelos Markatos
Fri Apr 11 14:07:02 EET DST 1997