Evangelos P. Markatos
Institute of Computer Science (ICS)
Foundation for Research & Technology - Hellas (FORTH)
Heraklio, Crete, GR-711-10 GREECE
In Proceedings of MASCOTS '96, pages 69-73, San Jose, CA, Feb 1996
The increasing use of high-bandwidth, low-latency networks makes it possible to use remote (network) memory, instead of the disk, as a place to store an application's data, because remote-to-local memory transfers over a modern interconnection network are faster than traditional disk-to-memory transfers. In this paper we explore the possibility of using remote memory as (i) a (faster-than-disk) backing store, (ii) an extension of main memory accessed using single (remote) memory references, and (iii) a combination of both. We use execution-driven simulation to investigate the performance impact that the use of remote memory has on several real programs. We conclude that even with today's low-throughput networks, using remote memory to store (some of) an application's data may result in significant performance improvements, and that these improvements will grow as the disparity between disk transfer rates and network transfer rates continues to increase.
Recent applications such as multimedia, windowing systems, scientific computations, and engineering simulations, running on workstation clusters (or networks of PCs), require an ever-increasing amount of memory. Although the total amount of main memory in a workstation cluster is usually sufficient for these applications, the amount of physical memory provided by each individual workstation is usually insufficient to cover the needs of current and near-future applications. To make matters worse, multiprogramming and time-sharing reduce the amount of physical main memory available to each application even more. To keep workstation costs low, most workstations have a powerful CPU but not the amount of main memory that some applications need. Thus, applications run efficiently only as long as their working set fits in the main memory of the workstation; as soon as the working set exceeds the main memory size, the performance of the application suffers severely.
We believe that applications that need more main memory than a single workstation can provide should make use of the remote main memory available in a workstation cluster, provided that the appropriate system software support exists. In this paper we explore the trends and circumstances under which using remote memory to store an application's data is beneficial from a performance point of view. Two recent architectural developments make the use of remote memory attractive:
In this paper we show that using remote memory to store and retrieve an application's data in workstation clusters pays off in performance. In section 3 we describe several methods with which applications can make use of the available remote memory: (i) as a fast RAM disk, (ii) as a (slow) main memory accessed via regular read and write operations, and (iii) as a combination of both. To evaluate the various architectural alternatives and the paging methods built on top of them, we use execution-driven simulation and present our performance results. Section 4 places our work in the proper context by comparing it with related work. Finally, section 5 discusses extensions of our approach and presents our major conclusions.
The architecture we assume is a distributed system composed of workstations and an interconnection network. Each workstation consists of a processor and its local memory. The memory of all the other workstations in the system comprises the remote memory. In the simplest version of this distributed system (the DISK), the remote memory is not used. The only paging device is the disk. When a page needs to be evicted from local memory, it is sent to disk; when a page needs to be fetched to local memory, it is fetched from disk. This is exactly what most paging systems do today.
To facilitate fast page transfers, remote memory could be used as a paging device (RAM_DISK). When a page must be paged-out to make room for newly referenced data, the pager sends the page to remote memory. When the page is later needed, it is fetched from remote memory. RAM_DISK allows only whole page transfers at a time. Both RAM_DISK and DISK experience the same number of page faults, but RAM_DISK does not suffer from seek and rotational latencies as the disk does.
In our next version of the architecture, the REMOTE_MAIN, remote memory is used as main memory accessed using single read and write operations, while the disk is used as the backing store. The pager works as follows: When a page is referenced, it is brought into local main memory if there is an available frame. Otherwise, it is sent to remote main memory. If neither part of main memory has an available frame, a page residing in local memory is evicted to disk to make room for the new page. Once a page has been mapped into a frame (either local or remote), it is never moved (unless both local and remote memories fill up). Both local and remote memory are accessed via regular load and store operations.
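The REMOTE_MAIN frame-allocation rule can be sketched in a few lines of illustrative Python; the frame counts and the FIFO choice of disk victim are assumptions of the sketch, not part of the policy's definition:

```python
def place_page(page, local, remote, disk, local_frames=2, remote_frames=2):
    """Place a newly referenced page under REMOTE_MAIN: local memory
    while frames last, then remote memory, spilling to disk only when
    both are full."""
    if len(local) < local_frames:
        local.append(page)          # free local frame: map locally
    elif len(remote) < remote_frames:
        remote.append(page)         # free remote frame: map remotely
    else:
        disk.append(local.pop(0))   # both full: evict a local page to disk
        local.append(page)

# Five first-touch references with two local and two remote frames.
local, remote, disk = [], [], []
for p in range(5):
    place_page(p, local, remote, disk)
assert local == [1, 4] and remote == [2, 3] and disk == [0]
```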
In the last version of the architecture, the COUNTERS, we assume the existence of a reference counter per remote page. When a processor accesses a remote page, the counter associated with the page is decremented. When the counter reaches zero, an interrupt is sent to the operating system, which replicates the page locally, sending a local page to remote memory to make space available. This architecture is a combination of RAM_DISK and REMOTE_MAIN, and we believe it combines the advantages of both. When a page is frequently accessed, COUNTERS migrates the page to local memory, much like RAM_DISK, thus providing fast access to the page. When a page is infrequently accessed, COUNTERS provides remote access to the page, much like REMOTE_MAIN, thus avoiding unnecessary page transfers.
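The COUNTERS migration mechanism can be modeled with the following Python sketch; the class name, the LRU choice of local victim, and the recharging of an evicted page's counter are assumptions made for the sketch, not part of the architecture's definition:

```python
from collections import OrderedDict

class CountersPager:
    """Toy model of the COUNTERS policy: a page is first mapped remotely
    and served by single remote references; when its per-page counter
    expires, the page is migrated to local memory, evicting a local page
    to remote memory if necessary."""

    def __init__(self, local_frames, initial_count=128):
        self.local_frames = local_frames
        self.initial_count = initial_count
        self.local = OrderedDict()   # locally resident pages, LRU order
        self.remote = {}             # remote page -> remaining counter
        self.migrations = 0

    def access(self, page):
        if page in self.local:
            self.local.move_to_end(page)            # local hit
            return "local"
        if page not in self.remote:
            self.remote[page] = self.initial_count  # map remotely first
        self.remote[page] -= 1                      # one remote reference
        if self.remote[page] == 0:                  # counter expired:
            del self.remote[page]                   # migrate the page locally
            if len(self.local) >= self.local_frames:
                victim, _ = self.local.popitem(last=False)  # evict LRU page
                self.remote[victim] = self.initial_count
            self.local[page] = None
            self.migrations += 1
        return "remote"

pager = CountersPager(local_frames=2, initial_count=3)
for _ in range(3):
    pager.access("a")        # the third access expires the counter
assert "a" in pager.local and pager.migrations == 1
```

Note that with `initial_count=1` the sketch degenerates to (roughly) RAM_DISK behavior, migrating a page on its first touch, while a very large `initial_count` makes it behave like REMOTE_MAIN, (almost) never migrating.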
In this section we evaluate the performance of various remote memory configurations and of various paging policies. We use ATOM, an object-code rewriting tool that gathers traces of applications and simulates a memory system on the fly. ATOM takes an executable as input and instruments it with calls to simulation routines. The application is actually executed while, at the same time, our simulator runs and simulates the paging architecture we want. The parameters of the simulated architecture can be found in table 1.
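The on-the-fly simulation idea can be illustrated with a toy reference-level simulator: every memory reference of the instrumented program is funneled through a callback that updates the simulated paging system. This is a generic sketch of trace-driven paging simulation, not ATOM's actual interface, and the page size and cycle costs are arbitrary assumptions:

```python
PAGE_SHIFT = 13  # 8 Kbyte pages, an assumption for the sketch

class PagingSimulator:
    """Charges a large cost for each simulated page fault and a small
    cost for each memory reference."""

    def __init__(self, local_cost=1, fault_cost=100000):
        self.resident = set()        # pages currently in (local) memory
        self.cycles = 0
        self.local_cost = local_cost
        self.fault_cost = fault_cost

    def reference(self, addr):
        """Called for every load/store the instrumented program makes."""
        page = addr >> PAGE_SHIFT
        if page not in self.resident:
            self.cycles += self.fault_cost   # simulated page fault
            self.resident.add(page)
        self.cycles += self.local_cost       # the reference itself

sim = PagingSimulator()
for addr in (0, 64, 8192, 8256):   # a tiny synthetic "trace"
    sim.reference(addr)
# two distinct pages touched -> two faults, four references
assert sim.cycles == 2 * 100000 + 4
```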
We have chosen the applications from several domains (numerical analysis, matrix operations, data management, and circuit simulation) so that they represent a wide variety of data access patterns. We use the following applications:
The working set and trace length of each application are described in the following table:
Table 1: Architecture Parameters.
In all our simulation studies the page replacement policy is LRU. We have also experimented with a random replacement policy and have seen similar results.
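For concreteness, counting page faults under LRU replacement can be sketched as follows (the toy trace and frame count are illustrative):

```python
from collections import OrderedDict

def simulate_lru(trace, frames):
    """Return the number of page faults the trace incurs under LRU
    replacement with the given number of page frames."""
    memory = OrderedDict()
    faults = 0
    for page in trace:
        if page in memory:
            memory.move_to_end(page)          # mark most recently used
        else:
            faults += 1
            if len(memory) >= frames:
                memory.popitem(last=False)    # evict least recently used
            memory[page] = None
    return faults

# Pages 1 and 2 are reused and kept resident; page 3 is the LRU victim.
assert simulate_lru([1, 2, 3, 1, 2, 4, 1], frames=3) == 4
```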
Figure 1: Paging policies. Performance of our application suite under various paging configurations. In almost all cases COUNTERS has the best performance, followed by RAM_DISK.
The first question we set out to answer is how useful remote memory is for storing an application's data. We executed the applications on a DEC Alpha workstation and simulated the paging architectures as described. For each application/architecture combination we measured the average memory access cost of the application running on that architecture, and plotted the results in figure 1. We see that in three out of five applications (TRANS, GAUSS and SORT) DISK performs worse than the other architectures, sometimes substantially so. In TRANS, for example, where there is no locality at all, the DISK policy performs 24349/65.65 = 371 times worse than the COUNTERS policy. The inferiority of DISK relative to COUNTERS is also evident (though not as pronounced as in TRANS) in the GAUSS and SORT applications. The only exception is the MVEC application, in which DISK is better than COUNTERS by about 10%. The reason is that MVEC reads (most of) its input only once and never uses it again; thus, sophisticated paging policies do not improve its performance.
Among the policies that make use of remote memory, COUNTERS seems to be the best (or close to the best) in all cases. RAM_DISK is better than REMOTE_MAIN, with the exception of the TRANS application, where the system serviced an extraordinary number of page faults. REMOTE_MAIN is almost always the worst policy, because the remote memory access cost is large (100 to 1) compared to the local memory access cost. Li and Petersen have done experiments on a system where the remote memory access cost was only twice the local memory access cost, and in their system REMOTE_MAIN was much more attractive than RAM_DISK. Unfortunately, as architecture trends suggest, the cost of remote memory access will continue to increase relative to the local memory access cost, and thus we expect the usefulness of REMOTE_MAIN to decrease. However, we should note that the ability to make single remote memory accesses is valuable when used prudently, as in the COUNTERS policy.
Figure 2: Performance of paging policies as a function of Network throughput: GAUSS application. The performance of DISK is insensitive to network throughput. All other policies improve with network throughput. It is interesting to note that even when network throughput is as low as the disk transfer rate (5 Mbytes/sec), DISK is still inferior to remote memory paging policies.
Storing an application's data in remote memory has not been popular so far, partly because local area networks did not provide high throughput, making a memory-to-memory transfer over a LAN comparable to a disk-to-memory transfer from the local disk. To evaluate the effect the network throughput has on remote paging policies, we simulated the execution of our applications while varying the interconnection network throughput from 1 Mbyte/sec to 100 Mbytes/sec. The results for GAUSS are shown in figure 2. When the network throughput is low (1 Mbyte/sec), DISK outperforms all remote memory paging policies because the disk-to-memory transfer rate is 5 Mbytes/sec; disk-to-memory transfers are thus (almost) 5 times faster than memory-to-memory transfers over the network. However, when the network throughput becomes equal to the disk-to-memory throughput (5 Mbytes/sec), both COUNTERS and RAM_DISK outperform the simple DISK policy: even when the network throughput is the same as the disk throughput, it is more efficient to use remote memory, rather than the disk, as backing store! The reason is simple: remote memory does not suffer from seek and rotational delays as the disk does. When the network throughput is as high as 100 Mbytes/sec, DISK is clearly inferior to all remote paging policies. Even when the network throughput is only moderately high (close to 25 Mbytes/sec), the performance of remote memory paging policies is clearly higher than that of DISK.
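The crossover described above can be reproduced with a toy per-fault cost model. The 8 Kbyte page size, the 12 ms disk positioning (seek plus rotation) overhead, and the 11 ms per-page network software overhead are assumptions chosen so that the model matches the behavior reported in the text; they are not measured values:

```python
PAGE_KB = 8.0           # assumed page size (Kbytes)
POSITIONING_MS = 12.0   # assumed disk seek + rotational latency (ms)
NET_OVERHEAD_MS = 11.0  # assumed per-page network software overhead (ms)

def transfer_ms(kbytes, mbytes_per_s):
    """Time to move `kbytes` at the given throughput, in milliseconds."""
    return kbytes / (mbytes_per_s * 1024.0) * 1000.0

def disk_fault_ms(disk_mb_s=5.0):
    """Disk paging pays positioning latency plus the transfer."""
    return POSITIONING_MS + transfer_ms(PAGE_KB, disk_mb_s)

def net_fault_ms(net_mb_s):
    """Remote-memory paging pays software overhead plus the transfer."""
    return NET_OVERHEAD_MS + transfer_ms(PAGE_KB, net_mb_s)

# At 1 Mbyte/sec the disk still wins; at the disk's own 5 Mbytes/sec
# the network already wins, since it pays no seek or rotation.
assert net_fault_ms(1.0) > disk_fault_ms()
assert net_fault_ms(5.0) < disk_fault_ms()
```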
The COUNTERS policy does not bring a page to local memory on its first access, as the RAM_DISK and DISK policies do. Instead, it maps the page remotely and allows the processor to make remote memory accesses to it; only after the processor has made a sufficient number of remote accesses is the page migrated locally. This number is set by the operating system on a per-page basis. If the number is very low, COUNTERS behaves much like RAM_DISK, bringing pages locally (almost) on first access. If the number is very high, COUNTERS behaves much like REMOTE_MAIN, (almost) never replicating pages. When the performance of REMOTE_MAIN differs substantially from that of RAM_DISK, adjusting the initial value of the counter makes COUNTERS behave like the better of the two policies. To understand the influence the initial value of the counter has on the COUNTERS policy, we simulated the execution of our applications while varying the initial counter value for each page, and plotted the results in figure 3. We see that the performance of the TRANS application improves with the counter value. The reason is that TRANS has almost no locality: the processor makes only a few accesses to each page before paging it out, so it is not worth bringing the page locally. Thus, the higher the counter value, the better the application performs. Exactly the opposite holds for SORT, whose performance gets worse with higher counter values. The reason is that the sorting application makes several accesses to each page before the page is paged out, so it is worthwhile to bring pages into local memory as soon as possible. For MVEC and GAUSS the initial value of the counter makes little difference in performance, because REMOTE_MAIN and RAM_DISK perform close to each other.
Observing figure 3 closely, we note that for all applications, when the initial counter value is between 64 and 512, the performance of the COUNTERS policy is (almost) the best. Thus, the COUNTERS policy does not require difficult tuning: an initial value between 64 and 512 for each counter is a reasonable choice.
Figure 3: Tuning the COUNTERS policy. When the initial value of the counter is between 64 and 512, all applications perform close to their best.
Although single remote memory accesses help avoid thrashing (as for the TRANS application in figure 1), few architectures provide hardware-implemented remote memory accesses. Several systems, though, implement them completely or mostly in software, at reasonable performance cost. To investigate the usefulness of software-implemented remote memory accesses, we simulated the performance of our applications running under the COUNTERS policy using software-implemented counters (called SOFTWARE_COUNTERS). The performance of SOFTWARE_COUNTERS compared to the DISK, COUNTERS, and RAM_DISK policies can be found in figure 4. We see that the performance of SOFTWARE_COUNTERS is only slightly worse than that of the hardware-provided COUNTERS. Moreover, SOFTWARE_COUNTERS performs significantly better than RAM_DISK for applications with no locality (e.g. TRANS), while for the other applications, which have good locality, SOFTWARE_COUNTERS performs comparably to RAM_DISK and COUNTERS.
Figure 4: Paging policies. The effectiveness of simulating counters in software (SOFTWARE_COUNTERS) is evaluated. In most cases, SOFTWARE_COUNTERS performs very well, always close to hardware COUNTERS.
This paper evaluates the performance benefits of using remote memory as a place to store an application's data. Its main contributions are (i) the proposal of a new policy (COUNTERS) that uses remote memory both as main memory and as backing store, (ii) the evaluation of memory management policies for real applications, and (iii) the systematic exploration of various parameters that influence the use of remote memory for storing an application's data.
Li and Petersen have implemented a related system in which main memory is added on the I/O bus (VME bus) of a computer system. This memory can be used both as a backing store and as (slow) main memory accessed via regular load and store operations.
Several research groups study the issues in using remote memory in a workstation cluster to improve paging performance [2, 3, 5, 8, 10, 13, 15] and file system performance [1, 6, 7]. Our work differs from previous approaches in two aspects: (i) we provide extensive performance results based on execution-driven simulation of real applications, while previous approaches have provided very limited performance results (at least with respect to paging), and (ii) we explore the use of single remote memory references as a mechanism to access infrequently used pages by proposing a new policy called COUNTERS. This policy clearly outperforms all previously proposed policies.
In this paper we present a memory shortage problem faced by several applications that run on a tightly-coupled distributed system or multiprocessor, and explain how remote memory can be used to store an application's data. We use execution-driven simulation of real applications to evaluate the usefulness of remote memory. We have shown that in most cases, storing an application's data in remote memory is faster than storing it on disk, sometimes substantially so, especially when the system thrashes.
Based on our experiments we conclude that:
Based on our conclusions, we believe that storing an application's data in remote memory (instead of on the local disk) results in performance improvements in several cases, and that these improvements will increase in the near future if current architecture trends continue to hold.
Using Remote Memory to avoid Disk Thrashing:
A Simulation Study