We have designed and implemented a prototype board to evaluate
the performance of various DMA initiation algorithms. The board plugs into
the TurboChannel I/O bus of a DEC Alpha 3000 model 300 workstation.
All the logic is contained in a single FPGA that is directly accessible from
user applications via shadow addressing. The board runs at 12.5 MHz.
For each DMA method we ran a simple test that initiated 1,000
DMA operations.
Successive DMA operations were done to (or from) different addresses, so
as to eliminate any caching effects that intervening write buffers
may induce. In the Repeated Passing of Arguments method, a memory barrier
was used to make sure that repeated accesses to the same address
were not collapsed in (or serviced by) the write buffer.
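
The measurement loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code that ran on the prototype. The callback `initiate` is a hypothetical stand-in for one DMA initiation method, and the 4096-byte stride is an assumed way of giving every operation a different address. A host-side timer only approximates what the real test measured, and the memory barrier used for Repeated Passing of Arguments is not modeled.

```python
import time

def average_initiation_time_us(initiate, n_ops=1000, stride=4096):
    """Return the average time in microseconds of n_ops calls to `initiate`.

    Every call gets a different source and destination address, mirroring
    the use of distinct addresses for successive DMA operations.
    """
    start_ns = time.perf_counter_ns()
    for i in range(n_ops):
        src = i * stride              # assumed layout: sources first
        dst = (n_ops + i) * stride    # destinations after all sources
        initiate(src, dst)
    elapsed_ns = time.perf_counter_ns() - start_ns
    return elapsed_ns / n_ops / 1000.0

# Hypothetical stand-in for a real initiation method: it only records
# the addresses it was asked to use.
seen = []

def recording_initiate(src, dst):
    seen.append((src, dst))

avg_us = average_initiation_time_us(recording_initiate)
print(f"average initiation time: {avg_us:.3f} microseconds")
```

Replacing `recording_initiate` with a routine that actually writes the DMA arguments to the interface would turn this into a real measurement.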
Table 1 presents the average time it took for each algorithm
to start a DMA operation.
We see that kernel-level DMA costs close to 19 microseconds, which is a
little more than the cost of an empty system call on this
workstation. Fortunately, all user-level DMA methods
perform about an order of magnitude better
than kernel-level DMA. The best method is
``Extended Shadow Addressing'', which takes a little more
than one microsecond. This is as expected, since this method needs only two
assembly instructions to pass all DMA arguments to the network
interface. The other user-level DMA methods take 2.3-2.6
microseconds, which is also expected since they use twice as many accesses
to the network interface.
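
The "order of magnitude" claim can be checked with simple arithmetic on the figures quoted above. The sketch below uses only the approximate values from the text. Taking 19.0 for the kernel-level cost and 1.0 for Extended Shadow Addressing is an assumption, since the text gives them only as "close to 19" and "a little more than one" microseconds.

```python
# Approximate initiation latencies in microseconds, as quoted in the text.
KERNEL_US = 19.0  # "close to 19" (assumed exact here)

user_level_us = {
    "extended shadow addressing": 1.0,  # "a little more than one" (lower bound)
    "other user-level, fastest": 2.3,
    "other user-level, slowest": 2.6,
}

def speedup_vs_kernel(latency_us, kernel_us=KERNEL_US):
    """How many times faster a method is than kernel-level initiation."""
    return kernel_us / latency_us

for name, latency in user_level_us.items():
    print(f"{name:28s} {latency:4.1f} us  {speedup_vs_kernel(latency):5.1f}x")
```

The methods that take 2.3 to 2.6 microseconds come out roughly 7 to 8 times faster than the kernel path, and Extended Shadow Addressing approaches a factor of 19.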
We should mention, however, that our implementation is pessimistic:
user-level DMA can achieve considerably better performance in modern
systems that use faster buses. The TurboChannel bus that we used runs
at 12.5 MHz, while recent buses, such as PCI, run at frequencies as
high as 66 MHz.
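
To put the two bus speeds in perspective, the clock figures above can be turned into cycle times. The sketch below is only arithmetic on numbers quoted in the text. It does not predict how much faster DMA initiation would be on a 66 MHz bus, since only the bus-bound part of the latency would be expected to scale with the clock.

```python
TURBOCHANNEL_MHZ = 12.5  # bus clock of the prototype, from the text
PCI_MHZ = 66.0           # upper PCI clock mentioned in the text

def cycle_time_ns(clock_mhz):
    """Duration of one bus cycle in nanoseconds."""
    return 1000.0 / clock_mhz

def bus_cycles(latency_us, clock_mhz):
    """Number of bus cycles that fit in a latency given in microseconds."""
    return latency_us * clock_mhz

clock_ratio = PCI_MHZ / TURBOCHANNEL_MHZ

print(f"TurboChannel cycle: {cycle_time_ns(TURBOCHANNEL_MHZ):.1f} ns")
print(f"PCI cycle:          {cycle_time_ns(PCI_MHZ):.1f} ns")
print(f"clock ratio:        {clock_ratio:.2f}")
print(f"2.6 us at 12.5 MHz: {bus_cycles(2.6, TURBOCHANNEL_MHZ):.1f} cycles")
```

At 12.5 MHz a bus cycle lasts 80 ns, so even the slowest user-level method at 2.6 microseconds spans only about 32 bus cycles.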
Table 1: Comparison of DMA initiation algorithms.