 
  
  
   
We have designed and implemented a prototype board to evaluate 
the performance of various DMA initiation algorithms. The board plugs into 
the TurboChannel I/O bus of a DEC Alpha 3000 model 300 workstation. 
All the logic is contained in a single FPGA that is directly accessible from 
user applications via shadow addressing. The board runs at 12.5 MHz. 
For each DMA method, we performed a simple test that initiates 1,000 
DMA operations. Successive DMA operations were done to (or from) different addresses, so 
as to eliminate any caching effects that intervening write buffers 
may induce. In the Repeated Passing of Arguments method, a memory barrier 
was used to make sure that repeated accesses to the same address 
were not collapsed in (or serviced by) the write buffer.  
Table 1 presents the average time it took each algorithm 
to start a DMA operation.
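
The measurement procedure described above can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the code used on the prototype: `stub_start_dma`, `average_start_time_us`, the buffers, and the `BLOCK` stride are hypothetical names, and the stub merely records its arguments instead of touching a network interface. Timing uses the standard `clock()` call, which is coarse, so only the average over all operations is meaningful.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define NUM_OPS 1000  /* number of DMA initiations, as in the text */
#define BLOCK   64    /* hypothetical stride between successive targets */

/* Hypothetical signature shared by all DMA initiation methods under test. */
typedef void (*dma_start_fn)(volatile uint8_t *src, volatile uint8_t *dst,
                             size_t len);

static volatile uintptr_t sink;          /* keeps the stub from being optimized away */
static uint8_t src_buf[NUM_OPS * BLOCK]; /* illustrative source region */
static uint8_t dst_buf[NUM_OPS * BLOCK]; /* illustrative destination region */

/* Stand-in for one initiation method: it only records its arguments.
 * A real method would pass them to the interface instead. */
void stub_start_dma(volatile uint8_t *src, volatile uint8_t *dst, size_t len)
{
    sink = (uintptr_t)src ^ (uintptr_t)dst ^ (uintptr_t)len;
}

/* Returns the average time, in microseconds, to start one DMA operation. */
double average_start_time_us(dma_start_fn start,
                             volatile uint8_t *src, volatile uint8_t *dst)
{
    assert(start != NULL);

    clock_t t0 = clock();
    for (size_t i = 0; i < NUM_OPS; i++) {
        /* Use a different address on every iteration. */
        start(src + i * BLOCK, dst + i * BLOCK, BLOCK);
    }
    clock_t t1 = clock();

    double total_us = (double)(t1 - t0) * 1e6 / (double)CLOCKS_PER_SEC;
    return total_us / NUM_OPS;
}
```

A call such as `average_start_time_us(stub_start_dma, src_buf, dst_buf)` returns the per-operation average for the stub. Substituting a real initiation routine for the stub would give the kind of figure a table of initiation times reports.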
We see that kernel-level DMA costs close to 19 microseconds, which is a 
little more than the cost of an empty system call on this 
workstation. Fortunately, all user-level DMA methods 
perform about an order of magnitude better 
than kernel-based DMA. The best method is 
``Extended Shadow Addressing'', which takes a little more 
than one microsecond. This is as expected, since this method needs only two 
assembly instructions to pass all DMA arguments to the network 
interface. The other user-level DMA methods take 2.3 to 2.6 
microseconds, which is also expected, since they use twice as many accesses 
to the network interface.
We should mention, however, that our implementation is pessimistic: user-level DMA can achieve considerably better performance in modern systems that use faster buses. The TurboChannel bus that we used runs at 12.5 MHz, while recent buses, like the PCI bus, run at frequencies as high as 66 MHz.
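
To make the bus-speed argument concrete, the following sketch converts the two clock rates quoted above into cycle times. It is a back-of-the-envelope aid only: the function names are hypothetical, and the idea that initiation cost scales with the bus clock is an assumption that ignores processor and bridge overheads.

```c
#include <assert.h>

/* Duration of one bus cycle in nanoseconds, for a clock given in MHz. */
double cycle_ns(double mhz)
{
    assert(mhz > 0.0);
    return 1000.0 / mhz;
}

/* Number of bus cycles that elapse in a given time in microseconds. */
double cycles_in(double microseconds, double mhz)
{
    assert(mhz > 0.0);
    return microseconds * mhz;
}
```

At 12.5 MHz one cycle lasts 80 ns, so one microsecond spans 12.5 bus cycles; at 66 MHz a cycle lasts roughly 15 ns. Under the assumption above, the same number of bus cycles would complete about five times faster on the faster bus.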
    
 
Table 1: Comparison of DMA initiation algorithms.
 
  
 