Proceedings of the 1st Workshop on Software Distributed Shared Memory (WSSM-99)
Rhodes, Greece, June 1999
Traditional software Distributed Shared Memory (DSM) systems rely on virtual memory management mechanisms to detect accesses to shared memory locations and maintain their consistency. This is achieved through the segmentation violation signal (segv) and an associated segv handler. While the steps taken by the segv handler are themselves unavoidable, the involvement of the OS kernel and its significant associated overhead can be avoided by careful compile-time analysis and code instrumentation.
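The segv-based detection described above can be illustrated with a minimal sketch. It assumes a POSIX system such as Linux; the names `dsm_init`, `segv_handler`, `region`, and `fault_count` are hypothetical and are not taken from CVM or any other DSM. The shared region starts out inaccessible, so the first access to each page traps into the kernel, which delivers the signal to the handler. A real DSM would fetch a consistent copy of the page at that point; the sketch only re-enables access and counts the fault.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;                        /* hypothetical shared region */
static size_t region_len;
static long page_size;
static volatile sig_atomic_t fault_count;   /* faults seen so far */

/* Called by the kernel on the first access to a protected page. */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;

    if (addr < base || addr >= base + region_len)
        _exit(1);                           /* a genuine crash, not a DSM access */

    uintptr_t page = addr & ~((uintptr_t)page_size - 1);
    fault_count++;
    /* A real DSM would bring the page up to date here (assumption).
     * Calling mprotect from a handler is common practice in DSM systems,
     * although POSIX does not list it as async-signal-safe. */
    mprotect((void *)page, (size_t)page_size, PROT_READ | PROT_WRITE);
}

/* Map npages of inaccessible memory and install the handler. */
static int dsm_init(size_t npages)
{
    struct sigaction sa;

    page_size = sysconf(_SC_PAGESIZE);
    region_len = npages * (size_t)page_size;
    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Every first touch of a page costs a kernel trap, a signal delivery, and a return from the handler, in addition to the consistency work itself. That extra path is the overhead a compiler-assisted scheme aims to bypass.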
In this paper, we propose CAS-DSM, a Compiler Assisted Software support approach. In the CAS-DSM implementation, the page fault overhead is avoided by instrumenting the application code at the source level. The overhead caused by executing the instrumented code is reduced through aggressive compile-time optimizations. We used SUIF, a public domain compiler tool, to implement the compile-time analysis, instrumentation, and optimizations.
CAS-DSM relies on linear array index analysis to detect shared memory accesses that could potentially raise a segv. In our basic implementation, the consistency check code is inserted immediately before each shared access. To improve performance, we instead aggregate the inserted checks, one for each shared access in a loop, and hoist them above the loop. This aggregation and hoisting can be extended to outer loops as well. Taking a more aggressive approach, we also propose discarding some of the inserted code using a simple heuristic. To study the effect of inter-procedural analysis and optimizations in CAS-DSM, we manually inlined functions and performed constant propagation.
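The difference between the basic instrumentation and the hoisted form can be sketched by hand as follows. This is an illustration under stated assumptions, not output from SUIF: the runtime entry points `dsm_check` and `dsm_check_range`, the `page_valid` table, the `check_count` counter, and the fixed `PAGE_SIZE` are all hypothetical stand-ins for whatever the modified DSM runtime provides.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096          /* assumed page size */
#define NUM_PAGES 64

static double shared[NUM_PAGES * PAGE_SIZE / sizeof(double)];
static int page_valid[NUM_PAGES];   /* hypothetical per-page state */
static long check_count;            /* number of checks executed */

static size_t page_of(const void *addr)
{
    return (size_t)((const char *)addr - (const char *)shared) / PAGE_SIZE;
}

/* Hypothetical check for a single shared access. */
static void dsm_check(const void *addr)
{
    size_t p = page_of(addr);
    check_count++;
    if (!page_valid[p])
        page_valid[p] = 1;      /* a real runtime would fetch the page */
}

/* Hypothetical aggregated check covering every page in [first, last]. */
static void dsm_check_range(const void *first, const void *last)
{
    size_t p;
    check_count++;
    for (p = page_of(first); p <= page_of(last); p++)
        if (!page_valid[p])
            page_valid[p] = 1;
}

/* Basic instrumentation: one check immediately before each access. */
double sum_basic(size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        dsm_check(&shared[i]);
        s += shared[i];
    }
    return s;
}

/* Hoisted form: the index is linear in i, so the accessed range is
 * known before the loop and a single check can cover it. */
double sum_hoisted(size_t n)
{
    double s = 0.0;
    size_t i;
    if (n > 0)
        dsm_check_range(&shared[0], &shared[n - 1]);
    for (i = 0; i < n; i++)
        s += shared[i];
    return s;
}
```

For a loop of 1000 iterations, the basic version executes 1000 checks while the hoisted version executes one, at the price of validating pages that a loop with an early exit might never have touched.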
We modified CVM, a publicly available software DSM, to support the instrumentation inserted by the compiler. We report a detailed performance evaluation of CAS-DSM using the Splash/Splash2 parallel application benchmarks. The benchmarks were run on an IBM-SP2, a distributed memory machine, with the number of processors ranging from 1 to 8. Our method achieves a performance improvement of 10% to 15% for most of the applications compared to the original CVM implementation.