M.E. (Reg.) Thesis, Department of Computer Science and Automation,
Indian Institute of Science, Bangalore, India, January 1997.
Distributed memory multiprocessing offers a cost-effective and scalable solution for a large class of scientific and numeric applications. Unfortunately, programming distributed memory systems is quite tedious due to the presence of multiple disjoint address spaces and the need for explicit communication. To allow programmers to use the convenient shared memory model for programming, the distributed shared memory approach provides the illusion of globally shared memory by implementing a shared memory abstraction on a physically distributed memory system.
We have developed DSM-SP2, a distributed shared memory system built on the IBM SP2, a distributed memory machine. DSM-SP2 is implemented completely in software as a set of user-level library routines on top of the AIX operating system, without requiring any modifications to the operating system or any additional compiler support. It uses the MPL interface provided by the SP2 system for communicating across the different nodes, which are interconnected by a 150 MB/s high performance switch.
DSM-SP2 tries to reduce the amount of interprocessor communication required to keep the shared memories consistent. To achieve this, it implements the release consistency model and also allows multiple concurrent writers to modify a page, thereby reducing the impact of false sharing. The modifications made by a process on a particular node are communicated to other nodes in the form of diffs, and DSM-SP2 uses the eager diff creation approach for this purpose. In order to reduce the communication time, our DSM-SP2 implementation uses the user-space path of the high performance switch for interprocessor communication.
This report presents in detail the design and the implementation of DSM-SP2. The speedups obtained on an 8-processor DSM-SP2 for three benchmark programs, namely Water, Jacobi, and Tomcatv, are 2, 3, and 1.5, respectively. Our performance results indicate that reasonable speedups can be achieved for large programs by properly parallelizing the code and exploiting the locality of data.