PhD Thesis, Supercomputer Education and Research Centre,
Indian Institute of Science, Bangalore, India, October 2007.
Cluster computer systems assembled from commodity off-the-shelf components have emerged as a viable and cost-effective alternative to high-end custom parallel computer systems. A cluster is a collection of independent computer systems combined into a unified parallel computing system through software and networking. Over the past decade, cluster computer systems have gradually come to dominate both the high-availability and high-performance computing platform markets. In our research, we take an application-centric view of cluster performance: we investigate how scalable application performance can be achieved on clusters. We study the performance of two different applications, one I/O intensive and one communication intensive.
The first application, database query processing, is I/O intensive. We first demonstrate systematically that in a large cluster with high disk bandwidth, made possible by consolidating the disk I/O bandwidths of the individual cluster nodes under a shared-nothing architecture, the processors in the cluster (even assuming aggressive processing capabilities in each node) and the I/O bus bandwidth are the two major performance bottlenecks in database applications. To show this, we develop a Petri net model of parallel query execution on a cluster. We address these two bottlenecks, namely the processing capabilities of the cluster nodes and the I/O bus bandwidth, by offloading certain application-related tasks to the processor in the network interface card, which helps shift the bottleneck away from the cluster processors. Encouraged by the benefits of offloading application tasks to network processors, we explore the possibility of performing Bloom filter operations in network processors.
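For readers unfamiliar with Bloom filters, the following is a minimal sketch of the data structure and of one common use in parallel query processing: discarding tuples that cannot match a join before they are sent over the network. The abstract does not describe how the thesis implements or applies the filter, so the class name, the parameters (num_bits, num_hashes), the SHA-256 based hashing, and the sample keys are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hash scheme are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        data = str(key).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Hypothetical join: build the filter on one relation's join keys, then
# use it to drop non-matching keys of the other relation before sending.
build_keys = [3, 7, 42]
bloom = BloomFilter()
for key in build_keys:
    bloom.add(key)

probe_keys = [1, 3, 5, 7, 9, 42]
survivors = [key for key in probe_keys if bloom.might_contain(key)]
```

A Bloom filter never produces false negatives, so every key in build_keys is guaranteed to appear in survivors; a small number of non-matching keys may also pass as false positives and must be eliminated by the join itself.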
The latter part of the thesis deals with the performance of the Community Atmospheric Model (CAM), a large-scale parallel application for global weather and climate prediction. CAM is communication intensive and involves the communication of large messages; as a result, it does not scale well with an increasing number of processors. We observe that, for large messages, communication latency increases significantly with message size. Hence reducing the message size can reduce latency, which in turn can improve the execution time and scalability of the application. We first employ lossless compression, since it preserves the accuracy and correctness of the program. However, lossless compression of floating-point values (which constitute the messages in the CAM application) does not yield a good compression ratio, resulting in only marginal benefits. In certain cases, even the marginal reduction in message latency is more than offset by the compression overhead. We therefore propose a lossy compression technique that achieves a high compression ratio yet retains the accuracy and numerical stability of the application, while achieving scalable performance.
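The trade-off between lossless and lossy compression of floating-point messages can be sketched as follows. The abstract does not specify the lossy technique proposed in the thesis; the precision reduction from 64-bit to 32-bit floats used here is only a simple stand-in, and the synthetic values, the use of zlib, and all variable names are assumptions made for illustration.

```python
import random
import struct
import zlib

# Synthetic "message": 1000 double-precision values (assumed data,
# not taken from CAM).
rng = random.Random(0)
values = [rng.uniform(200.0, 320.0) for _ in range(1000)]
raw = struct.pack(f"{len(values)}d", *values)

# Lossless path: the payload is recovered bit for bit, but the
# near-random mantissa bits typically leave little to compress.
lossless = zlib.compress(raw)
restored_exact = zlib.decompress(lossless)

# Lossy stand-in: store each value as a 32-bit float, halving the
# message size at the cost of a bounded rounding error.
lossy = struct.pack(f"{len(values)}f", *values)
restored = struct.unpack(f"{len(values)}f", lossy)

max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(values, restored))

print("raw bytes:     ", len(raw))
print("lossless bytes:", len(lossless))
print("lossy bytes:   ", len(lossy))
print("max relative error (lossy):", max_rel_err)
```

In this stand-in the lossy message is exactly half the original size and the relative error is bounded by single-precision rounding. Whether an error of that magnitude preserves the numerical stability of a given application has to be established separately, as the thesis does for CAM.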