Lab for High Performance Computing SERC, Indian Institute of Science

Efficient Techniques Exploiting Memory Hierarchy to Improve Network Processor Performance

MSc(Engg) Thesis, Supercomputer Education and Research Centre,
Indian Institute of Science, Bangalore, India, June 2008.


  1. Girish B.C., Supercomputer Education and Research Centre


The performance of network processors depends on the architecture of the chip, the network processing application, and the workload characteristics. In this thesis, we model the memory hierarchy of a network processor and its interaction with the network application. Using characteristics observed from internet traces, we propose new packet buffer allocation policies, a dynamic buffering scheme for core routers, and a new cache replacement policy for caching the results of packet classification. Previous studies have shown that buffering packets in DRAM is a performance bottleneck. To understand the impediments in accessing the DRAM, we developed a detailed Petri net model of SDRAM, the memory used as the packet buffer in routers. This model is integrated with a Petri net model of the IP forwarding application on the IXP2400. We show that accesses to small chunks of data, less than 64 bytes, reduce the bandwidth realized by the DRAM. With real traces, up to 30% of the accesses are to such small chunks, resulting in a 7.7% reduction in DRAM bandwidth. To overcome this problem, we propose three buffering schemes that buffer these small chunks in the on-chip scratchpad memory. These schemes also exploit a greater degree of parallelism between different levels of the memory hierarchy. Using real traces from the internet, we show that the transmit rate can be improved by 21% on average over the base scheme without the use of additional hardware. Further, under real traffic conditions, we show that the data bus connecting the off-chip packet buffer to the microengines is the obstacle to achieving higher throughput.
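The small-chunk idea above can be sketched minimally as follows. The class, method names, and threshold are illustrative only; the thesis's three schemes on the IXP2400 differ in detail. Chunks below 64 bytes (where DRAM bandwidth suffers) are held in on-chip scratchpad memory, while full-sized segments go to off-chip DRAM:

```python
# Hypothetical sketch of the small-chunk buffering idea: segments smaller
# than 64 bytes are held in on-chip scratchpad, larger ones in DRAM.
SMALL_CHUNK_THRESHOLD = 64  # bytes; below this, DRAM bandwidth degrades

class PacketBuffer:
    def __init__(self):
        self.scratchpad = []  # stands in for on-chip scratchpad memory
        self.dram = []        # stands in for off-chip DRAM packet buffer

    def store(self, chunk):
        """Place a chunk in scratchpad or DRAM based on its size."""
        if len(chunk) < SMALL_CHUNK_THRESHOLD:
            self.scratchpad.append(chunk)
            return "scratchpad"
        self.dram.append(chunk)
        return "dram"

buf = PacketBuffer()
buf.store(b"x" * 40)   # small tail segment -> scratchpad
buf.store(b"x" * 128)  # full-sized segment -> dram
```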

Earlier studies have exploited the statistical multiplexing of flows in the core of the internet to reduce the buffer requirement. We observe that, due to over-provisioning of links, the buffer can be reduced substantially. This enables us to use the on-chip memory in the IXP2400 to buffer packets during most traffic regimes. We propose a dynamic buffering strategy for network processors that holds packets in the receive and transmit buffers when the buffer requirement is low. When the buffer requirement increases due to bursts in the traffic, buffer space is allocated to packets by storing them in the off-chip DRAM. This scheme effectively mitigates the DRAM buffering bottleneck, as only a part of the traffic is stored in the DRAM. We show that with an input traffic rate of 6300 Mbps, the dynamic buffering scheme has a lower packet drop rate than the DRAM buffering scheme. Even at 90% output link utilization, the dynamic buffering scheme supports a line rate of 7390 Mbps with a packet drop rate of 2.17%. This packet loss is still lower than that of the DRAM buffering scheme, which drops more than 4% of the packets even at a traffic rate of 5417 Mbps.
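The dynamic buffering strategy can be sketched as a simple two-level queue; the capacity and names below are illustrative, not the IXP2400 figures. Packets stay in the on-chip buffers while occupancy is low, and only the overflow during bursts spills to DRAM:

```python
# Hypothetical sketch of dynamic buffering: packets occupy the on-chip
# receive/transmit buffers while there is room; during bursts, the
# overflow is allocated in off-chip DRAM instead.
ON_CHIP_CAPACITY = 8  # packets; illustrative figure

class DynamicBuffer:
    def __init__(self, capacity=ON_CHIP_CAPACITY):
        self.capacity = capacity
        self.on_chip = []  # receive/transmit buffer space
        self.dram = []     # off-chip overflow during bursts

    def enqueue(self, pkt):
        """Buffer on chip if space remains; otherwise spill to DRAM."""
        if len(self.on_chip) < self.capacity:
            self.on_chip.append(pkt)
            return "on-chip"
        self.dram.append(pkt)
        return "dram"

    def dequeue(self):
        """Transmit the oldest packet, refilling on-chip space from DRAM."""
        pkt = self.on_chip.pop(0) if self.on_chip else self.dram.pop(0)
        while self.dram and len(self.on_chip) < self.capacity:
            self.on_chip.append(self.dram.pop(0))
        return pkt
```

Since only the burst overflow touches DRAM, the slow path is exercised for just a fraction of the traffic, which is the source of the drop-rate improvement claimed above.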

The distribution of flow sizes in the internet is highly skewed: a few large flows transfer a majority of the packets, while numerous small flows contribute only a small fraction of the traffic. This skewed distribution degrades the performance of result caches under the LRU policy, which tends to retain results belonging to short flows. We propose a new cache management policy, called Saturating Priority, which leverages this distribution to evict results belonging to small flows from the result cache while trying to retain the results of large flows. The proposed replacement policy outperforms the widely used LRU policy for all traces and cache sizes that we considered, and covers up to 74% of the gap between LRU and an oracle policy that inserts a result into the cache only if the flow has multiple packets.


Full Text