By Vakul Garg (vakul@freescale.com) and Varun Sethi (varun.sethi@freescale.com), Senior Software Engineer, Freescale Semiconductor.
1 – Introduction
In multicore systems, the cost of hardware-enforced coherency grows with the number of cores. This can be attributed to the requirements of a snooping-based coherent system, in which each core must inspect the memory traffic of every other core. Because each of the ‘n’ nodes in a multicore system must process the snoop requests of all other (n-1) nodes, the number of coherence actions scales as O(n²). These coherence actions affect overall processor performance because snoop requests interfere with a core’s access to its own cache.
The cost of hardware-enforced coherency is not well understood and is therefore neglected by most programmers. When configuring system coherency, programmers often unknowingly make memory coherent among all hardware blocks and cores in the system. As discussed above, ignoring the coherency cost on multicore systems can result in low CPU throughput due to unnecessary snoop traffic.
In this paper we present guidelines to mitigate the performance problems that arise from a mismatch between the hardware coherency configuration and the actual application requirements.
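The quadratic growth can be illustrated with a small calculation. The sketch below assumes the simplest possible model, in which every core issues the same number of memory requests and each request is snooped by all other cores; the function name and the parameter `requests_per_core` are illustrative and not part of any hardware specification.

```python
def snoop_actions(n_cores: int, requests_per_core: int = 1) -> int:
    """Total snoop lookups when every request is seen by all other cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # n cores each issue requests; each request is processed by (n - 1) peers.
    return n_cores * requests_per_core * (n_cores - 1)


for n in (2, 4, 8, 16):
    print(n, snoop_actions(n))
# 2 2
# 4 12
# 8 56
# 16 240
```

Under this model, doubling the core count roughly quadruples the snoop load, which is the O(n²) behaviour described above.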
2 – Concept of Coherency
The effectiveness of multicore systems depends on parallel software continuing to deliver exponential performance gains. Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. Although processors logically access the same memory, on-chip cache hierarchies are crucial to achieving fast performance for the majority of memory references made by processors. A key problem of shared-memory multiprocessors is therefore providing a consistent view of memory across the various cache hierarchies. This cache coherence problem is a critical design point, for both correctness and performance, in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory. Assuming the shared-memory programming model remains prominent, future workloads will depend upon the performance of the cache-coherent memory system.
A widely adopted approach to cache coherence is snooping on a bus. A bus connects all components to an electrical, or logical, set of wires. A bus provides key ordering and atomicity properties that enable straightforward coherence operations. First, all endpoints on a bus observe transmitted messages in the same total order. Second, buses provide atomicity, such that only one message can appear on the bus at a time and all endpoints observe that message. Third, buses implement shared lines that allow any endpoint to manipulate a signal or condition that is globally visible to all other endpoints during a bus transaction. Shared lines facilitate both bus arbitration and cache coherence operations.
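The role of total ordering and atomicity can be sketched with a toy software model of a snooping bus. The model assumes a simple three-state (Modified/Shared/Invalid) protocol, which is a common textbook choice and is not taken from this paper; the class and method names are hypothetical. Each call to `transact` stands for one atomic bus message that every other cache observes, in order.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> State

    def state(self, addr):
        return self.lines.get(addr, State.INVALID)

    def snoop(self, op, addr):
        """React to a bus transaction issued by another cache."""
        if self.state(addr) is State.INVALID:
            return  # nothing cached, nothing to do
        if op == "read":
            # Another reader appears: a modified copy is downgraded to shared.
            self.lines[addr] = State.SHARED
        elif op == "write":
            # Another writer appears: the local copy becomes stale.
            self.lines[addr] = State.INVALID


class Bus:
    def __init__(self):
        self.caches = []
        self.log = []  # the single total order seen by every endpoint

    def attach(self, cache):
        self.caches.append(cache)

    def transact(self, requester, op, addr):
        # One message at a time; every other cache snoops it.
        self.log.append((requester.name, op, addr))
        for cache in self.caches:
            if cache is not requester:
                cache.snoop(op, addr)
        requester.lines[addr] = State.SHARED if op == "read" else State.MODIFIED


bus = Bus()
core0, core1 = Cache("core0"), Cache("core1")
bus.attach(core0)
bus.attach(core1)

bus.transact(core0, "read", 0x100)   # core0: S
bus.transact(core1, "read", 0x100)   # core0: S, core1: S
bus.transact(core0, "write", 0x100)  # core0: M, core1: I
print(core0.state(0x100).value, core1.state(0x100).value)  # M I
```

Note that every transaction costs each attached cache a snoop lookup, even when that cache holds no copy of the line; this is the overhead the rest of the paper is concerned with.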
3 – Nature of Packet Processing Applications
Packet processing applications such as IP routers and Layer 2 switches are typically split into a control plane and a data plane. The control plane implements the algorithmically intensive part of the application (e.g. route calculation). It typically performs stateful processing and executes long state machines per input event. The number of frames processed by the control plane is a very small fraction of the number of frames processed by the data plane.
The data plane processes the bulk of the incoming frames. It typically operates on the frame headers, and the processing involves header parsing, table lookups, header modification, encapsulation, decapsulation, etc. Accessing frame headers requires the frame to be brought into the core-local cache. This happens either when the core incurs a cache miss or through a stashing operation, in which the I/O devices pre-position frame headers inside the core-local cache.
The frames processed by the data plane typically do not need to be accessed by the control plane. The two planes work largely independently of each other and share very little data. The control plane occasionally communicates with the data plane through special proprietary control events to manage the tables and connections used by the data plane. These events are passed using well-known IPC methods such as message queues.
In some cases the data plane needs to forward a frame to the control plane, or vice versa, but the overall count of such frames is extremely small.
On a multicore processor, it is common to reserve two non-overlapping sets of cores, one for the control plane and one for the data plane. The number of cores reserved for each plane depends on its compute requirement. A larger number of cores for the data plane means greater frame processing capability.
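The header-only nature of data plane work can be sketched as follows. This is a minimal illustration, assuming a 20-byte IPv4 header without options and an exact-match forwarding table keyed by destination address; real data planes typically use longest-prefix match and validate the incoming checksum, and the function names here are hypothetical.

```python
import socket
import struct

# Fixed 20-byte IPv4 header: ver/ihl, tos, length, id, flags/frag,
# ttl, protocol, checksum, source address, destination address.
IPV4_HDR = struct.Struct("!BBHHHBBH4s4s")


def checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over an even-length byte string."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward(frame: bytes, table: dict):
    """Parse, look up, and rewrite the header. Returns (port, frame) or None."""
    if len(frame) < IPV4_HDR.size:
        return None
    fields = list(IPV4_HDR.unpack_from(frame))
    ttl, dst = fields[5], fields[9]
    port = table.get(dst)              # table lookup
    if port is None or ttl <= 1:
        return None                    # no route or TTL expired: drop
    fields[5] = ttl - 1                # header modification
    fields[7] = 0                      # zero the checksum before recomputing
    fields[7] = checksum(IPV4_HDR.pack(*fields))
    return port, IPV4_HDR.pack(*fields) + frame[IPV4_HDR.size:]


table = {socket.inet_aton("10.0.0.2"): 3}
header = IPV4_HDR.pack(0x45, 0, 28, 0, 0, 64, 17, 0,
                       socket.inet_aton("10.0.0.1"),
                       socket.inet_aton("10.0.0.2"))
result = forward(header + b"payload!", table)
```

Only the first 20 bytes of the frame are read or written, which is why bringing just the header into the core-local cache is sufficient for this kind of processing.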
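The message-queue interaction described above can be sketched as below. The event format, the table, and the function names are assumptions made for illustration; the proprietary control events mentioned in the text are not specified in this paper. The point is that the control plane never touches the data plane's table directly, so the table itself does not have to be shared between the two sets of cores.

```python
import queue

control_events = queue.Queue()  # control plane -> data plane
flow_table = {}                 # owned and modified by the data plane only


def control_plane_add_route(dst, port):
    control_events.put(("add", dst, port))


def control_plane_del_route(dst):
    control_events.put(("del", dst, None))


def data_plane_poll_events():
    """Drain pending control events; return how many were applied."""
    applied = 0
    while True:
        try:
            op, dst, port = control_events.get_nowait()
        except queue.Empty:
            return applied
        if op == "add":
            flow_table[dst] = port
        elif op == "del":
            flow_table.pop(dst, None)
        applied += 1


control_plane_add_route("10.0.0.2", 3)
control_plane_add_route("10.0.0.9", 1)
control_plane_del_route("10.0.0.9")
applied = data_plane_poll_events()  # called between bursts of frames
```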
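A minimal sketch of such a split is shown below, assuming cores are numbered from 0 and that the lowest-numbered cores go to the control plane; both choices are arbitrary and the function name is hypothetical. On Linux, the resulting sets could be applied to processes with `os.sched_setaffinity`, which is not called here so that the sketch stays portable.

```python
def partition_cores(total_cores: int, control_cores: int):
    """Split core IDs into two non-overlapping sets (control, data)."""
    if not 0 < control_cores < total_cores:
        raise ValueError("each plane needs at least one core")
    control = set(range(control_cores))
    data = set(range(control_cores, total_cores))
    return control, data


control_set, data_set = partition_cores(total_cores=8, control_cores=2)
print(sorted(control_set), sorted(data_set))
# [0, 1] [2, 3, 4, 5, 6, 7]
```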