By Vakul Garg (vakul@freescale.com) and Varun Sethi (varun.sethi@freescale.com), Senior Software Engineer, Freescale Semiconductor.
This is the second post of a series of three. You can find the first post here.
4 – Generation of snoops due to packet processing
When the ingress I/O controller (e.g. Ethernet) copies a frame from its internal DMA FIFO to memory, snoops are generated to invalidate any copies of those addresses held in the cores' local caches. Snoop transactions are also generated when the frame headers are brought into the local caches of the cores running the data plane (either by a demand miss or by stashing). Each snoop makes every other core in the same coherency domain check whether its local cache holds a copy of the address being accessed. If a core does hold a copy, that copy is invalidated (since it must be stale).
The snoop requests that data plane activity sends to the control plane cores reduce the number of productive cycles in which a control plane core can complete instructions. The result is a lower IPC (instructions per cycle) count.
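The effect on IPC can be pictured with a very rough model. The sketch below assumes that every snoop costs the control plane core a fixed number of stall cycles; the function name, the per-snoop cost and all of the numbers are hypothetical and chosen only to show the trend, not taken from P4080 measurements.

```python
def effective_ipc(instructions, base_cycles, snoops, stall_cycles_per_snoop):
    """IPC after adding snoop-induced stall cycles to the cycle count.

    Assumption: each snoop costs a fixed number of stall cycles. Real cores
    overlap snoop lookups with other work, so this is only a rough model.
    """
    total_cycles = base_cycles + snoops * stall_cycles_per_snoop
    return instructions / total_cycles


# Hypothetical workload: 1M instructions that need 1M cycles with no snoops.
baseline_ipc = effective_ipc(1_000_000, 1_000_000, snoops=0,
                             stall_cycles_per_snoop=2)
# Same workload while 125,000 snoops arrive, each assumed to cost 2 cycles.
loaded_ipc = effective_ipc(1_000_000, 1_000_000, snoops=125_000,
                           stall_cycles_per_snoop=2)

print(f"IPC without snoops: {baseline_ipc:.2f}")
print(f"IPC with snoops:    {loaded_ipc:.2f}")
```

In this model the instruction count stays the same while the cycle count grows with the number of snoops, which is exactly why IPC drops.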
5 – Source of Snoops
Device Generated
Embedded multicore networking processors often pack many I/O devices (e.g. Ethernet, RapidIO) and hardware offload accelerators along with the cores. These accelerators execute common functions required for efficient packet processing. For example, the Freescale QorIQ P4080 platform has QMAN (for queue management), SEC (for cryptographic processing) and BMAN (for buffer pool management), among others. The accelerators often use system memory as a scratchpad for their own private housekeeping data structures. If this scratchpad memory is declared coherent, any access to it by the accelerator causes snoop transactions on the system bus.
When an I/O device or an accelerator reads frame contents (e.g. frame transmission by the Ethernet controller or encryption by the crypto block), snoop transactions are generated, since any core's local cache might hold the most recently modified copy of the frame. Similarly, when a frame is written by an accelerator (e.g. IPsec encapsulation by the crypto hardware), snoops are generated for the addresses being written, so that any copies present in the cores' local caches are invalidated.
Software (core) generated
An access to an address by software running on a core generates snoop transactions if the address falls in a page marked coherent. On a multicore SMP system, usually the whole of memory is declared coherent. In many cases this becomes a source of many unnecessary snoops into the cores. For example, if a certain piece of data is, by design, always accessed from one fixed core, then coherency maintenance is not required for its address. In a multicore packet processing system, accessing frame headers may generate snoops, depending on whether the address being accessed is already in the exclusive state in the core's local cache.
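To make the two snoop sources concrete, here is a toy model of a coherency domain. It is a deliberately simplified sketch and not how the P4080 interconnect works: each cache is just a set of line addresses, every core access is treated as taking exclusive ownership, and the class and function names are made up for illustration.

```python
class ToyCoherenceDomain:
    """Toy model: each core's cache is a set of cache-line addresses."""

    def __init__(self, num_cores):
        self.caches = [set() for _ in range(num_cores)]
        self.snoops = [0] * num_cores  # snoops received by each core

    def _snoop(self, line, requester=None):
        # Every core except the requester looks up the line and drops it.
        for core, cache in enumerate(self.caches):
            if core == requester:
                continue
            self.snoops[core] += 1
            cache.discard(line)

    def device_write(self, line, coherent=True):
        # Device generated: a coherent DMA write snoops every core.
        if coherent:
            self._snoop(line)

    def core_access(self, core, line, coherent=True):
        # Software generated: snoop only if the page is coherent and the
        # line is not already held (exclusively, in this model) by the core.
        if coherent and line not in self.caches[core]:
            self._snoop(line, requester=core)
        self.caches[core].add(line)


def run_reflector(frames, coherent):
    """Snoops seen by core 0 (control plane) while core 1 handles frames."""
    domain = ToyCoherenceDomain(num_cores=8)
    for frame in range(frames):
        line = 0x1000 + 64 * (frame % 16)      # recycle 16 frame buffers
        domain.device_write(line, coherent)    # ingress DMA writes the frame
        domain.core_access(1, line, coherent)  # data plane reads the header
    return domain.snoops[0]


coherent_snoops = run_reflector(1000, coherent=True)
private_snoops = run_reflector(1000, coherent=False)
print(coherent_snoops, private_snoops)  # prints "2000 0"
```

Core 0 never touches a frame, yet in the coherent case it is snooped twice per frame: once for the device write and once for the header access by core 1. The non-coherent case stands in for memory that, by design, only one core and its device ever touch.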
6 – Case Study
To get an idea of the impact of snoops (due to data plane activity) on control plane performance, we set up an experiment on Freescale's multicore QorIQ P4080 platform, which has 8 CPUs. We used Freescale's embedded hypervisor software to set up two static partitions on the processor. The first partition (1 CPU) ran a Linux-based control plane and the second partition (7 CPUs) ran a data plane based on Freescale's LWE (Light Weight Executive).
The data plane was assigned two 10G Ethernet ports and the control plane was assigned a single 1G Ethernet port.
The data plane ran a bare-metal, run-to-completion packet reflector application. It received IPv4 packets on the two 10G ports of the processor. The Ethernet frame size was 64 bytes, and both 10G ports were driven at line rate. The data plane swapped the source and destination IP addresses and MAC addresses of each incoming frame and sent it back out through the Ethernet port on which it was received.
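To put the load in perspective, the frame rate implied by 64-byte frames at 10G line rate can be worked out as follows. The sketch assumes the standard Ethernet per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap); the function and constant names are ours.

```python
PREAMBLE_SFD_BYTES = 8     # preamble (7) + start-of-frame delimiter (1)
INTERFRAME_GAP_BYTES = 12  # minimum idle time between frames, in byte times


def frames_per_second(link_bps, frame_bytes):
    """Maximum frame rate on an Ethernet link for a given frame size."""
    wire_bytes = frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)


per_port = frames_per_second(10_000_000_000, 64)
total = 2 * per_port  # two 10G ports, both at line rate

print(f"{per_port / 1e6:.2f} Mfps per port, {total / 1e6:.2f} Mfps in total")
# prints "14.88 Mfps per port, 29.76 Mfps in total"
```

Every one of these frames is written to memory by the Ethernet controller and then touched by a data plane core, so the snoop traffic scales with this frame rate.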
We tried two different applications on the control plane, described below. We measured the performance of each control plane application twice: once with the data plane application paused and once with it running. The number of snoops reaching the control plane core was counted using the open source tool 'perf', which uses the core's performance monitor hardware to count snoop request events.
Memory copy bandwidth test
Our first application on the control plane was the memory copy bandwidth test (bw_mem) from the LMbench benchmark suite. We used it to copy very large buffers (1 GB). The performance metric collected was the amount of data that could be copied per second.
SIP stack
The second application we tried was a real-world one. We used an open source SIP (Session Initiation Protocol) implementation, PJSIP, on the control plane. The performance of the SIP stack was measured by running PJSIP in both client and server mode on the same control plane partition, with the server and client connected through the loopback interface. We measured the time taken by the SIP client to start and terminate 20000 calls.
We found that both of these control plane applications slowed down when the data plane ran alongside the control plane, compared to when the data plane was paused.
The memory copy test slowed down by about 20%. For the SIP stack, the slowdown was about 10%.
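Note that the two tests report different kinds of metrics: the memory copy test reports a rate (higher is better), while the SIP test reports an elapsed time (lower is better). A minimal sketch of how a slowdown percentage could be derived in each case is shown below. The formulas are our assumption of a reasonable definition, and the input values are placeholders picked to land on the percentages above; they are not the measured values.

```python
def throughput_slowdown(baseline_rate, loaded_rate):
    """Fractional drop in a rate metric (e.g. bytes copied per second)."""
    return 1.0 - loaded_rate / baseline_rate


def time_slowdown(baseline_seconds, loaded_seconds):
    """Fractional increase in a time metric (e.g. seconds per 20000 calls)."""
    return loaded_seconds / baseline_seconds - 1.0


# Placeholder inputs, NOT measurements from the experiment.
copy_drop = throughput_slowdown(baseline_rate=1000.0, loaded_rate=800.0)
sip_increase = time_slowdown(baseline_seconds=100.0, loaded_seconds=110.0)

print(f"memory copy: {copy_drop:.0%} slower")
print(f"SIP calls:   {sip_increase:.0%} slower")
```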
Note that we used a minimalist data plane application here. If the data plane application used a hardware offload accelerator, such as the security block for frame encryption and decryption, in a pipelined fashion, then each frame would be received twice at the data plane cores. This would double the number of snoops and hence cause an even larger performance degradation at the control plane cores.