By Vakul Garg (vakul@freescale.com) and Varun Sethi (varun.sethi@freescale.com), Senior Software Engineer, Freescale Semiconductor.
This is the last post of a series of three. Please find the first post here and the second one here.
7 – Fixing the problem
Using the ‘perf’ tool, we first measured the number of snoops per second on a control plane core with no control plane application running and the data plane paused. It was almost ‘0’. Next, we measured the number of snoops per second with the data plane running. This number represents the snoop transactions arriving at the control plane due to data plane activity. With ideal partitioning, it should be close to ‘0’.
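The comparison above amounts to subtracting a baseline snoop rate from the rate seen with the data plane running. Below is a minimal sketch of that arithmetic, assuming the raw snoop counts have already been read from the performance counters. The function names and the sample counter values are hypothetical and are not taken from our measurements.

```python
def snoops_per_second(count_start, count_end, interval_s):
    """Average snoop rate over one sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (count_end - count_start) / interval_s


def data_plane_induced_rate(baseline_rate, loaded_rate):
    """Snoop rate on the control plane core attributable to the data plane."""
    return max(0.0, loaded_rate - baseline_rate)


# Hypothetical counter readings, each pair taken 10 seconds apart.
baseline = snoops_per_second(1_000, 1_050, 10.0)      # data plane paused
loaded = snoops_per_second(2_000, 5_002_000, 10.0)    # data plane running
induced = data_plane_induced_rate(baseline, loaded)
print(f"induced snoops/s: {induced:.1f}")
```

With well-partitioned coherency domains, the induced rate should stay close to the baseline, that is, close to ‘0’.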
The system under test was already running the control and data plane applications in their own partitions, each representing a separate coherency domain. Originally, all the Ethernet ports in the system shared a common set of buffer pools from which they picked buffers to receive frames. A direct implication of sharing pools across all the ports was that the memory used to seed the buffer pools had to be declared ‘coherent’ across both the control and data plane partitions.
Since each partition processed frames from its own Ethernet ports exclusively, there was no real need to use shared memory for the buffer pools. We assigned a separate set of buffer pools to the ports owned by each partition. These pools were seeded with partition-private memory buffers. As a result, frames arriving via the 10G ports (which were owned by the data plane) no longer caused snoops to reach the control plane cores. After this change, we measured the snoops per second at the control plane core again. The rate came down drastically from what was observed originally, but it was still not close to ‘0’.
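The effect of this change can be illustrated with a small model. It is only a sketch under simplifying assumptions: a frame received on a port is taken to generate snoops in every partition for which the pool's memory is declared coherent. The class names, port name and partition labels are hypothetical and do not correspond to any real driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferPool:
    name: str
    coherent_with: frozenset  # partitions whose cores snoop this memory


@dataclass(frozen=True)
class Port:
    name: str
    owner: str        # partition that processes frames from this port
    pool: BufferPool  # pool the port draws receive buffers from


def snooped_partitions(port):
    """Partitions that see snoop traffic when a frame lands on this port."""
    return set(port.pool.coherent_with)


def stray_snoops(port):
    """Partitions snooped even though they never touch this port's frames."""
    return snooped_partitions(port) - {port.owner}


# Before: one shared pool, seeded from memory coherent across both partitions.
shared = BufferPool("shared", frozenset({"control", "data"}))
before = Port("10g-0", owner="data", pool=shared)

# After: the data plane port uses a pool seeded with partition-private memory.
private = BufferPool("dp-private", frozenset({"data"}))
after = Port("10g-0", owner="data", pool=private)

print(stray_snoops(before))  # the control plane is snooped needlessly
print(stray_snoops(after))   # no partition other than the owner is snooped
```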
The task was now to find the source of the remaining snoop transactions on the control plane. By reviewing the system configuration, we found that the scratchpad memory assigned to the hardware accelerators QMAN and BMAN was declared coherent. As described previously, this is not required. We changed the attribute of the scratchpad memory to non-coherent and measured the snoop rate again. This time it was close to ‘0’.
Finally, we measured the performance of the memory copy bandwidth application and the SIP stack again on the control plane while the data plane was running at its full rate. This time, the performance of the control plane remained unaffected irrespective of whether the data plane was running or paused.
8 – Software design recommendations
Since the data plane and the control plane have extremely low data sharing requirements, it should be possible to run them under different coherency domains so as to restrict the generated snoops to their respective domains.
Device private memory must be marked non-coherent
The software running on the cores is never supposed to access the address range reserved as scratchpad. Hence, we can be certain that none of the addresses in this range will ever be present in any core's local cache. This obviates the need to declare the scratchpad address range coherent.
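A configuration review of this kind can be expressed as a simple check over the memory layout. The sketch below assumes each region is described by a name and two flags; the `MemoryRegion` type, the region names and the helper functions are hypothetical, and on a real system the coherency attribute would be set through the platform's own configuration mechanism.

```python
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    name: str
    device_private: bool  # True if only a hardware block accesses it
    coherent: bool        # True if accesses to it generate snoops


def needlessly_coherent(regions):
    """Names of device-private regions that are still marked coherent."""
    return [r.name for r in regions if r.device_private and r.coherent]


def mark_device_private_non_coherent(regions):
    """Return a copy of the layout with device-private regions non-coherent."""
    return [
        MemoryRegion(r.name, r.device_private,
                     r.coherent and not r.device_private)
        for r in regions
    ]


layout = [
    MemoryRegion("qman-scratchpad", device_private=True, coherent=True),
    MemoryRegion("bman-scratchpad", device_private=True, coherent=True),
    MemoryRegion("control-plane-ram", device_private=False, coherent=True),
]

print(needlessly_coherent(layout))   # both scratchpad regions are flagged
fixed = mark_device_private_non_coherent(layout)
print(needlessly_coherent(fixed))    # nothing left to flag
```

Memory that the cores do access, such as the control plane RAM in this example, keeps its coherent attribute.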
Separate the buffers for I/O ports private to control plane and data plane
The I/O ports private to the control plane and those private to the data plane must use different sets of buffer pools. Care should be taken not to seed these buffer pools with memory that is shared across the control and data planes.