Alexander Bachmutsky – Chief Architect at Nokia Siemens Networks
This post is the continuation of the discussion started by Kin-Yip Liu from Cavium Networks.
In general, I agree that the best solution for Fast Path data plane processing is a bare metal environment, which provides the most optimized way to process the traffic. Fast Path processing can also be implemented in standard Linux, but as rightly stated in the post mentioned above, “This is because synchronization and scheduling processing among multiple cores fundamentally takes significant overhead.”
However, I would like to argue that this problem can have a relatively “easy” solution. Linux already supports thread affinity, which lets a particular thread be assigned to a specific core. If only a single thread is assigned to a core, that thread is implemented as a run-to-completion module, and it has the required access to the hardware, the resulting processing will be very similar to a bare metal implementation. The remaining problem is that Linux still tries to schedule some interrupts and other threads on this core. This is where the “easy” solution comes in: remove that particular core from Linux thread scheduling and timer scheduling. With this change, there is no reason why a Linux-based Fast Path should not reach a performance level very close to the bare metal case.
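As a rough illustration of the affinity part only, here is a minimal Python sketch of a worker thread that pins itself to one core and then handles each packet to completion. It assumes a Linux host; the names `run_to_completion` and `start_fast_path` are made up for this example, and a real Fast Path would be written in C against the hardware rather than over a Python list.

```python
import os
import threading

def run_to_completion(core, packets, handle, out):
    """Pin the calling thread to `core`, then finish each packet before taking the next."""
    os.sched_setaffinity(0, {core})  # pid 0 means the calling thread (Linux only)
    out["mask"] = os.sched_getaffinity(0)  # record where the thread is allowed to run
    for pkt in packets:
        # No blocking calls and no hand-off to other threads between packets.
        out["results"].append(handle(pkt))

def start_fast_path(core, packets, handle):
    """Hypothetical helper: run one pinned worker and return its results and CPU mask."""
    out = {"results": [], "mask": None}
    worker = threading.Thread(target=run_to_completion,
                              args=(core, packets, handle, out))
    worker.start()
    worker.join()
    return out["results"], out["mask"]
```

Affinity only decides where the thread may run. It does not keep the kernel's own timers, interrupts and other threads off that core, which is exactly the gap described above.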
There are a number of things to remember in this approach:
1. There should be at least one core running “regular” Linux with its full scheduling and interrupt handling. This is one of the disadvantages compared to a full AMP-based bare metal approach.
2. There can be any number of such special threads, up to the total number of cores minus the one “generic” core.
3. The special thread must have access to the HW blocks. This can be done either through kernel threads or, if the processor manufacturer provides one, through a library that lets a user-level thread access the hardware directly. Such libraries are available for a number of the latest generation multicore processors. A full implementation would allow HW access to go through a low-overhead virtualization layer (hypervisor).
4. It is highly recommended to use very strict memory protection mechanisms to make sure that this special thread cannot harm other tasks in the system and vice versa.
5. There should be a communication library (in SW or HW) to exchange information with standard Linux cores.
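To make point 5 a little more concrete, one common shape for such a communication library is a fixed-size single-producer/single-consumer ring between a Fast Path core and a regular Linux core. The sketch below is only an illustration of the idea: the class name `SpscRing` is invented here, and a production version would live in shared memory with proper memory barriers, or use hardware queues where the processor offers them.

```python
class SpscRing:
    """Fixed-size single-producer/single-consumer ring (illustrative only)."""

    def __init__(self, capacity):
        # One slot always stays empty so that "full" and "empty" look different.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; advanced only by the consumer
        self._tail = 0  # next slot to write; advanced only by the producer

    def push(self, msg):
        """Producer side: return False instead of blocking when the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = msg
        self._tail = nxt
        return True

    def pop(self):
        """Consumer side: return None instead of blocking when the ring is empty."""
        if self._head == self._tail:
            return None
        msg = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return msg
```

Neither side ever blocks, so the Fast Path thread can keep its run-to-completion behaviour and simply drop or retry when the ring is full.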
So what are the other pros and cons? The main problem is that current Linux does not support this functionality. We have discussed the feature with a number of Linux distribution companies for probably the last 2 years, but the only similar capability announced so far came recently from Tilera in the form of Zero Overhead Linux (ZOL).
The major advantage is a much shorter software porting effort, along with the availability of the toolchain, libraries, utilities, debugging and other “goodies” that come with Linux.
There is no doubt that bare hardware can still provide an even more optimized implementation, but the simplicity and time-to-market of the proposed Linux-based concept can make it a popular development path. The biggest shift would happen if and when the concept is accepted into the kernel.org tree and eventually appears in every Linux distribution.
If a good Linux based Fast Path is achievable (let’s assume there is no technical issue), then what are the differences in code size against a regular executive which runs in “bare metal”?
Is it just a matter of a missing Linux API?
From 6WIND’s experience:
If you “boot” baremetal:
– you need about 1500 to 2000 lines of assembly and C code (bootstrapping, interrupt handler, memory initialization, etc.), depending on the multicore CPU
If you “boot”/run from Linux userland:
– you need about 300 to 500 lines of ‘main()’ code to start the Fast Path
So it seems that the Linux startup code is 4 to 5 times smaller than in the Executive environments. But then there are the SoC-specific pieces (Ethernet, crypto engines, packet parsers, performance counters, packet buffers, hardware allocators, etc.):
– there are about 100K to 200K lines of code for HW management
This means that a Linux userland Fast Path reduces the total code size by only about 1.2% compared with a bare metal one (roughly 1,200 lines saved out of more than 100K). It is a small benefit.
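For anyone who wants to check the arithmetic, here is the calculation spelled out as a small Python sketch using the line counts quoted above. The function name is made up, and pairing the low estimates together and the high estimates together is my own assumption about how the figures combine.

```python
def saving_percent(bare_boot, linux_boot, hw_code):
    """Share of the whole bare metal code base saved by starting from Linux userland."""
    saved = bare_boot - linux_boot
    return 100.0 * saved / (bare_boot + hw_code)

# Low estimates: 1500 vs 300 lines of startup code, 100K lines of HW management.
low = saving_percent(1500, 300, 100_000)
# High estimates: 2000 vs 500 lines of startup code, 200K lines of HW management.
high = saving_percent(2000, 500, 200_000)

print(f"low estimates: {low:.2f}%, high estimates: {high:.2f}%")
```

With the low estimates the saving comes out at about 1.2%, and with the high estimates it is below 1%.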
Of course, if we count all the other benefits of Linux (easier to debug, manage, start/stop, etc.), the balance tilts further toward Linux. But the amount of SoC-specific code is so large that, whatever the environment, we need a unified API to be able to:
– get packets in/out
– manage memory
– call crypto engines
– etc.
which together account for most of the code complexity.
Then, once we get it, this unified API could be reused inside Linux.
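To give an idea of what such a unified API could look like, here is a minimal Python sketch. Every name in it (`FastPathPlatform`, `LoopbackPlatform`, `forward_encrypted` and the method names) is hypothetical and chosen for this example; it is not the API of any CPU vendor or of any existing SDK. The XOR "encryption" is only a placeholder for a crypto engine call.

```python
from abc import ABC, abstractmethod
from collections import deque

class FastPathPlatform(ABC):
    """Hypothetical unified API that each SoC port would implement."""
    @abstractmethod
    def alloc_buffer(self, size): ...
    @abstractmethod
    def free_buffer(self, buf): ...
    @abstractmethod
    def rx(self, port): ...
    @abstractmethod
    def tx(self, port, buf): ...
    @abstractmethod
    def encrypt(self, buf, key): ...

class LoopbackPlatform(FastPathPlatform):
    """Software stand-in: whatever is sent on a port can be received on that port."""
    def __init__(self):
        self._queues = {}
    def alloc_buffer(self, size):
        return bytearray(size)
    def free_buffer(self, buf):
        buf.clear()
    def rx(self, port):
        queue = self._queues.get(port)
        return queue.popleft() if queue else None
    def tx(self, port, buf):
        self._queues.setdefault(port, deque()).append(buf)
    def encrypt(self, buf, key):
        # XOR placeholder standing in for a crypto engine; not real cryptography.
        return bytearray(b ^ key[i % len(key)] for i, b in enumerate(buf))

def forward_encrypted(platform, in_port, out_port, key):
    """Fast Path logic written once against the API, independent of the SoC."""
    buf = platform.rx(in_port)
    if buf is None:
        return False
    platform.tx(out_port, platform.encrypt(buf, key))
    platform.free_buffer(buf)
    return True
```

The point of the split is that `forward_encrypted` never changes between a bare metal port and a Linux userland port. Only the class behind the interface does, and that class is where the 100K to 200K lines live.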
At 6WIND, we have created a software layer that provides this convergence API (named FPN SDK). When a Linux userland solution becomes available, it can be applied there too, but it will not remove the problem of the SoC-specific code.
So it seems that the “drivers” for a Linux userland Fast Path will not be the unified API, because the gain in code size remains low (unless the CPU vendors work together or somebody creates an open API that they can integrate).
I strongly believe that the main gain will be the manageability of a Linux-based Fast Path.
Should we list the requirements of manageability when it runs as “userland”?
– be managed as a Linux process/daemon, which means:
– start/stop anytime for in service upgrades
– change core distributions during runtime according to traffic load
– debuggers
– profilers
– support graceful restart for in service upgrades
– ???
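As one small example of what “managed as a Linux process/daemon” could mean in practice, here is a Python sketch of a polling loop that stops cleanly when asked, for instance on SIGTERM during an in-service upgrade. The class and method names are invented for this example, and a real Fast Path would do this in C with its own packet source.

```python
import signal

class FastPathDaemon:
    """Polling loop that finishes the current packet and then stops when asked."""

    def __init__(self, handle):
        self._handle = handle
        self._stop = False
        self.processed = 0

    def request_stop(self, signum=None, frame=None):
        # Signature matches a Python signal handler; only sets a flag.
        self._stop = True

    def install_signal_handlers(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def run(self, poll):
        """Call poll() until a stop is requested; poll() returns a packet or None."""
        while not self._stop:
            pkt = poll()
            if pkt is not None:
                self._handle(pkt)
                self.processed += 1
```

Because the handler only sets a flag, the loop never abandons a packet halfway through, which is the behaviour a graceful restart needs.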