By Eric Carmes – 6WIND Founder and CEO
One of the primary reasons to select multicore processor technology is performance. Performance benchmarks are therefore very important when selecting hardware and packet processing software.
At the beginning of the design phase of equipment, one of the most important issues for designers is to evaluate a complete configuration of fully operational packet processing software, in order to gauge how the real application will perform on the selected multicore hardware. The main questions to be answered are as follows:
- What is the impact of a full-featured implementation of all the required protocols on the system performance?
- How modular is the implementation, to perform only the necessary processing for a dedicated application and progressively introduce new features without redesigning the software?
- What is the impact on global performance after the integration of Data Plane functions with the Control Plane?
- How does performance scale with the number of cores?
The very first benchmarks generally provided cover IPv4 forwarding and show how performance scales with the number of cores. Benchmarks should be given in Mpps (millions of packets per second), not Mbps, because a Mbps figure can lead to wrong estimates depending on whether it includes the overhead of the Ethernet Inter Frame Gap (IFG), counts pure Ethernet bandwidth, or counts only pure IP payload bandwidth.
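To make the Mpps versus Mbps point concrete, here is a minimal Python sketch. It assumes standard untagged Ethernet framing (8 bytes of preamble and start-of-frame delimiter, a 12-byte minimum IFG, a 14-byte header and a 4-byte FCS); the function names are mine and purely illustrative. It shows that a single packet rate can be quoted as three quite different Mbps figures.

```python
# Per-frame bytes on the wire that are not part of the Ethernet frame itself.
PREAMBLE_SFD_BYTES = 8   # 7-byte preamble + 1-byte start-of-frame delimiter
IFG_BYTES = 12           # minimum inter-frame gap
ETH_HEADER_FCS_BYTES = 18  # 14-byte header + 4-byte FCS, untagged frame


def line_rate_mpps(link_bps, frame_bytes):
    """Maximum packet rate (Mpps) of a link for a given Ethernet frame size."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + IFG_BYTES) * 8
    return link_bps / wire_bits / 1e6


def mbps_views(mpps, frame_bytes):
    """The same packet rate expressed as three different Mbps figures."""
    return {
        "wire_with_ifg": mpps * (frame_bytes + PREAMBLE_SFD_BYTES + IFG_BYTES) * 8,
        "pure_ethernet": mpps * frame_bytes * 8,
        "pure_ip": mpps * (frame_bytes - ETH_HEADER_FCS_BYTES) * 8,
    }


if __name__ == "__main__":
    rate = line_rate_mpps(10e9, 64)  # 10 Gbps link, 64-byte frames
    print(round(rate, 2), "Mpps")
    for name, mbps in mbps_views(rate, 64).items():
        print(name, round(mbps), "Mbps")
```

With 64-byte frames on a 10 Gbps link, the three Mbps views differ by almost a factor of two, whereas the Mpps figure is unambiguous.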
“Fast – Fast Path Forwarding” options can be implemented that take shortcuts in packet processing by skipping some strictly required protocol checks. The performance improvements are nice, but these shortcuts can introduce packet de-sequencing, which is generally unacceptable for the whole system.
The impact of adding new protocols should also be demonstrated. To start with, what is the performance of IPv4 Fragmentation and Reassembly? The implementation of other protocols generally uses hooks in the IPv4 forwarding call flow. IPv4 forwarding has to be designed in such a way that the impact of stacking protocols is minimized. A stand-alone IPv4 forwarding implementation will always be more efficient but is useless for most applications.
Providing IPv4 forwarding benchmarks makes sense as a starting point, but it does not give a real idea of overall performance. Practical NGN use cases show that you need to at least implement VLAN, L3 encapsulation, IPsec, GRE, IPinIP… and apply a firewall or a NAT rule. If we consider all the protocols, IPv4 forwarding accounts for 10-20% of the overall processing capabilities.
Below are the relative processing costs we have measured for the usual protocols in a Fast Path implementation, compared to IPv4 forwarding, which counts for 1:
- IP Forwarding: 1 (integrated with all other protocols)
- VLAN Tagging: 0.19
- VLAN Untagging: 0.24
- IP in IP Encapsulation (and more generally L3 encapsulation): 0.35
- IP in IP Decapsulation (and more generally L3 decapsulation): 0.3
- IPsec Encapsulation using AES and SHA1: 2.1 (with crypto-engines assist)
- IPsec Decapsulation using AES and SHA1: 1.9 (with crypto-engines assist)
- Firewall, NAT: 0.5
These numbers only measure the Fast Path part of each protocol, given that complex signalling packets are handled either by the OS stack or by Control Plane protocols. The processing time necessary to synchronise the Fast Path with the OS stack and the Control Plane protocols is also taken into account.
These are only average numbers, and real performance depends on more parameters such as table sizes, packet length, etc., but this gives a far better idea of achievable performance than a stand-alone “optimized” IPv4 benchmark.
More information about 6WINDGate architecture can be found here.
You can download a detailed Application Note “Multicore Meets Growing Demand for High Performance Packet Processing” here.
You can check 6WINDGate FAQ here.
Eric,
It would be good to have incremental performance numbers. For example, assuming that IP forwarding is in place, what would be the impact of tunnel encapsulation? Could it be that your numbers reflect exactly this concept?
Alex,
Thanks for your comment. As I mention in my post, implementing additional protocols is done by adding some hooks in the IPv4 (or IPv6) call flow. As a consequence, these hooks add some processing time. The numbers I gave refer to a complete IPv4 implementation (with all hooks). So if we take an example of VLAN untag + IPv4 + VLAN tag, the performance with the 3 protocols is the performance of IPv4 forwarding alone divided by 1.43 (1 + 0.19 + 0.24).
If the IPv4 forwarding implementation has not been designed to correctly stack protocols, adding new protocols will have an impact on forwarding. The incremental approach is not valid anymore because of two issues: IPv4 forwarding will take more than “1” and the other costs for VLAN aren’t meaningful anymore because they cannot be compared with the initial value.
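The additive formula above can be sketched in a few lines of Python. The weights are the ones listed in the post; the dictionary keys, the function names and the 10 Mpps base rate used in the example are hypothetical and only serve the illustration.

```python
# Relative Fast Path costs from the post (IPv4 forwarding = 1).
RELATIVE_COST = {
    "ipv4_forwarding": 1.0,
    "vlan_tagging": 0.19,
    "vlan_untagging": 0.24,
    "ipip_encap": 0.35,
    "ipip_decap": 0.3,
    "ipsec_encap_aes_sha1": 2.1,
    "ipsec_decap_aes_sha1": 1.9,
    "firewall_nat": 0.5,
}


def chain_cost(features):
    """Total relative cost of a protocol chain: the plain sum of its costs."""
    return sum(RELATIVE_COST[name] for name in features)


def chain_mpps(ipv4_only_mpps, features):
    """Expected packet rate of a chain, given the IPv4-forwarding-only rate."""
    return ipv4_only_mpps / chain_cost(features)


if __name__ == "__main__":
    vlan_chain = ["vlan_untagging", "ipv4_forwarding", "vlan_tagging"]
    print(round(chain_cost(vlan_chain), 2))        # divisor for the VLAN example
    print(round(chain_mpps(10.0, vlan_chain), 2))  # 10.0 Mpps is a made-up base rate
```

This only holds if the forwarding implementation was designed to stack protocols, as explained above; otherwise the "1" for forwarding is no longer a stable unit.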
Eric,
Thank you for the better explanation. Let us take one example so that readers can understand the reasons behind these numbers. We will take the relatively simple case of crypto decapsulation followed by IPv4 forwarding. Based on your formula, the performance will be the performance of IPv4 forwarding divided by 2.9 (1 + 1.9), so about a third of plain IPv4 forwarding performance, and that is with decryption done in specialized hardware.
There is, however, a feeling, or probably a misconception, that the overhead of IPsec tunnels is not very high with hardware crypto engines.
The usual flow that is claimed is the following:
1) Extract security association and send the packet to crypto block for decryption;
2) Meanwhile do something else or process another packet until the decryption result is ready;
3) Receive the decrypted result and perform IPv4 forwarding.
Could you please clarify the processing sequence and point out the reasons for the almost 3x performance degradation?
Alex,
You are doing the right computation for IPsec; you got the point about the formula. So, as you understand, we must not sum percentages, but sum those costs. Percentages cannot be summed because they are relative numbers.
Of the 3x degradation, the “1” part is incompressible. Let’s review the “2” part (1.9 in fact).
For the “1.9” related to IPsec Decap, we have to break it down into the full set of processing steps, which is:
a- extract ESP/AH header
b- SADB lookup per VRF
c- check anti-replay window
d- remove IP/[UDP]/IPsec header of the decapsulated packet
e- call crypto engines (inline to the core or external to the core)
f- update IPsec stats of the SAs for volume based expiration and for the MIB
g- update the anti-replay window for HA+NSF (Non Stop Forwarding) needs
h- check and remove the padding
i- check the IPv4 header of the decapsulated packet, including its IPv4 checksum (HW engines cannot check this checksum because the packet was encrypted) and its DSCP/TOS.
j- check IPsec cross VR
k- do a policy check of the decapsulated packet
l- reset the HW flags of the packet buffer because the HW packet parsing engines were applied only to the encrypted packet
m- do a forwarding, including the update of the TTL (please note that the first forwarding of cost “1” was related to the outer header of the encrypted packet, and this 2nd forwarding is much less costly since it does not have to include any IO).
1.9 = a+b+c+d+e+f+g+h+i+j+k+l+m
Our finding is that the design must remain linear: if you add processing, it must not degrade this a+…+m. It must only become a+…+m+n, where n is your new processing. This seems obvious when you think in per-packet cycle counts, but since those cycles are spent in function calls, it was not so obvious in practice. Moreover, those cycle counts depend on the CPU (some can issue one instruction per clock, others two, etc.), so a linear, relative cost is required. I hope we are defining the right relative and comparable unit. TBC?!
Eric’s comment states that (1) the cost of a, …, m MUST remain constant whatever you add, even when IPsec is performed: for instance, if you protect L2TP with IPsec tunnels, IPsec’s cost must remain the same, 1.9; and that (2) one must identify all the costs a to m so that a dataplane can be compared to other dataplanes by just removing (‘-’ instead of ‘+’) the costs of the steps that a dataplane does not implement.
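The two requirements above (existing steps keep their cost when a step is added, and dataplanes are compared by subtracting the steps one of them lacks) can be expressed as a small check. This is only a sketch: the function names are mine, and the cycle counts in the example are HYPOTHETICAL placeholders, not measured values for steps a to m.

```python
from typing import Dict, Iterable


def total_cost(steps: Dict[str, int]) -> int:
    """Per-packet cost of a processing path: the plain sum of its steps."""
    return sum(steps.values())


def stays_linear(before: Dict[str, int], after: Dict[str, int]) -> bool:
    """True if every step measured in `before` costs the same in `after`.

    New steps may appear in `after`; existing ones must keep their cost.
    """
    return all(after.get(name) == cost for name, cost in before.items())


def cost_without(steps: Dict[str, int], missing: Iterable[str]) -> int:
    """Cost of the path once the steps a dataplane does not implement are removed."""
    skipped = set(missing)
    return sum(cost for name, cost in steps.items() if name not in skipped)


if __name__ == "__main__":
    # HYPOTHETICAL cycle counts for three steps, for illustration only.
    decap = {"a": 20, "b": 60, "c": 15}
    decap_plus_n = dict(decap, n=30)            # adding step n, others untouched
    print(stays_linear(decap, decap_plus_n))
    print(total_cost(decap_plus_n) - total_cost(decap))
    print(cost_without(decap, ["c"]))           # comparable cost without step c
```

Integer cycle counts are used here so that the sums are exact; with relative costs such as 1.9, the comparison would need a tolerance.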
However, we still have to measure the values of a, …, m. For “e”, there are multiple combinations of (crypto algorithm) x (authentication algorithm) x (security engine inline to the CPU or external to the core). The “e” included in the 1.9 is only for AES x SHA1 x inline CPU security engine. There are 24 different “e” values to measure for the complete set of combinations (AES, DES, 3DES) x (SHA1, SHA2, MD5, AES-XCBC) x (2).
Note that in the case of the 6WINDGate Fast Path, we have already optimized for “asynchronous” processing on any CPU, so we try to keep the execution units of the cores busy while the security engines are working.
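The 24 combinations can be enumerated mechanically. The short sketch below does so and prepares an empty table to be filled with measurements. The labels "inline" and "external" are my shorthand for the two engine placements, and no "e" value is filled in because none is given here.

```python
from itertools import product

CIPHERS = ("AES", "DES", "3DES")
AUTHS = ("SHA1", "SHA2", "MD5", "AES-XCBC")
ENGINES = ("inline", "external")  # shorthand: inline to the core / external to the core

# Every (cipher, authentication, engine) triple that needs its own "e" measurement.
COMBINATIONS = list(product(CIPHERS, AUTHS, ENGINES))

# Measurement table to fill in; None means "not measured yet".
e_cost = {combo: None for combo in COMBINATIONS}

if __name__ == "__main__":
    print(len(COMBINATIONS))
    for cipher, auth, engine in COMBINATIONS:
        print(cipher, auth, engine)
```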
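As a toy illustration of the asynchronous idea (submit a packet to the security engine, work on another packet in the meantime, then forward once the result is back), here is a sketch using Python's asyncio. It is not how 6WINDGate is implemented: the engine is simulated with a sleep, packets are plain strings, and all names are hypothetical.

```python
import asyncio


async def crypto_engine_decrypt(packet):
    """Stand-in for a hardware security engine working in the background."""
    await asyncio.sleep(0.01)           # the core is free during this wait
    return packet.replace("enc:", "", 1)


async def process(packet, log):
    log.append(("submit", packet))      # 1) look up the SA, hand the packet to the engine
    clear = await crypto_engine_decrypt(packet)  # 2) core handles other packets meanwhile
    log.append(("forward", clear))      # 3) result is back: do the IPv4 forwarding
    return clear


async def fast_path(packets):
    """Process all packets concurrently; results keep the input order."""
    log = []
    results = await asyncio.gather(*(process(p, log) for p in packets))
    return list(results), log


if __name__ == "__main__":
    forwarded, events = asyncio.run(fast_path(["enc:p1", "enc:p2"]))
    print(forwarded)
    print(events)
```

In the event log, the second packet is submitted before the first one comes back from the engine, which is the overlap being described. The steps before and after the engine call still cost core cycles, which is why the hardware engine does not make decapsulation free.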
Vincent,
This was a perfect response, thank you. The only thing I wanted to add is that in some newer architectures (at least based on vendor announcements) it is possible to send the decrypted packet back to hardware classification instead of sending it to the core. This feature can actually save some processing. Some processors do not have any hardware classification blocks and do everything in software, so the overhead in these systems will probably be even higher.
I just wanted readers to understand three important points:
1. A hardware crypto engine doesn’t mean that IPsec decryption is free, and I’ve seen some wrong interpretations of this.
2. The proposed numbers were measured on a particular hardware architecture, and they can be lower or higher depending on the selected processor and architecture.
3. Selection of platform software, including IP and IPsec stacks, optimized for the SoC (not only the ISA, such as x86, PPC, MIPS or ARM) is very important to achieve maximum performance.