By Alexander Bachmutsky – Chief Architect at Nokia Siemens Networks
This post continues the discussion started by Mark Litvack from Netlogic regarding hardware multithreading (refer to this post).
First of all, I would like to clarify that this post does not mean that the referenced article is in any way incorrect. It is correct, and I agree with it fully. However, it gives the impression that hardware multithreading has only advantages and that multiple threads are equal to multiple cores. This article tries to give a more balanced view of the feature.
While theoretically the second thread can almost double the performance of the core, this never happens in real applications. Our measurements for telecommunication applications were consistent with other similar measurements: in the very best case, for some applications, the second thread improved performance by at most 40%, and the third and fourth threads added 8%-10%. We can expect additional hardware threads to have even less impact. On the other hand, most of the implementation overhead of a multithreaded processor lies in going from a single-threaded to a dual-threaded core, so depending on the overhead of the 3rd and 4th threads, their smaller performance improvements could still be justifiable.
At the same time, it has to be clear that hardware multithreading does not come for free: it complicates the core and takes more real estate on the die. If we take the die size as a constant, not having multithreaded cores allows either more cores on the same die, or more complex cores with improved performance, or additional hardware offload blocks for other functions. A simplified view is that if multithreading increases the core size by 10%, we have to reduce the above benefits correspondingly, so instead of a 40% improvement we actually have only 30% in our particular example. The real-world picture is more complex and is not a 1:1 tradeoff, but the overall message is that this pure hardware implementation cost has to be taken into account.
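The simplified arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the second formula (throughput per unit of die area) is only one possible refinement of the "not 1:1" remark, assumed here for illustration; it is not a measured model.

```python
def net_gain_subtractive(thread_gain, area_overhead):
    """Simplified view from the text: subtract the area cost from the gain."""
    return thread_gain - area_overhead


def net_gain_per_area(thread_gain, area_overhead):
    """Assumed refinement: throughput normalized by the larger core area."""
    return (1.0 + thread_gain) / (1.0 + area_overhead) - 1.0


if __name__ == "__main__":
    gain = 0.40      # best-case second-thread improvement quoted in the text
    overhead = 0.10  # example core-size increase quoted in the text
    print(f"subtractive view: {net_gain_subtractive(gain, overhead):.1%}")  # 30.0%
    print(f"per-area view:    {net_gain_per_area(gain, overhead):.1%}")     # 27.3%
```

Under these assumptions the two views differ by only a few percentage points, but both show that part of the headline gain is consumed by the larger core.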
Another important item to understand is the difference between a second thread and an additional core. It is true that from the Symmetric Multi-Processing software point of view, such as Linux SMP, they look the same. However, the developer has to take into account that multiple hardware threads on the same core share the common instruction and data L1 caches, and also the L2 cache if the core architecture has a dedicated L2 cache (in other architectures the L2 cache is shared between a group of cores or even all cores on the chip). Cache sharing can have a significant impact, because in most cases the hardware threads are running different software threads with different instructions and different data. If the L1 cache is fully dynamic, one thread can occupy most of it, which means that a switch to the next thread will start with instruction and/or data misses, causing significant performance degradation and practically eliminating the multithreading benefits. One potential improvement is to have the hardware load the L1 caches automatically when switching to another thread (I am not aware of this functionality in any existing processor), but cache line loads take time, so the thread switch becomes longer and eats into the performance improvement.
The usual solution to the problem is either to divide the cache statically between all threads, or at least to statically configure a minimum amount of cache guaranteed for every thread, with the rest of the cache available for dynamic use by all threads. This method gives much more predictable results, but it also has one significant disadvantage: each thread has less L1 cache available to it. It is important to emphasize that Linux, for example, is very sensitive to L1 cache size (especially the instruction cache, but the data cache can also become a bottleneck when Linux is dealing with a large number of sessions, such as TCP sessions), and making this cache smaller immediately affects performance.
Therefore, one of the important parameters is always the amount of cache per execution thread.
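To make the "cache per execution thread" parameter concrete, here is a minimal sketch of the two partitioning schemes described above. The 32 KiB L1 size, the 4 KiB guarantee, and the function names are hypothetical values chosen for illustration, not figures from any specific processor.

```python
def static_split_kib(l1_kib, threads):
    """Fully static division: every thread gets an equal fixed share."""
    return l1_kib / threads


def guaranteed_plus_shared_kib(l1_kib, threads, guaranteed_kib):
    """Guaranteed minimum per thread, remainder shared dynamically.

    Returns (worst_case, best_case) cache available to a single thread.
    """
    reserved = guaranteed_kib * threads
    if reserved > l1_kib:
        raise ValueError("guaranteed shares exceed the total cache size")
    shared_pool = l1_kib - reserved
    return guaranteed_kib, guaranteed_kib + shared_pool


if __name__ == "__main__":
    L1_KIB = 32   # hypothetical L1 size
    THREADS = 4   # hypothetical number of hardware threads per core
    print(static_split_kib(L1_KIB, THREADS))               # 8.0
    print(guaranteed_plus_shared_kib(L1_KIB, THREADS, 4))  # (4, 20)
```

In this hypothetical configuration a thread that would see 32 KiB on a single-threaded core is left with 8 KiB under static division, or between 4 and 20 KiB with a guaranteed minimum.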
Let us also differentiate between multiple Linux threads of the same process and multiple Linux processes running on multiple hardware threads. Multiple processes can be even worse than multiple threads, because separate processes frequently use separate memory blocks, which puts stress on one more shared resource: the TLB. This becomes especially visible when multiple operating system instances or different operating systems are using multiple hardware threads, with or without a hypervisor. It is true that from the software point of view the hypervisor does not have to differentiate between multiple threads and multiple cores, but practice shows that it is not the brightest idea to run multiple virtual machines on different hardware threads of the same core.
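Because Linux SMP presents hardware threads and cores as the same kind of logical CPU, it helps to find out which logical CPUs are actually siblings on one core before placing processes or virtual machines. The sketch below assumes a Linux system that exposes `topology/thread_siblings_list` under `/sys/devices/system/cpu`; the helper names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-3' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return one tuple per physical core listing its hardware threads."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in sibling_groups():
        kind = "shares a core" if len(group) > 1 else "has a core to itself"
        print(group, kind)
```

Logical CPUs that appear in the same group share the L1 caches and TLB discussed above, so a scheduler or administrator may prefer to spread unrelated processes across different groups, for example with CPU affinity settings.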
There are, of course, other parameters to check when considering multithreaded cores. Examples include the pipeline implementation and the thread switching time. To be clear, not all implementations are equal, and some will provide more benefits than others.
One more parameter in this discussion is the operating system used and the application itself. If we take as an example a simple executive type of OS, which is usually more applicable to data plane applications, with small run-to-completion code that fully fits in the caches, the multithreading benefits will be close to nothing. Of course, in real applications this is rarely the case, but it is correct to say that the multithreading impact for the control and management planes is much higher than for the data plane.
To summarize this article, hardware multithreading can indeed bring some benefits; they will be more significant for some applications and less for others. It is highly recommended that system architects test their particular scenario when selecting between multi-threaded and single-threaded approaches.