By Alexander Bachmutsky – Chief Architect at Nokia Siemens Networks

This post continues the discussion of hardware multithreading started by Mark Litvack from Netlogic (see his post).

First of all, I would like to clarify that this post does not imply that the referenced article is in any way incorrect. It is correct, and I agree with it fully. However, it gives the impression that hardware multithreading has only advantages and that multiple threads are equivalent to multiple cores. This article tries to offer a more balanced view of the feature.

While in theory a second thread can almost double the performance of a core, this never happens in real applications. Our measurements for telecommunication applications were consistent with other similar measurements: in the very best case, for some applications, the second thread improved performance by at most 40%, and the third and fourth threads by 8%-10%. We can expect additional hardware threads to have even less impact. On the other hand, most of the implementation overhead of a multithreaded processor lies in the step from a single-threaded to a dual-threaded core, so depending on the overhead of the third and fourth threads, their smaller performance improvements could still be justifiable.
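To make the diminishing returns concrete, here is a minimal Python sketch of the best-case figures quoted above. It rests on assumptions that are not in the measurements themselves: the gains are treated as additive relative to the single-thread baseline, and the upper 10% figure is applied to each of the third and fourth threads. The function names are hypothetical.

```python
def cumulative_throughput(gains):
    """Throughput relative to one thread after adding each extra thread.

    `gains` holds the additive gain of the 2nd, 3rd, ... thread as a
    fraction of single-thread throughput (an assumption of this sketch).
    """
    throughput = [1.0]
    for gain in gains:
        throughput.append(throughput[-1] + gain)
    return throughput


def scaling_efficiency(throughput):
    """Fraction of ideal linear scaling achieved with 1, 2, ... threads."""
    return [t / (i + 1) for i, t in enumerate(throughput)]


if __name__ == "__main__":
    # Best-case figures from the text: +40% for the second thread,
    # +10% assumed for each of the third and fourth threads.
    best_case = cumulative_throughput([0.40, 0.10, 0.10])
    efficiency = scaling_efficiency(best_case)
    for n, (t, e) in enumerate(zip(best_case, efficiency), start=1):
        print(f"{n} thread(s): {t:.2f}x throughput, {e:.0%} of linear scaling")
```

Under these assumptions, four threads deliver roughly 1.6 times the single-thread throughput, which is about 40% of what four independent cores would ideally provide.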

At the same time, it has to be clear that hardware multithreading does not come for free: it complicates the core and takes more real estate on the die. If we treat the die size as a constant, not having multithreaded cores allows either more cores on the same die, or more complex cores with improved performance, or additional hardware offload blocks for other functions. A simplified view is that if multithreading increases the core size by 10%, we have to reduce the above benefits correspondingly, so instead of a 40% improvement we actually have only 30% in our particular example. The real-world picture is more complex, and the tradeoff is not 1:1, but the overall message is that this pure hardware implementation aspect has to be taken into account.
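The arithmetic behind this tradeoff can be sketched in a few lines of Python. The first function is the simplified subtraction used in the text. The second is one possible alternative and is purely an assumption of this sketch: it supposes that the die area spent on multithreading could otherwise have been used for proportionally more single-threaded cores, and compares performance per unit of area. Both function names are hypothetical.

```python
def net_gain_simplified(perf_gain, area_overhead):
    """Simplified 1:1 view from the text: subtract the area overhead."""
    return perf_gain - area_overhead


def net_gain_per_area(perf_gain, area_overhead):
    """Gain in performance per unit of die area.

    Assumes (for illustration only) that throughput scales linearly
    with the area spent on cores, so a 10% larger core competes with
    10% more single-threaded cores.
    """
    return (1.0 + perf_gain) / (1.0 + area_overhead) - 1.0


if __name__ == "__main__":
    gain, overhead = 0.40, 0.10  # the figures used in the example above
    print(f"simplified view: {net_gain_simplified(gain, overhead):.1%}")
    print(f"per-area view:   {net_gain_per_area(gain, overhead):.1%}")
```

With the 40% and 10% figures, the simplified view gives 30% and the per-area view gives roughly 27%, which illustrates why the tradeoff is not exactly 1:1.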

Another important item to understand is the difference between a second thread and an additional core. It is true that from the point of view of Symmetric Multi-Processing software, such as Linux SMP, they look the same. However, the developer has to take into account that multiple hardware threads on the same core share the common L1 instruction and data caches, and also the L2 cache if the core architecture has a dedicated L2 cache (in other architectures the L2 cache is shared between a group of cores or even all cores on the chip). Cache sharing can have a significant impact, because in most cases the hardware threads are running different software threads with different instructions and different data. If the L1 cache is fully dynamic, one thread can occupy most of it, which means that after a switch the next thread starts with instruction and/or data misses, causing significant performance degradation and practically wiping out all multithreading benefits. One potential improvement is for the hardware to load the L1 caches automatically when switching to another thread (I am not aware of this functionality in any existing processor), but cache line loads take time, so the thread switch becomes longer and eats into the performance improvement.

The usual solution to the problem is either to divide the cache statically between all threads, or at least to statically configure a minimum amount of cache guaranteed for every thread, with the rest of the cache available for dynamic use by all threads. This method gives much more predictable results, but it also has one significant disadvantage: each thread has less L1 cache available to it. It is important to emphasize that Linux, for example, is very sensitive to L1 cache size (especially the instruction cache, but the data cache can also become a bottleneck when Linux is dealing with a large number of sessions, such as TCP sessions), and making this cache smaller immediately affects performance. Therefore, one of the important parameters is always the amount of cache per execution thread.
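The two partitioning schemes described above can be compared with a small Python sketch that computes the amount of L1 cache each thread can count on. The cache size, thread count and guaranteed minimum used in the example are hypothetical values chosen for illustration, not figures for any particular processor, and the function names are likewise invented.

```python
def static_split(cache_kb, threads):
    """Fully static partitioning: every thread gets an equal fixed share."""
    if threads < 1:
        raise ValueError("need at least one thread")
    return cache_kb / threads


def guaranteed_plus_shared(cache_kb, threads, guaranteed_kb):
    """Guaranteed minimum per thread plus a dynamically shared pool.

    Returns (minimum, maximum) cache in KB that one thread can use:
    the minimum is its guaranteed share, the maximum is that share
    plus the whole shared pool.
    """
    if threads < 1:
        raise ValueError("need at least one thread")
    reserved = guaranteed_kb * threads
    if reserved > cache_kb:
        raise ValueError("guaranteed shares exceed the cache size")
    shared_pool = cache_kb - reserved
    return guaranteed_kb, guaranteed_kb + shared_pool


if __name__ == "__main__":
    # Hypothetical core: 32 KB L1 cache shared by 4 hardware threads.
    print("static split:", static_split(32, 4), "KB per thread")
    low, high = guaranteed_plus_shared(32, 4, guaranteed_kb=4)
    print(f"guaranteed + shared: {low} to {high} KB per thread")
```

In this hypothetical configuration a static split leaves each thread 8 KB, while a 4 KB guarantee lets a thread use anywhere between 4 KB and 20 KB depending on what the other threads are doing. Either way, the figure to compare against a single-threaded core is the cache available per execution thread.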

Let us also differentiate between multiple Linux threads of the same process and multiple Linux processes running on multiple hardware threads. Multiple processes can be even worse than multiple threads, because processes frequently have separate memory regions, which puts stress on one more shared resource: the TLB. This becomes especially visible when multiple operating system instances or different operating systems are using multiple threads, with or without a hypervisor. It is true that from the software point of view a hypervisor does not have to differentiate between multiple threads and multiple cores, but practice shows that it is not the brightest idea to run multiple virtual machines on different execution threads of the same core.

There are, of course, other parameters to check when considering multithreaded cores, such as the pipeline implementation and the thread switching time. To be clear, not all implementations are equal, and some will provide more benefits than others.

One more parameter in this discussion is the operating system used and the application itself. Take as an example a simple executive type of OS, which is usually more applicable to data plane applications, with small run-to-completion code that fully fits in the caches: the multithreading benefits will be close to nothing. Of course, this is rarely the case in real applications, but it is a correct statement that the impact of multithreading on the control and management planes is much higher than on the data plane.
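Why code that fits in the caches gains little from extra hardware threads can be illustrated with a deliberately simple latency-hiding model in Python. The model is an assumption of this sketch, not something measured in this article: each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles (for example, cache misses), and the core can run another thread while one is stalled. It ignores the cache contention and switching costs discussed earlier, so it gives an optimistic upper bound. The function names and cycle counts are hypothetical.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which the core does useful work.

    Idealized model: each thread computes for `compute_cycles`, then
    stalls for `stall_cycles`; other threads fill the stall time.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("invalid model parameters")
    busy = threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy)


def multithreading_speedup(threads, compute_cycles, stall_cycles):
    """Upper-bound speedup of `threads` hardware threads over one thread."""
    single = core_utilization(1, compute_cycles, stall_cycles)
    return core_utilization(threads, compute_cycles, stall_cycles) / single


if __name__ == "__main__":
    # Code that fits in the caches: no stalls, nothing left to hide.
    print("no stalls:   ", multithreading_speedup(2, 100, 0))
    # Stall-heavy code: half of the cycles are spent waiting.
    print("heavy stalls:", multithreading_speedup(2, 100, 100))
```

When there are no stalls the model reports no speedup at all, which matches the run-to-completion case described above. When half of the cycles are stalls it reports the theoretical doubling, which real applications do not reach because the threads also compete for the shared caches.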

To summarize this article, hardware multithreading can indeed bring some benefits; they will be more significant for some applications and less for others. It is highly recommended that system architects test their particular scenario when selecting between multi-threaded and single-threaded approaches.

4 Responses to “Disadvantages of Multi-Threading in Next-Generation Multicore Processors”

  • Kin-Yip Liu:

    One additional aspect to consider when developing applications on multi-threaded cores is performance and latency determinism. The effect of threads competing for the shared per-core caches, and the cache pollution that this post has mentioned, means that the performance and latency of the tasks that a thread executes are much less deterministic than when only one thread executes on an entire core. In the latter case, the thread owns all the resources that the core has to offer.

    There are some other factors which reduce deterministic performance with multi-threading. First, if the processor hardware decides when to switch threads, then the software developer does not control when a thread is executed, or for how long, before the hardware switches execution to another thread. Second, even if the hardware tries to run all the threads at the same time, these threads also compete for the same execution units. It is not always clear to the software developer how the hardware allocates the execution units among the multiple threads being executed. As a result, performance determinism suffers.

    Performance determinism is an important performance attribute for packet processing. Throughput is not the only important factor.

  • Tatiana Griffin:

    The two respondents above make a sparring analysis as to why “multi-threading” is disadvantageous compared to “multi-core” or, vice versa, why multi-core is “better” than multi-threading. Mr. Litvack’s original article appears to be about concurrent and simultaneous utilization of BOTH multi-core and multi-threading. It does not preclude multi-core, and in fact the title states it clearly: “multi-threading in [next-generation] multi-core processors.”

    All multi-threaded processors in fact support threading on a multi-core architecture: the duality in choice allows native selection and concurrent use of both ILP and TLP. The original article appears to address core performance within a single core, not across multiple cores. Clearly, one could make a collective argument about “Disadvantages of multi-core in next generation packet processing” and that would also be jaded.

    I encourage readers to visit http://www.cs.washington.edu/research/smt/ for an unbiased approach to multi-threading, in the specific context of where it could be a favorable design choice in concordance with multi-core selection.

    Multi-threading is not a panacea and neither is multi-core; multi-threading can be a distinct benefit in a variety of use cases. Many modern core architectures (MIPS-MT, POWER5, UltraSPARC, Nehalem etc.) and many vendors (MIPS, NetLogic, IBM, Sun Microsystems, Intel) support multi-threading simultaneously with multi-core.

    Multi-core is here to stay. And so is multi-threading.

    You need not pick between the two: you can have both, have your cake and eat it too.

    -TG

  • Mark Guinther:

    Alexander makes a good point about the applicability of multiple threads. To generalize, multi-threading can realize valuable performance increases in multi-tasking environments. The more context switches you have, the more benefit you will see from multi-threading. Conversely, the benefits of hyperthreading will decrease with the relative percentage of context switches a core can make. In multicore packet processing, cores can be separated between control plane and data plane functionality. The control plane typically runs a multitasking OS like Linux, where hyperthreading benefits can be demonstrated easily.

    On the data plane, a much simpler executive can be used. In a run-to-completion model an individual core (which I’ll call a Network Acceleration Engine, NAE) can poll for packets (incoming or outgoing), perform the necessary processing, queue for dispatch, and return to the polling state. In this case, multiple threads are not necessarily going to make the NAE run more efficiently. If the NAE is I/O or bus bound, the system performance will be poor regardless of the processing power. Likewise, cache stalls will kill throughput performance, even if a second thread is available to prevent the core from going idle.

    I agree that multithreading is a brilliant and useful concept for most OS environments. But in the particular case of network packet processing, the demands on all cores are not symmetrical.

  • Tatiana Griffin:

    I fail to understand why there is a presupposition that in the data plane you can always complete your processing with 100% (or near 100%) pipeline efficiency. A data plane engine can poll, process, queue and repeat, but there is no presumption that those all have zero waits or no dependencies.

    Let’s take for example just a simple L3 forwarding case: there are two essential lookups, one for the route to yield the next-hop, and another for the link-layer address of the next-hop. When you have thousands of routes in hundreds of VRF contexts, you have no semblance of assured data availability (even if you were to implement some intelligent pre-fetching).

    There is a misconception (which befuddles me) that multi-threading only benefits multi-tasking at an OS level; this is in fact far from the truth. Multi-threading actually benefits packet processing environments in a much larger scale, because you cannot avoid latency specifically in run-to-completion models.

    Conversely, if you presume that there is never any idle cycle, then a 3 GHz core can conceptually outperform 3 individual 1 GHz cores because a 3 GHz core could finish the job in a third of the time.

    Clearly, that is never the case because in packet processing that today involves upwards of 10 lookups/packet with hundreds of thousands, if not millions of entries per table, controlling (or predicting) pipeline behavior on a cycle-by-cycle basis is impossible.

    The one postulate I can agree with, is that the demands on all cores or threads are not symmetrical: this is exactly a source of advantage for a hardware thread with a single core where even if all threads are executing the same instruction stream, they are not executing the exact same instruction (or operating on the same object) at a given cycle.

    You can also surely run a lightweight run-to-completion model on a multi-threaded core, and that is certainly the better way to achieve ideal performance. Linux is horrible as a packet processing framework, so it generally doesn’t make sense to associate it with any performance or throughput related discussion.

    I would be curious to see a multi-core vendor back up the hypothesis that the run-to-completion model somehow allows them to reach 100% pipeline efficiency in packet processing environments.

    -TG
