By Mark Litvack, Sr. Director Business Development, NetLogic Microsystems
Overview
To meet the needs of next-generation systems and applications while continuing to improve performance, new processor designs must incorporate innovative architectural concepts. Multithreading is one of the most efficient techniques for enabling higher performance in this new class of processors. This article describes the advantages of multithreading and the flexibility its implementation offers software developers. The simplest programming approach is the parallel method, in which independent packets are processed simultaneously on different threads. The parallel approach minimizes the software impact on existing single-image code, and processor threading hides the memory latencies caused by cache misses during table searches.
Traditional Processor Design
Many commercially available integrated embedded system processors shipping today use a single-threaded architecture, which by today’s standards is limited in both performance and range of applications. As applications become more and more network-centric, this legacy design approach fails to address the throughput requirements of today’s converging compute and networking paradigm. This evolving packet-oriented environment is characterized by high memory access latencies, which conventional processor architectures do not manage effectively. This weakness can severely impact processor performance and workload efficiency. When a memory access cannot be serviced immediately and no additional instructions are ready to be executed, conventional processors stall and waste valuable processing cycles. In the following section, a simple graphical simulation is used to illustrate the problems with current designs and the effectiveness of an intelligent multithreaded approach.
Single-Thread Processors
Figure 1 illustrates typical cycle use and inefficiencies in a single-threaded processor. In this example, four packets, each represented by a unique color and labeled “Px”, are processed in order. As cache misses naturally occur, the resulting memory access latencies cause wasted processing cycles. The total time to process the workload in this illustration is 103 cycles, of which 48 cycles are spent on useful work, a pipeline utilization of about 47%.

Figure 1: Single-Thread Throughput
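The utilization quoted for the single-threaded example is simple arithmetic on the two cycle counts given in the text. A minimal Python sketch of that calculation follows; the variable names are illustrative and do not come from the article.

```python
# Cycle counts quoted in the text for the single-threaded example.
total_cycles = 103   # cycles needed to process all four packets
useful_cycles = 48   # cycles in which the pipeline does useful work

# Everything that is not useful work is time lost waiting on memory.
stalled_cycles = total_cycles - useful_cycles
utilization = useful_cycles / total_cycles

print(f"stalled cycles: {stalled_cycles}")
print(f"pipeline utilization: {utilization:.1%}")
```

More than half of the cycles in this example are spent waiting, which is the headroom a multithreaded design tries to recover.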
Pre-fetching of data may help reduce wasted cycles. However, useful pre-fetching assumes that there are other independent instructions that can be issued and processed while memory accesses are completed. This is not often the case. In addition, some code is simply not amenable to pre-fetching.
The Multithreaded Approach
Adding more simple CPU cores and/or increasing a processor’s superscalar width (greater rates of multi-issue) has obvious limitations in the packetized environment. The inefficiencies illustrated in the above example are much more effectively addressed by a multithreaded architecture, which aims to increase the total work completed in a given amount of time. This approach more fully exploits the packet-level parallelism commonly found in today’s converging compute and networking applications. Memory latencies can be effectively mitigated by a well-designed multithreaded processor, thus dramatically improving overall throughput. With this design, when one thread becomes inactive while waiting for memory data to return, other threads can continue to process instructions. This makes the most of processor resources by minimizing or even completely eliminating the wasted cycles inherent in conventional processors.
This workload efficiency improvement is shown in Figure 2. Note that in this 4-way threaded example, each packet is operated on every fourth cycle. In the clock cycle immediately following the one used for packet 1 (yellow), thread 2 can begin operating on packet 2 (blue). Likewise, thread 3 (green) can immediately follow thread 2, and thread 4 (red) can immediately follow thread 3. This process continues on a round-robin basis among the 4 separate threads and their corresponding packets. The benefit is most apparent in the event of a cache miss: useful work can be applied to other independent packets while the stalled thread waits for memory. Here, the memory latencies are effectively hidden and workload efficiency is highly optimized.

Figure 2: Multithreaded Throughput
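The round-robin behavior described for the 4-way threaded example can be approximated with a small cycle-level model. The Python sketch below is an illustration only, not the processor's actual scheduler: it assumes one instruction issues per cycle, that a stalled thread is simply skipped, and that each packet is a list of hypothetical (work cycles, stall cycles) segments. The segment values are made up for the example and are not taken from the figures.

```python
from collections import deque

def expand(packet):
    """Turn (work, stall) segments into one entry per work cycle.

    Each entry is the memory stall that follows that cycle (0 for most).
    Assumes every segment has at least one work cycle.
    """
    ops = []
    for work, stall in packet:
        ops.extend([0] * (work - 1) + [stall])
    return ops

def single_thread_cycles(packets):
    """Packets run back to back; every stall idles the whole pipeline."""
    return sum(work + stall for packet in packets for work, stall in packet)

def multithreaded_cycles(packets):
    """One packet per hardware thread, issuing round robin from ready threads."""
    ops = [deque(expand(p)) for p in packets]
    ready_at = [0] * len(ops)   # first cycle each thread may issue again
    cycle = 0
    finish = 0                  # cycle at which the last operation completes
    nxt = 0                     # thread that gets first chance this cycle
    while any(ops):
        for k in range(len(ops)):
            t = (nxt + k) % len(ops)
            if ops[t] and ready_at[t] <= cycle:
                stall = ops[t].popleft()
                ready_at[t] = cycle + 1 + stall   # thread waits out its stall
                finish = max(finish, ready_at[t])
                nxt = (t + 1) % len(ops)
                break
        cycle += 1              # a cycle with no ready thread is wasted
    return finish

# Hypothetical workload: four identical packets, each with two cache misses.
packets = [[(3, 10), (3, 10), (3, 0)] for _ in range(4)]
useful = sum(work for packet in packets for work, _ in packet)

for name, cycles in [("single thread", single_thread_cycles(packets)),
                     ("4-way multithreaded", multithreaded_cycles(packets))]:
    print(f"{name}: {cycles} cycles, utilization {useful / cycles:.0%}")
```

Because all four packets in this sketch miss at the same points, some idle cycles remain even with four threads; with staggered misses the model hides more of the latency.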
The example in Figure 2 demonstrates the tremendous advantage of multithreaded architectures. A multithreaded processor is capable of achieving outstanding performance without a software paradigm change. If desired, software need not have any knowledge of threads and can simply treat the system as individual processors connected coherently in a shared memory environment. For example, an operating system such as Linux requires no software changes to run as a multi-way SMP operating system on a multithreaded processor. This allows software engineers to develop code while leveraging the same standard tool chains and operating systems used in past development efforts.
Figure 2 clearly demonstrates the multicore advantage. If there is one core running the 4 tasks as indicated by the 4 colors, the overall throughput is the lower bar in Figure 2. If there is one core with multi-threading, the best-case throughput is the upper bar in Figure 2.
If there are four cores running these 4 tasks, the throughput is more than 2x that of the multi-threading case. In Figure 2, you can visualize it by looking at just the yellow boxes of the lower bar. There would be four bars, corresponding to four cores each running one of the tasks (colors). In this 4-core scenario, the length of each bar is less than half that of the multi-threading (upper) bar.
This shows that an actual physical core always provides more hardware resources and performance than having multiple threads share one physical core and compete for resources.
You will find another point of view regarding multi-threading here.
With all due respect, Mr. Liu, your analysis favorably assumes that you can speed up the processing just by throwing more cores at it, or that latency is avoided just because you parallelized a solution.
The figures, while not drawn to scale, appear to show linearly progressing time where latency is non-zero. If a packet requires even just a single memory reference, it still takes X cycles for it to complete from DRAM, whether you throw 4 cores, 8 cores, or even 100 cores at it.
The second presumption you’re making is that the 4 yellow bars are inherently parallelizable across 4 cores. While you can often refactor or decompose a problem domain into parallel contexts, that is not always possible. This is the Mythical Man-Month problem: if Y1, Y2, Y3 and Y4 represent the yellow bars above, they still have the latency associated with their corresponding memory accesses, and it is NEVER zero. If they are stalled, perhaps some other context (say Violet V1, V2, V3 …) can consume the pipeline efficiently.
To me, multi-threading is just one of the evolving architectural choices that can provide benefits alongside multi-core. In the past ten years, given that no single micro-architecture advance has scaled in step with Moore’s law, the natural tendency seems to be to exploit multiple architectural choices simultaneously (e.g. smaller geometry, larger and multi-level caches, deeper pipelines, multi-core, and of course, multi-threading), and this is a developing trend.
The original article appears to be written in the context of a single core within a multi-core processing complex. Numerous processor vendors have realized the potential benefits of offering multi-threading along with multi-core, although I read your comment as defensive, possibly because your micro-architecture does not have such a feature.
-TG
TG, I believe you misread the above post. I realize this is old and the original posters won’t read this, but TG appears to be very knowledgeable, so people will tend to agree with his statement. However, he incorrectly interprets Kin’s statement as breaking single threads up across four cores.
If that were the case, he would be correct. Actually, all Kin is saying is that each core handles a single thread in its entirety. When he says “colored bar” he doesn’t mean each segment of color; he means the whole bar that represents the timeline of a single core. In Figure 2 there are 2 bars in total, the top and the bottom. With four cores, there would be four parallel bars for a given time period: a yellow and white, a blue and white, and so on. And Kin is correct: they would all be shorter than the multithreaded bar, because each handles only a single thread.
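The two positions in this thread can be put side by side with a rough Python sketch. It assumes four hypothetical packets described as (work cycles, stall cycles) segments; the numbers are invented for illustration and are not read off Figure 2. It compares one single-threaded core handling all four packets with four single-threaded cores handling one packet each, and it reports how many stall cycles remain on each core's timeline.

```python
# Hypothetical packets: each is a list of (work_cycles, stall_cycles) segments.
packets = [[(3, 10), (3, 10), (3, 0)] for _ in range(4)]

def packet_cycles(packet):
    """Cycles a single-threaded core needs for one packet, stalls included."""
    return sum(work + stall for work, stall in packet)

def stall_cycles(packet):
    """Cycles the core sits idle waiting on memory for this packet."""
    return sum(stall for _, stall in packet)

# One core: packets are processed back to back, so the times add up.
one_core = sum(packet_cycles(p) for p in packets)

# Four cores: one packet per core in parallel, so the longest packet decides.
four_cores = max(packet_cycles(p) for p in packets)

# Stall cycles that remain on each core's own timeline.
idle_per_core = [stall_cycles(p) for p in packets]

print(f"one core, single thread: {one_core} cycles")
print(f"four cores, one packet each: {four_cores} cycles")
print(f"idle cycles left on each core: {idle_per_core}")
```

Under these assumptions both comments hold at once: four cores finish much sooner than one, yet each core still spends most of its timeline stalled on memory, which is the time that threading within a core could fill with other work.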