By Mark Litvack, Sr. Director Business Development, NetLogic Microsystems

Overview
To meet the needs of next-generation systems and applications while continuing to improve performance, new processor designs must incorporate innovative architectural concepts. Multithreading is one of the most efficient techniques for raising the performance of this new class of processors. This article describes the advantages of multithreading and the flexibility its implementation offers software developers. The simplest programming approach is the parallel method, in which independent packets are processed simultaneously on different threads. This approach minimizes changes to existing single-image code, while the processor’s threading hides the memory latencies caused by cache misses during table searches.

Traditional Processor Design
Many commercially available integrated embedded system processors shipping today use a single-threaded architecture, which limits both performance and the range of suitable applications by today’s standards. As applications become more and more network-centric, this legacy design approach fails to address the throughput requirements of today’s converging compute and networking paradigm. This evolving packet-oriented environment is characterized by high memory access latencies, which conventional processor architectures do not manage effectively. This weakness can severely impact processor performance and workload efficiency. When a memory access cannot be serviced immediately and no other instructions are ready to execute, a conventional processor stalls and wastes valuable processing cycles. The following sections use a simple graphical simulation to illustrate the problems with current designs and the effectiveness of an intelligent multithreaded approach.

Single-Thread Processors
Figure 1 illustrates typical cycle use and inefficiencies in a single-threaded processor. In this example, four packets, each represented by a unique color and labeled “Px”, are processed in order. As cache misses occur, the resulting memory access latencies cause wasted processing cycles. The total time to process the workload in this illustration is 103 cycles, of which 48 are spent on useful work, so pipeline utilization is about 47%.
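The utilization figure follows directly from the two cycle counts quoted above. A minimal sketch of the arithmetic, where the function name is an assumption made for illustration:

```python
def pipeline_utilization(useful_cycles: int, total_cycles: int) -> float:
    """Fraction of pipeline cycles spent on useful work."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return useful_cycles / total_cycles


# Cycle counts quoted for the single-threaded example in Figure 1.
TOTAL_CYCLES = 103
USEFUL_CYCLES = 48

utilization = pipeline_utilization(USEFUL_CYCLES, TOTAL_CYCLES)
stalled_cycles = TOTAL_CYCLES - USEFUL_CYCLES  # cycles lost to memory waits

print(f"utilization = {utilization:.1%}, stalled cycles = {stalled_cycles}")
```

More than half of the cycles in this example are spent waiting on memory rather than processing packets.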

Figure 1: Single-thread throughput
Pre-fetching of data may help reduce wasted cycles. However, useful pre-fetching assumes that there are other independent instructions that can be issued and processed while memory accesses are completed. This is not often the case. In addition, some code is simply not amenable to pre-fetching.

The Multithreaded Approach
Adding more simple CPU cores and/or increasing a processor’s superscalar width (greater rates of multi-issue) has obvious limitations in a packetized environment. The inefficiencies illustrated in the example above are addressed much more effectively by a multithreaded architecture, which aims to increase the total work completed in a given amount of time. This approach exploits the packet-level parallelism commonly found in today’s converging compute and networking applications. A well-designed multithreaded processor can effectively mitigate memory latencies, dramatically improving overall throughput. With this design, when one thread becomes inactive while waiting for memory data to return, the other threads continue to process instructions. This makes fuller use of processor resources by minimizing, or even eliminating, the wasted cycles inherent in conventional processors.

This improvement in workload efficiency is shown in Figure 2. Note that in this 4-way threaded example, each packet is operated on every fourth cycle. In the clock cycle immediately following the one used for packet 1 (yellow), thread 2 can begin operating on packet 2 (blue). Likewise, thread 3 (green) can immediately follow thread 2, and thread 4 (red) can immediately follow thread 3. This process continues on a round-robin basis among the four threads and their corresponding packets. The benefit is most apparent in the event of a cache miss: while the affected thread waits for memory, useful work continues on the other, independent packets. Here, memory latencies are effectively hidden and workload efficiency is high.
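The latency-hiding effect can be approximated with a small cycle-level model. The sketch below is an illustration only, not the simulation behind the figures: the workload numbers, the one-issue-per-cycle core, and all names are assumptions.

```python
from collections import deque

# Hypothetical workload: each packet is a list of (compute_cycles, stall_cycles)
# segments, where the stall models a cache miss after the compute burst.
# All numbers are invented; they do not reproduce Figure 1 or Figure 2.
PACKET = [(3, 12), (3, 12), (3, 12), (3, 0)]
PACKETS = [PACKET] * 4


def single_threaded_cycles(packets):
    """Packets run back to back, so every stall cycle idles the pipeline."""
    return sum(compute + stall for pkt in packets for compute, stall in pkt)


def multithreaded_cycles(packets, hw_threads=4):
    """Return (total_cycles, useful_cycles) for a one-issue-per-cycle core
    that picks the next ready hardware thread in round-robin order.
    Assumes every packet is non-empty and every compute burst is >= 1 cycle."""
    pending = deque(packets)
    threads = [None] * hw_threads   # remaining segments per hardware thread
    ready_at = [0] * hw_threads     # first cycle each thread may issue again
    cycle = useful = nxt = 0
    while True:
        for t in range(hw_threads):             # hand packets to free threads
            if threads[t] is None and pending:
                threads[t] = deque([c, s] for c, s in pending.popleft())
        if all(th is None for th in threads):
            break
        for k in range(hw_threads):             # round-robin search
            t = (nxt + k) % hw_threads
            if threads[t] is not None and ready_at[t] <= cycle:
                seg = threads[t][0]
                seg[0] -= 1                     # one useful cycle of compute
                useful += 1
                if seg[0] == 0:                 # burst done, miss goes to memory
                    ready_at[t] = cycle + 1 + seg[1]
                    threads[t].popleft()
                    if not threads[t]:
                        threads[t] = None       # packet finished
                nxt = (t + 1) % hw_threads
                break
        cycle += 1                              # idle if no thread was ready
    return max([cycle] + ready_at), useful


single = single_threaded_cycles(PACKETS)
multi, useful = multithreaded_cycles(PACKETS)
print(f"single-threaded: {single} cycles, utilization {useful / single:.0%}")
print(f"4-way threaded:  {multi} cycles, utilization {useful / multi:.0%}")
```

Because the stalls of different threads overlap in this model, the threaded run finishes the same useful work in fewer total cycles. Setting `hw_threads=1` reduces the model to the single-threaded case.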

Figure 2: Multithreaded throughput
The example in Figure 2 demonstrates the substantial advantage of multithreaded architectures. A multithreaded processor can deliver this performance without a change in software paradigm. If desired, software need not have any knowledge of threads and can simply treat the hardware threads as individual processors connected coherently in a shared-memory environment. For example, an operating system such as Linux requires no software changes to run as a multi-way SMP operating system on a multithreaded processor. This allows software engineers to develop code with the same standard tool chains and operating systems used in past development efforts.
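From the software side, the parallel method can be as simple as running an unchanged per-packet handler on one worker per logical CPU. The following is a rough sketch of that programming model only. The routing table, packet layout, and function names are hypothetical, and Python threads stand in for the OS threads that a production packet path would use.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical forwarding table, invented for illustration.
ROUTES = {"10.0.0.1": "port-1", "10.0.0.2": "port-2"}


def handle_packet(packet: dict) -> str:
    """Unchanged single-threaded logic: one table lookup per packet."""
    return ROUTES.get(packet["dst"], "drop")


def process_all(packets, workers=None):
    """Process independent packets in parallel, one worker per logical CPU."""
    # Under an SMP OS, each hardware thread normally appears as a logical CPU.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_packet, packets))  # keeps input order


packets = [{"dst": "10.0.0.1"}, {"dst": "10.0.0.9"}, {"dst": "10.0.0.2"}]
print(process_all(packets))  # ['port-1', 'drop', 'port-2']
```

The handler itself has no knowledge of threads. Only the dispatch layer decides how many workers to run, which mirrors how single-image code can be reused on a multithreaded processor.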
