Real World Multicore Embedded Systems
Ebook, 965 pages (about 9 hours)

About this ebook

This Expert Guide gives you the techniques and technologies in embedded multicore to optimally design and implement your embedded system. Written by experts with a solutions focus, this encyclopedic reference is an indispensable aid for tackling the day-to-day problems of building and managing multicore embedded systems.

Following an embedded system design path from start to finish, our team of experts takes you from architecture through hardware implementation to software programming and debug.

With this book you will learn:

• What motivates multicore

• The architectural options and tradeoffs; when to use what

• How to deal with the unique hardware challenges that multicore presents

• How to manage the software infrastructure in a multicore environment

• How to write effective multicore programs

• How to port legacy code into a multicore system and partition legacy software

• How to optimize both the system and software

• The particular challenges of debugging multicore hardware and software

• Examples demonstrating timeless implementation details

• Proven and practical techniques reflecting the authors’ expertise built from years of experience and key advice on tackling critical issues
Language: English
Release date: Feb 27, 2013
ISBN: 9780123914613

Book preview

Real World Multicore Embedded Systems - Bryon Moyer

Chapter 1

Introduction and Roadmap

Bryon Moyer, Technology Writer and Editor, EE Journal

Chapter Outline

Multicore is here

Scope

Who should read this book?

Organization and roadmap

Concurrency

Architecture

High-level architecture

Memory architecture

Interconnect

Infrastructure

Operating systems

Virtualization

Multicore-related libraries

Application software

Languages and tools

Partitioning applications

Synchronization

Hardware assistance

Hardware accelerators

Synchronization hardware

System-level considerations

Bare-metal systems

Debug

A roadmap of this book

Multicore is here

Actually, multicore has been around for many years in the desktop and supercomputing arenas, but it has lagged in the mainstream embedded world. Now it is here for embedded as well.

Up until recently, multicore within embedded has been restricted primarily to two fields: mobile (assuming it qualifies as an embedded system, a categorization that not everyone agrees with) and networking. There have been multiple computing cores in phones for some time now. However, each processor typically owned a particular job – baseband processing, graphics processing, applications processing, etc. – and did that job independently, with little or no interaction with other processing cores. So multicore really wasn’t an issue then. That’s changed now that application processors in smartphones have multiple cores: it’s time to start treating them as full-on multicore systems.

Meanwhile, networking (or, more specifically, packet-processing) systems have used multicore for a long time, well before any tools were available to ease the multicore job. This has been a highly specialized niche, with rockstar programmers deeply imbued with the skills needed to extract the highest possible performance from their code. This has meant handcrafting for specific platforms and manual programming from scratch. This world is likely to retain its specialty designation because, even as multicore matures, the performance requirements of these systems demand manual care.

For the rest of embedded, multiple cores have become an unavoidable reality. And multicore has not been enthusiastically embraced for one simple reason: it’s hard. Or it feels hard. There’s been a huge energy barrier to cross to feel competent in the multicore world.

Some parts of multicore truly are hard, but as it reaches the mainstream, many of the issues that you might have had to resolve yourself have already been taken care of. There are now tools and libraries and even changes to language standards that make embedded multicore programming less of a walk in the wild.

And that’s where this book comes in. There have been people quietly working for years on solving and simplifying multicore issues for embedded systems. And let’s be clear: what works for desktops is not at all acceptable for embedded systems, with their limitations on size, resources and power. It has taken extra work to make some of the multicore infrastructure relevant to embedded systems.

Some of the people involved in those processes or simply with experience in using multicore have contributed from their vast knowledge to help you understand multicore for embedded. Most importantly, the intent is that, by taking in the various topics we’ve covered, you’ll cross over that energy barrier and be able to start doing the multicore work you need to do.

Scope

The term embedded system is broad and ill-defined. You probably know it when you see it, although community standards may vary. We won’t try to define what is included; it’s probably easier to say what isn’t included:

– desktop-style general computing (although desktop computers are sometimes harnessed for use in embedded applications)

– high-performance computing (HPC), the realm of supercomputers and massively parallel scientific and financial computing.

Many of the concepts discussed actually apply to both of those realms, but we will restrict examples and specifics to the embedded space, and there will be topics (like MPI, for example) that we won’t touch on.

Who should read this book?

This book is for anyone who will need to work on embedded multicore systems. That casts a wide net. It includes:

• Systems architects who are transitioning from single-core to multicore systems.

• Chip architects who have to implement sophisticated systems-on-chips (SoCs).

• Software programmers designing infrastructure and tools to support embedded multicore.

• Software programmers writing multicore applications.

• Software programmers taking sequential programs and rewriting them for multicore.

• Systems engineers trying to debug and optimize a multicore system.

This means that we deal with hardware, firmware/middleware, software, and tools. The one area we don’t deal with is actual hardware circuit design. We may talk about the benefits of hardware queuing, for example, but we won’t talk about how to design a hardware queue on a chip.

We have assumed that you are an accomplished engineer with respect to single-core embedded systems. So we’re not going to go too far into the realm of the basic (although some review is helpful for context). For example, we won’t describe how operating systems work in general – we assume you know that. We talk about those elements of operating systems that are specific to multicore.

Organization and roadmap

In order to give you the broad range of information that underpins multicore technology, we’ve divided the book into several sections. These have been ordered in a way that allows a linear reading from start to finish, but you can also dive into areas of particular interest directly if you have experience in the other areas. Each chapter is an independent essay, although we’ve tried to avoid too much duplication, so you will find some references from chapter to chapter.

Concurrency

We start with a review of the concept that underlies everything that matters in this book: concurrency. The reason everything changes with multicore is that our old assumption that one thing happens before another no longer holds true. More than one thing can happen at the same time, meaning we get more work done more quickly, but we also open up a number of significant challenges that we haven’t had to deal with before. Understanding concurrency in detail is important to making sense out of everything else that follows.

Architecture

The next section concedes the fact that hardware is expensive to design and that, therefore, hardware platforms will be created upon which software will be written. It’s nice to think that embedded systems are a perfect marriage between purpose-built hardware and software that have been co-designed, but, in reality, especially when designing an expensive SoC, the hardware must serve more than just one application.

High-level architecture

This chapter takes a broad view of multicore hardware architecture as it applies to embedded systems. Author Frank Schirrmeister focuses on the arrangement of processing cores, but he must necessarily include some discussion of memory and interconnect as well. The intent of this chapter is to help you understand either how to build a better architecture or how to select an architecture.

Memory architecture

Memory can be arranged in numerous different ways, each of which presents benefits and challenges. In this chapter, author Gitu Jain picks up on the more general description in the high-level architecture chapter and dives deeper into the implications not only of main memory, but also of cache memory and the various means by which multiple caches can be kept coherent.

Interconnect

Processors and memory are important, but the means by which they intercommunicate can have a dramatic impact on performance. It can also impact the cost of a chip to the extent that simply over-provisioning interconnect is not an option. Sanjay Deshpande provides a broad treatment of the considerations and trade-offs that apply when designing or choosing an interconnect scheme.

Infrastructure

Once the hardware is in place, a layer of services and abstraction is needed in order to shield applications from low-level details. This starts with something as basic and obvious as the operating system, but includes specialized libraries supporting things like synchronization.

Operating systems

Operating systems provide critical services and access to resources on computing platforms. But with embedded systems, the number of options for operating systems is far larger than it is for other domains. Some of those options can’t deal with multicore; others can. Some are big and feature-rich; others are small or practically non-existent. And they each have different ways of controlling how they operate. So in this chapter, with the assistance of Bill Lamie and John Carbone, I discuss those elements of operating systems that impact their performance in multicore systems, including the ability of associated tools to assist with software-level debugging.

Virtualization

The increased need for security and robustness has made it necessary to implement varying levels of virtualization in embedded systems. Author David Kleidermacher describes the many different ways in which virtualization can be implemented, along with the benefits and drawbacks of each option.

Multicore-related libraries

The details of multicore implementation can be tedious and error-prone to put into place. A layer of libraries and middleware can abstract application programs away from those details, making them not only more robust and easier to write, but also more portable between systems that might have very different low-level features. Author Max Domeika takes us on a tour of those multicore-related resources that are available to help ease some of the burden.

Application software

The goal of a good multicore system is to make the underlying configuration details as irrelevant as possible to the application programmer. That’s less possible for embedded systems, where programmers work harder to optimize their software to a specific platform. But multicore raises specific new issues that cannot be ignored by programmers.

Languages and tools

Some languages are better than others at handling the parallelism that multicore systems make possible. That said, some languages are more popular than others without regard to their ability to handle parallelism. Author Gitu Jain takes us through a variety of languages, covering their appropriateness for multicore as well as their prevalence.

Meanwhile, tailoring an application for a multicore platform suggests analysis and tools that don’t apply for single-core systems. Author Kenn Luecke surveys a range of tools from a number of sources that are applicable to multicore design.

Partitioning applications

Since multicore systems can do more than one thing at a time, application programs that were once sequential in nature can be teased apart to keep multiple cores busy. But that process of splitting up a program can be very difficult. I’m assisted by Paul Stravers in this chapter that describes the issues surrounding partitioning and then shows you how to do it both manually and with new specialized tools designed to help.

Synchronization

The success of various pieces of a program running in parallel to yield a correct result depends strongly on good synchronization between those different pieces. This concept lies at the heart of what can make or break a multicore application. Author Tom Dickens runs through a litany of dos and don’ts for keeping an application program on track as it executes.

Hardware assistance

While there may appear to be a bright line defining what is expected in hardware and what should be in software, there are some gray areas. In some cases, hardware can assist with specific functions to improve performance.

Hardware accelerators

There are times when a hardware block can increase the performance of some compute-intensive function dramatically over what software can manage. Those accelerators can be designed to operate in parallel with the cores, adding a new concurrency dimension. In this chapter, Yosinori Watanabe and I discuss hardware accelerators and their considerations for integration into embedded multicore systems.

Synchronization hardware

Much of the bookkeeping and other details required for correct functioning of a multicore system is handled by low-level software services and libraries such as those described in the earlier chapter on multicore libraries. But hardware infrastructure can help out here as well – to the point of the processor instruction set having an impact. So author Jim Holt takes a look at some important low-level hardware considerations for improving embedded system performance.

System-level considerations

We close out with system-level optimization concepts: bare-metal systems and debugging. These chapters complement the preceding material, leveraging all of the concepts presented in one way or another.

Bare-metal systems

The performance of some systems – notably, the packet-processing systems mentioned at the outset of this introduction – is so critical that the overhead of an operating system cannot be tolerated. This suggests a completely different way of doing multicore design. Sanjay Lal and I describe both hardware and software considerations in such a system.

Debug

Finally, debug can be much more difficult in a multicore system than in a single-core one. It’s no longer enough to stop and start a processor and look at registers and memory. You have multiple cores, multiple clocks, multiple sets of interrupts and handlers, and multiple memories and caches, each of which may be marching to the beat of a different drummer. Successful debug relies heavily on the existence of robust control and observability features, and yet these features entail costs that must be balanced. Author Neal Stollon discusses those trade-offs and trends for multicore debug.

A roadmap of this book

Figure 1.1 illustrates how the various chapters relate to each other and to system design. The particular depiction of the design process may feel oversimplified, and, in fact, it is. But, as with everything embedded, each project and process is different, so this should be viewed as a rough abstraction for the purposes of depicting chapter relevance.

Figure 1.1 A rough depiction of embedded multicore system development.

Based on this model:

• Everyone should read the Concurrency chapter.

• System architects and designers can jump into the various platform-design-related chapters, with Architecture, Memory, and Interconnect being the most fundamental. Hardware synchronization is important if designing accelerated infrastructure.

• For those engineers adapting an OS or virtualization platform or writing drivers or adapting libraries, there are chapters specifically introducing those topics.

• For application writers, Partitioning and Synchronization are important chapters. Hardware accelerators are also included here because our focus is not on designing the hardware for the accelerator, but rather on using it within the system: invoking it in software, writing drivers for it, and handling the concurrency that it can bring. There is also an overview of languages and tools as they pertain to multicore to help make choices at the start of a project.

• Integration and verification engineers can tackle the Debug chapter.

• System designers and programmers trying to extract maximal performance can take on the Bare-metal chapter.

That said, everyone can benefit from the topics they’re not specifically involved in. A hardware designer can design a better platform if he or she understands the real-world issues that software programmers face either when writing an application or building firmware. A programmer writing an application can write better code if he or she understands the nuances and trade-offs involved in the platform that will run the code.

Our goal is to bring you the experiences of those who have been tackling these issues since before it was mainstream to do so. Each of their chapters encapsulates years of learning what works and what doesn’t work so that you don’t have to repeat those lessons. We hope that this serves you well.

Welcome to the age of embedded multicore.

Chapter 2

The Promise and Challenges of Concurrency

Bryon Moyer, Technology Writer and Editor, EE Journal

Chapter Outline

Concurrency fundamentals

Two kinds of concurrency

Data parallelism

Functional parallelism

Dependencies

Producers and consumers of data

Loops and dependencies

Shared resources

Summary

The opportunities and challenges that arise from multicore technology – or any kind of multiple processor arrangement – are rooted in the concept of concurrency. You can loosely conceive of this as more than one thing happening at a time. But when things happen simultaneously, it’s very easy for chaos to ensue. If you create an assembly line to make burgers quickly in a fast food joint, with one guy putting the patty on the bun and the next guy adding a dab of mustard, things will get messy if the mustard guy doesn’t wait for a burger to be in place before applying the mustard. Coordination is key, and yet, as obvious as this may sound, it can be extremely challenging in a complex piece of software.

The purpose of this chapter is to address concurrency and its associated challenges at a high level. Specific solutions to the problems will be covered in later chapters.

Concurrency fundamentals

It is first important to separate the notion of inherent concurrency from that of implemented parallelization. A given algorithm or process may be full of opportunities for things to run independently from each other. An actual implementation will typically select from these opportunities a specific parallel implementation and go forward with that.

For example, in our burger-making example, you could make burgers more quickly if you had multiple assembly lines going at the same time. In theory, given an infinite supply of materials, you could make infinitely many burgers concurrently. However, in reality, you only have a limited number of employees and countertops on which to do the work. So you may actually implement, say, two lines even though the process inherently could allow more. In a similar fashion, the number of processors and other resources drives the decision on how much parallelism to implement.

It’s critical to note, however, that a chosen implementation relies on the inherent opportunities afforded by the algorithm itself. No amount of parallelization will help an algorithm that has little inherent concurrency, as we’ll explore later in this chapter.

So what you end up with is a series of program sections that can be run independently punctuated by places where they need to check in with each other to exchange data – an event referred to as synchronization.

For example, one fast food employee can lay a patty on a bun completely independently from someone else squirting mustard on a different burger. During the laying and squirting processes, the two can be completely independent. However, after they’re done, each has to pass his or her burger to the next guy, and neither can restart with a new burger until a new one is in place. So if the mustard guy is a lot faster than the patty-laying guy, he’ll have to wait idly until the new burger shows up. That is a synchronization point (as shown in Figure 2.1).

Figure 2.1 Where the two independent processes interact is a synchronization point.

A key characteristic here is the fact that the two independent processes may operate at completely different speeds, and that speed may not be predictable. Different employees on different shifts, for example, may go at different speeds. This is a fundamental issue for parallel execution of programs. While there are steps that can be taken to make the relative speeds more predictable, in the abstract, they need to be considered unpredictable. This concept of a program spawning a set of independent processes with occasional check-in points is shown in Figure 2.2.

Figure 2.2 A series of tasks run mutually asynchronously with occasional synchronization points.

Depending on the specific implementation, the independent portions of the program might be threads or processes (Figure 2.3). At this stage, we’re really not interested in those specifics, so to avoid getting caught up in that detail, they are often generically referred to as tasks. In this chapter, we will focus on tasks; how those tasks are realized, including the definitions of SMP and AMP shown in the figure, will be discussed in later chapters.

Figure 2.3 Tasks can be different threads within a process or different processes.

Two kinds of concurrency

There are fundamentally two different ways to do more than one thing at a time: bulk up so that you have multiple processors doing the same thing, or use division of labor, where different processors do different things at the same time.

Data parallelism

The first of those is the easiest to explain. Let’s say you’ve got a four-entry vector that you want to operate on. Let’s make it really simple for the sake of example and say that you need to increment the value of every entry in the vector. In a standard program, you would do this with a loop:
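As a minimal, illustrative C sketch (not the book’s own listing; the array name, values, and printout are invented), the sequential version might look like this:

    #include <stdio.h>

    int main(void) {
        int vector[4] = {0, 1, 2, 3};    /* four-entry vector; names and values are illustrative */

        /* sequential version: increment every entry, using i as the iterator */
        for (int i = 0; i < 4; i++) {
            vector[i] = vector[i] + 1;
        }

        for (int i = 0; i < 4; i++) {
            printf("%d ", vector[i]);
        }
        printf("\n");
        return 0;
    }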

This problem is exceedingly easy to parallelize. In fact, it belongs to a general category of problems whimsically called embarrassingly parallel (Figure 2.4). Each vector entry is completely independent and can be incremented completely independently. Given four processors, you could easily have each processor work on one of the entries and do the entire vector in ¼ the time it takes to do it on a single processor.

Figure 2.4 Embarrassingly parallel computation.

In fact, in this case, it would probably be even less than ¼ because you no longer have the need for an iterator – the i in the pseudocode above; you no longer have to increment i each time and compare it to 4 to see if you’re done (Figure 2.5).

Figure 2.5 Looping in a single core takes more cycles than multicore.

This is referred to as data parallelism; multiple instances of data can be operated on at the same time. The inherent concurrency allows a four-fold speed-up, although a given implementation might choose less if fewer than four processors are available.

Two key attributes of this problem make it so easy to parallelize:

– the operation being performed on one entry doesn’t depend on any other entry

– the number of entries is known and fixed.

That second one is important. If you’re trying to figure out how to exploit concurrency in a way that’s static – in other words, you know exactly how the problem will be parallelized at compile time – then the number of loop iterations must be known at compile time. A while loop or a for loop where the endpoint is calculated instead of constant cannot be so neatly parallelized because, for any given run, you don’t know how many parallel instances there might be.

Functional parallelism

The other way of splitting things up involves giving different processors different things to do. Let’s take a simple example where we have a number of text files and we want to cycle through them to count the number of characters in each one. We could do this with the following pseudo-program:
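As a minimal, illustrative C sketch (not the book’s own listing; the file names are invented), the sequential version might look like this:

    #include <stdio.h>

    int main(void) {
        const char *files[] = { "a.txt", "b.txt", "c.txt" };   /* illustrative file names */

        for (int f = 0; f < 3; f++) {
            FILE *fp = fopen(files[f], "rb");                  /* task 1: open the file    */
            if (fp == NULL) {
                continue;
            }

            long count = 0;                                    /* task 2: count characters */
            while (fgetc(fp) != EOF) {
                count++;
            }
            printf("%s: %ld characters\n", files[f], count);

            fclose(fp);                                        /* task 3: close the file   */
        }
        return 0;
    }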

We can take three processors and give each of them a different task. The first processor opens files; the second counts characters; and the third closes files (Figure 2.6).

Figure 2.6 Different cores performing different operations.

There is a fundamental difference between this and the prior example of data parallelism. In the vector-increment example, we took a problem that had been solved by a loop and completely eliminated the loop. In this new example, because of the serial nature of the three tasks, if you only had one loop iteration, then there would be no savings at all. It only works if you have a workload involving repeated iterations of this loop.

As illustrated in Figure 2.7, when the first file is opened, the second and third processors sit idle. After one file is open, then the second processor can count the characters, while the third processor is still idle. Only when the third file is opened do all processors finally kick in as the third processor closes the first file. This leads to the descriptive term pipeline for this kind of arrangement, and, when executing, it doesn’t really hit its stride until the pipeline fills up. This is also referred to as loop distribution because the duties of one loop are distributed into multiple loops, one on each processor.

Figure 2.7 The pipeline isn’t full until all cores are busy.

This figure also illustrates the fact that using this algorithm on only one file provides no benefit whatsoever.

Real-world programs and algorithms typically have both inherent data and functional concurrency. In some situations, you can use both. For example, if you had six processors, you could double the three-processor pipeline to work through the files twice as fast. In other situations, you may have to decide whether to exploit one or the other in your implementation.

One of the challenges of a pipeline lies in what’s called balancing the pipeline. Execution can only go as fast as the slowest stage. In Figure 2.7, opening files is shown as taking longer than counting the characters. In that situation, counting faster will not improve performance; it will simply increase the idle time between files.

The ideal situation is to balance the tasks so that every pipeline stage takes the same amount of time; in practice, this is so difficult as to be more or less impossible. It becomes even harder when different iterations take more or less time. For instance, it will presumably take longer to count the characters in a bigger file, so really the times for counting characters above should vary from file to file. Now it’s completely impossible to balance the pipeline perfectly.

Dependencies

One of the keys to the simple examples we’ve shown is the independence of operations. Things get more complicated when one calculation depends on the results of another. And there are a number of ways in which these dependencies crop up. We’ll describe some basic cases here, but a complete theory of dependencies can be quite intricate.

It bears noting here that this discussion is intended to motivate some of the key challenges in parallelizing software for multicore. In general, one should not be expected to manually analyze all of the dependencies in a program in order to parallelize it; tools become important for this. For this reason, the discussion won’t be exhaustive, and will show concept examples rather than focusing on practical ways of dealing with dependencies, which will be covered in the chapter on parallelizing software.

Producers and consumers of data

Dependencies are easier to understand if you think of a program as consisting of producers and consumers of data (Figure 2.8). Some part of the program does a calculation that some other part will use: the first part is the producer and the second is the consumer. This happens at very fine-grained instruction levels and at higher levels, especially if you are taking an object-oriented approach – objects are also producers and consumers of data.

Figure 2.8 Producers and consumers at the fine- and coarse-grained level. Entities are often both producers and consumers.

At its most basic, a dependency means that a consumer of data must wait to consume its data until the producer has produced the data (Figure 2.9). The concept is straightforward, but the implications vary depending on the language and approach taken. At the instruction level, many compilers have been designed to exploit low-level concurrency, doing things like instruction reordering to make execution more efficient while making sure that no dependencies are violated.

Figure 2.9 A consumer cannot proceed until it gets its data from the producer.

It gets more complicated with languages like C that allow pointers. The concept is the same, but compilers have no way of understanding how various pointers relate, and so can’t do any optimization. There are two reasons why this is so: pointer aliasing and pointer arithmetic.

Pointer aliasing is an extremely common occurrence in a C program. If you have a function that takes a pointer to, say, an image as a parameter, that function may name the pointer imagePtr. If a program needs to call that function on behalf of two different images – say, leftImage and rightImage – then when the function is called with leftImage as the parameter, leftImage and imagePtr will refer to the same data. When called for rightImage, rightImage and imagePtr will point to the same data (Figure 2.10).

Figure 2.10 Different pointers may point to the same locations at different times.
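A small, hypothetical C illustration of this aliasing pattern (the Image type and the function names below are invented for the example, not taken from the book):

    typedef struct {
        int width;
        int height;
        unsigned char *pixels;
    } Image;                              /* invented image type for illustration */

    /* inside this function, imagePtr aliases whichever image the caller passed in */
    void adjust_image(Image *imagePtr) {
        /* ... operate on the data through imagePtr ... */
        (void)imagePtr;
    }

    void process_pair(Image *leftImage, Image *rightImage) {
        adjust_image(leftImage);          /* here imagePtr and leftImage refer to the same data  */
        adjust_image(rightImage);         /* here imagePtr and rightImage refer to the same data */
    }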

This is referred to as aliasing because a given piece of data may be accessed by variables of different names at different times. There’s no way to know statically what the dependencies are, not only because the names look completely different, but also because they may change as the program progresses. Thorough dynamic analysis is required to understand the relationships between pointers.

Pointer arithmetic can also be an obvious problem because, even if you know where a pointer starts out, manipulating the actual address being pointed to can result in the pointer pointing pretty much anywhere (including address 0, something every C programmer has managed at least once in his or her life). Where it ends up pointing may or may not correlate to a memory location associated with some other pointer (Figure 2.11).

Figure 2.11 Pointer arithmetic can cause a pointer to refer to some location in memory that may or may not be pointed to by some other pointer.

For example, when scanning through an array with one pointer to make changes, it may be very hard to understand that some subsequent operation, where a different pointer scans through the same array (possibly using different pointer arithmetic), will read that data (Figure 2.12). If the second scan consumes data that the first scan was supposed to put into place, then parallelizing those as independent will cause the program to function incorrectly. In many cases, this dependency cannot be identified by static inspection; the only way to tell is to notice at run time that the pointers address the same space.

Figure 2.12 Two pointers operating on the same array create a dependency that isn’t evident by static inspection.
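A small, hypothetical fragment showing the situation (the array and the access patterns are invented for the example):

    int data[16];
    int sum = 0;

    /* first scan: one pointer walks the array and writes each element */
    for (int *p = data; p < data + 16; p++) {
        *p = (int)(p - data);             /* producer: writes a value into each location */
    }

    /* second scan: a different pointer, using different arithmetic, reads the same memory */
    for (int i = 0; i < 16; i++) {
        int *q = data + (15 - i);         /* consumer: depends on the data written above */
        sum += *q;
    }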

These dependencies are based on a consumer needing to wait until the producer has created the data: writing before reading. The opposite situation also exists: if a producer is about to rewrite a memory location, you want to be sure that all consumers of the old data are done before you overwrite the old data with new data (Figure 2.13). This is called an anti-dependency. Everything we’ve discussed about dependencies also holds for anti-dependencies except that this is about waiting to write until all the reads are done: reading before writing.

Figure 2.13 The second pointer must wait before overwriting data until the first pointer has completed its read, creating an anti-dependency.
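A tiny, hypothetical fragment of an anti-dependency (variable names invented): the read of the old value must complete before the location is overwritten.

    int shared = 1;                /* location that will be rewritten                */

    int consumed = shared * 2;     /* consumer reads the old value first             */
    shared = 42;                   /* producer overwrites it: a write-after-read     */
                                   /* that must not be reordered before the read     */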

This has been an overview of dependencies; they will be developed in more detail in the Partitioning chapter.

Loops and dependencies

Dependencies become more complex when loops are involved – and in programs being targeted for parallelization, loops are almost always involved. We saw above how an embarrassingly parallel loop can be parallelized so that each processor gets one iteration of the loop. Let’s look at an example that’s slightly different.
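As an illustrative C sketch (not the book’s own listing), the modified loop might look like this, with each iteration using the result of the previous one:

    int vector[4] = {0, 1, 2, 3};

    /* each iteration consumes the value produced by the previous iteration;       */
    /* the sketch starts at 1 because the first iteration is glossed over below    */
    for (int i = 1; i < 4; i++) {
        vector[i] = vector[i - 1] + 1;   /* depends on the result of iteration i-1 */
    }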

Note that in this and all examples like this, I’m ignoring what happens for the first iteration, since that detail isn’t critical for the discussion.

This creates a subtle change because each loop iteration produces a result that will be consumed in the next loop iteration. So the second loop iteration can’t start until the first iteration has produced its data. This means that the loop iterations can no longer run exactly in parallel: each of these parallel iterations is offset from its predecessor (Figure 2.14). While the total computation time is still less than required to execute the loop on a single processor, it’s not as fast as if there were no dependencies between the loop iterations. Such dependencies are referred to as loop-carry (or loop-carried) dependencies.

Figure 2.14 Even though iterations are parallelized, each must wait until its needed data is produced by the prior iteration, causing offsets that increase overall computation time above what would be required for independent iterations.

It gets even more complicated when you have nested loops iterating across multiple iterators. Let’s say you’re traversing a two-dimensional matrix using i to scan along a row and using j to scan down the rows (Figure 2.15).

Figure 2.15 4×4 array with i iterating along a row (inner loop) and j iterating down the rows (outer loop).

And let’s assume further that a given cell depends on the new value of the cell directly above it (Figure 2.16):
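An illustrative C sketch of such a loop nest (not the book’s own listing; a 4×4 integer array is assumed):

    int A[4][4] = {0};

    /* i scans along a row (inner loop); j scans down the rows (outer loop) */
    for (int j = 1; j < 4; j++) {
        for (int i = 0; i < 4; i++) {
            A[j][i] = A[j - 1][i] + 1;   /* depends on the new value in the cell directly above */
        }
    }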

Figure 2.16 Each cell gets a new value that depends on the new value in the cell in the prior row.

First of all, there are lots of ways to parallelize this code, depending on how many cores we have. If we were to go as far as possible, we would need 16 cores since there are 16 cells. Or, with four cores, we could assign one row to each core.

If we did the latter, then we couldn’t start the second row until the first cell of the first row was calculated (Figure 2.17).

Figure 2.17 If each row gets its own core, then each row must wait until the first cell in the prior row is done before starting.

If we completely parallelized it, then we could start all of the first-row entries at the same time, but the second-row entries would have to wait until their respective first-row entries were done (Figure 2.18).

Figure 2.18 An implementation that assigns each cell to its own core.

Note that using so many cores really doesn’t speed anything up: using only four cores would do just as well since only four cores would be executing at any given time (Figure 2.19). This implementation assigns one column to each core, instead of one row, as is done in Figure 2.17. As a result, the loop can be processed faster because no core has to wait for any other core. There is no way to parallelize this set of nested loops any further because of the dependencies.

Figure 2.19 Four cores can implement this loop in the same time as 16.

Such nested loops give rise to the concept of loop distance. Each iterator gets a loop distance. So in the above example, in particular as shown in Figure 2.16, where the arrows show the dependency, the loop distance for i is 0 since there is no dependency; the loop distance for j is 1, since the data consumed in one cell depends on the cell directly above it, which is the prior j iteration. As a vector, the loop distance for i and j is [0,1].

If we changed the code slightly to make the dependency on j−2 instead of j−1:
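Continuing the same illustrative sketch, the dependency would move two rows up:

    for (int j = 2; j < 4; j++) {
        for (int i = 0; i < 4; i++) {
            A[j][i] = A[j - 2][i] + 1;   /* now depends on the cell two rows above */
        }
    }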

then the loop distance for j is 2, as shown in Figure 2.20.

Figure 2.20 Example showing j loop distance of 2.

This means that the second row doesn’t have to wait for the first row, since it no longer depends on the first row. The third row, however, does have to wait for the first row (Figure 2.21). Thus we can parallelize further with more cores, if we wish, completing the task in half the time required for the prior example.

Figure 2.21 With loop distance of 2, two rows can be started in parallel.

While it may seem obscure, the loop distance is an important measure for synchronizing data. It’s not a matter of one core producing data and the other immediately consuming it; the consuming core may have to wait a number of iterations before consuming the data, depending on how things are parallelized. While it’s waiting, the producer continues with its iterations, writing more data. Such data can be, for example, written into some kind of first-in/first-out (FIFO) memory, and the loop distance determines how long that FIFO has to be. This will be discussed more fully in the Partitioning chapter.
