Multicore DSP: From Algorithms to Real-time Implementation on the TMS320C66x SoC
Ebook · 1,171 pages · 6 hours


About this ebook

The only book to offer special coverage of the fundamentals of multicore DSP for implementation on the TMS320C66x SoC.

This unique book provides readers with an understanding of the TMS320C66x SoC as well as its constraints. It offers critical analysis of each element, which not only broadens readers' knowledge of the subject but also helps them understand how these elements work together.

Written by Naim Dahnoun, winner of Texas Instruments' first DSP Educator Award, the book teaches readers how to use the development tools and how to obtain the maximum performance and functionality from this processor. Its content spans architecture, development tools and programming models, such as OpenCL and OpenMP, through to debugging tools, and it covers various multicore audio and image applications in detail. Additionally, this one-of-a-kind book is supplemented with:

  • A rich set of tested laboratory exercises and solutions
  • Source code for audio and image processing applications for Code Composer Studio (the integrated development environment from Texas Instruments)
  • Multiple tables and illustrations

With no other book on the market offering comparable coverage of the subject, and with rich content spanning twenty chapters, Multicore DSP: From Algorithms to Real-time Implementation on the TMS320C66x SoC is a rare and much-needed source of information for undergraduates and postgraduates in the field, allowing them to get real-time applications working in a relatively short period of time. It is also highly beneficial to hardware and software engineers involved in programming real-time embedded systems.

Language: English
Publisher: Wiley
Release date: Nov 30, 2017
ISBN: 9781119003854


    Book preview

    Multicore DSP - Naim Dahnoun

    1

    Introduction to DSP

    CHAPTER MENU

    1.1 Introduction

    1.2 Multicore processors

    1.2.1 Can any algorithm benefit from a multicore processor?

    1.2.2 How many cores do I need for my application?

    1.3 Key applications of high‐performance multicore devices

    1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs

    1.5 Challenges faced for programming a multicore processor

    1.6 Texas Instruments DSP roadmap

    1.7 Conclusion

    References

    Learning how to master a system‐on‐chip (SoC) can be a long, daunting process, especially for the novice. However, keeping in mind the big picture and understanding why a specific piece of hardware or software is used will remove the complexity in the details.

    The purpose of this chapter is to give an overview of the need for multicore processors, to list the different types of multicore processor and to introduce the KeyStone processors that are the subject of this book.

    1.1 Introduction

    Today's microprocessors are based on switching devices that alternate between two states, ON and OFF, representing 1s and 0s. Up to now, the transistor has been the only practical device for this purpose, and making transistors small, fast and low-power has always been the challenge for chip manufacturers. From the 1960s, as predicted by Gordon Moore (Moore's law), the number of transistors that could be fitted in an integrated circuit doubled roughly every 24 months [1]. This was made possible by new materials, the development of chip process technology and especially advances in photolithography, which pushed the transistor size from 10 µm in the 1960s to about 10 nm today.

    As transistors scaled, industry took advantage not only of the growing transistor count but also of higher clock speeds, using various architectural enhancements: instruction-level parallelism (ILP), achieved by superscalar execution (issuing and executing multiple instructions simultaneously), pipelining (where different phases of instructions overlap) and out-of-order execution (instructions are executed in an order chosen dynamically); power-efficient cache hierarchies; and power-aware software designs such as compilers for low power consumption [2] and low-power or variable-length instructions. However, the increase in clock frequency was not sustainable, as power consumption became such a severe constraint that commercial devices could no longer be produced this way. In fact, chip manufacturers have abandoned the idea of continually increasing the clock frequency because it was technically challenging and costly and because power consumption was a real issue, especially for mobile computing devices such as smartphones and handheld computers and for high-performance computers. More recently, static power consumption has also become a concern as transistors scale, and therefore both dynamic and static power have to be considered. It is also worth noting at this stage that an increase in operating frequency requires an increase in power consumption that is not linear with frequency, as one might assume.

    This is because an increase in frequency also requires an increase in voltage; for instance, a 50% increase in frequency may require roughly a 35% increase in voltage [2].
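    As a quick numerical illustration of this non-linearity (a minimal sketch, not code from the book, assuming the usual first-order CMOS dynamic power model P = C · V^2 · f), the short C program below computes how much the dynamic power grows when the frequency is raised by 50% and the voltage by the 35% quoted above.

    /* Dynamic power scaling sketch.
     * Assumes the standard first-order CMOS dynamic power model
     * P = C_eff * V^2 * f (an assumption, not stated in the text)
     * and the figures quoted above: +50% frequency, +35% voltage.
     */
    #include <stdio.h>

    int main(void)
    {
        double freq_scale    = 1.50;  /* 50% higher clock frequency */
        double voltage_scale = 1.35;  /* 35% higher supply voltage  */

        /* P'/P = (V'/V)^2 * (f'/f); the capacitance C_eff cancels. */
        double power_scale = voltage_scale * voltage_scale * freq_scale;

        printf("frequency x%.2f, voltage x%.2f -> dynamic power x%.2f\n",
               freq_scale, voltage_scale, power_scale);
        return 0;
    }

    Under these assumptions, a 50% increase in frequency costs roughly 2.7 times the dynamic power, which is why frequency scaling alone became unsustainable.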

    To overcome this frequency plateau, processor manufacturers such as Texas Instruments (TI), ARM and Intel found that, by keeping the frequency at an acceptable level and increasing the number of cores, they could support many application domains that require high performance and low power. Multicore processors are not a new idea; for instance, TI introduced a 5-core processor in 1995 (TMS320C8x), a 2-core processor in 1998 (TMS320C54x) and the OMAP (Open Multimedia Application Platform) family in 2002 [3], and Lucent produced a 3-core processor in 2000. However, manufacturers and users were not very interested in multicore at the time, as increases in processor frequency were sufficient to satisfy the market, and multicore processors were complex and lacked real software support.

    Ideally, a multicore processor should have the following features:

    Low power

    Low cost

    Small size (small form factor)

    High compute‐performance

    Compute‐performance that can scale through concurrency

    Software support (OpenMP, OpenCL etc.)

    Good development and debugging tools

    Efficient operating system(s)

    Good embedded debugging tools

    Good technical support

    Ease of use

    Chip availability.

    It is important to stress that developing hardware alone is not enough; software plays a very important role. In fact, silicon manufacturers are now introducing software techniques to leverage the inherent parallelism available on their devices and attract users. For instance, NVIDIA introduced CUDA, and TI supports Open Event Machine (OpenEM), Open Multi-Processing (OpenMP) and Open Computing Language (OpenCL) to leverage the performance and reduce the time to market.

    In the embedded computing market, the decision whether to select a digital signal processor (DSP), a CPU (such as an x86 or an ARM), a GPU or a field-programmable gate array (FPGA) has become very complex, and making the wrong decision can be very costly, if not catastrophic, when large volumes are involved; for instance, a one-dollar difference per unit across one million products amounts to one million dollars. For low volumes, on the other hand, it can be worthwhile to select a more expensive device once development time and future upgrades are taken into account. Factors such as cost, performance per watt, ease of use, time to market, hardware and software support and chip availability can help in selecting the right device, or combination of devices, for a specific application.

    For embedded high‐processing‐power systems, the main competing types of devices are the DSPs, FPGAs and GPUs.

    1.2 Multicore processors

    The main features of a multicore device are high performance, scalability and low power consumption. There are two main types of multicore processor: homogeneous, also known as symmetric multiprocessing (SMP), and heterogeneous, also known as asymmetric multiprocessing (AMP). A homogeneous processor, such as the KeyStone I family of processors [4], has a set of identical cores, whereas a heterogeneous processor, such as the KeyStone II (the second-generation KeyStone architecture) [5], combines different types of core. From a hardware perspective, AMP offers more flexibility for tackling a wider range of complex applications at lower power consumption. However, heterogeneous devices can be more complex to program, as different cores may run different operating systems and use different memory structures that need to be interfaced for data exchange and synchronisation. That said, this is not always the case when supporting tools are available. For instance, the KeyStone II, a heterogeneous processor, is preferred by many programmers because the ARM cores give access to the rich set of library functions provided by the Linux community and, when using OpenCL, the user can dispatch tasks from the ARMs to the DSPs without dealing with the underlying memory.
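    As an illustration of this dispatch model, the following is a minimal, generic OpenCL host-side sketch (not taken from the book and not specific to TI's KeyStone OpenCL implementation) that offloads a vector-addition kernel to an accelerator device; the assumption that the DSPs appear as a CL_DEVICE_TYPE_ACCELERATOR device, and the kernel itself, are illustrative. Error handling is kept minimal for brevity.

    /* Minimal OpenCL host sketch: dispatch a kernel from the host cores
     * (e.g. the ARMs) to an accelerator device (e.g. the DSPs). The
     * OpenCL runtime manages the device-side memory, so the host never
     * deals with the accelerator's addresses directly.
     */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void vadd(__global const float *a,\n"
        "                   __global const float *b,\n"
        "                   __global float *c)\n"
        "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }\n";

    int main(void)
    {
        enum { N = 1024 };
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id plat; cl_device_id dev; cl_int err;
        clGetPlatformIDs(1, &plat, NULL);
        /* Assumption: the DSPs are exposed as an ACCELERATOR device. */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", &err);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(a), a, &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(b), b, &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

        printf("c[10] = %.1f\n", c[10]);  /* expect 30.0 */

        clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }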

    1.2.1 Can any algorithm benefit from a multicore processor?

    To show the advantages and limitations of multicore processors, let's first explore Amdahl's law [6], which states that the performance improvement obtained by parallelising a code is limited by the part of the code that cannot be parallelised. Figure 1.1 shows an original code composed of a serial part and a part that can be parallelised, together with the same code after parallelisation.

    Figure 1.1 The impact of the serial code that cannot be parallelised on the performance.

    If we define the speed-up S(N) as the ratio of the original execution time to the optimised execution time, as shown in Equation (1.1), where Ts is the time taken by the code that must remain serial, Tp the time taken by the code that can be parallelised and N the number of cores, then if Tp = 0 the speed-up S(N) is equal to 1 and nothing is gained. If, on the other hand, Ts is equal to Tp, the speed-up approaches 2 as the number of cores grows.

    If N is large, then Tp/N ≈ 0 and Equation (1.1) reduces to Equation (1.2), which shows that the serial code becomes dominant.

    (1.1)  $S(N) = \dfrac{T_s + T_p}{T_s + T_p / N}$

    (1.2)  $S(N) \approx \dfrac{T_s + T_p}{T_s} \quad (N \ \text{large})$

    Knowing the fraction p of the code that can be parallelised, one can derive Amdahl's law as shown in Equation (1.3) by replacing Ts by T · (1 − p) and Tp by T · p in Equation (1.1), where T = Ts + Tp is the original execution time.

    (1.3)  $S(N) = \dfrac{1}{(1 - p) + p / N}$

    Plotting S(N), as shown in Figure 1.2, reveals that a high number of cores does not increase the speed-up of an application with a low percentage of parallel code. For instance, if the parallel portion of the code is 50%, increasing the number of cores beyond 16 brings no real benefit.

    Figure 1.2 Amdahl's law: speed-up versus number of processors for parallel portions of 50%, 75%, 90% and 95% [7].
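    As a quick numerical check of Equation (1.3) and Figure 1.2 (a small sketch, not code from the book), the C program below tabulates the speed-up for the four parallel portions plotted in the figure.

    /* Tabulate Amdahl's law, Equation (1.3): S(N) = 1 / ((1 - p) + p/N). */
    #include <stdio.h>

    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double portion[] = { 0.50, 0.75, 0.90, 0.95 };
        const int    cores[]   = { 1, 2, 4, 8, 16, 64, 1024 };

        printf("    N");
        for (int j = 0; j < 4; j++)
            printf("   p=%2.0f%%", 100.0 * portion[j]);
        printf("\n");

        for (int i = 0; i < 7; i++) {
            printf("%5d", cores[i]);
            for (int j = 0; j < 4; j++)
                printf("  %6.2f", amdahl(portion[j], cores[i]));
            printf("\n");
        }
        return 0;
    }

    For p = 50% the speed-up saturates just below 2, which is why adding cores beyond about 16 brings no visible benefit in Figure 1.2.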

    In Figure 1.1, the time it takes for the cores to communicate is not shown. This is not the case in real applications, where communication and synchronisation times between cores pose a real challenge: the more cores are used, the more time is consumed by communication and synchronisation between them, as illustrated in Figure 1.3. It will be shown in this book that increasing the number of cores does not necessarily increase the performance, and that parallelism also brings with it the potential for deadlocks and race conditions that are difficult to debug.

    Figure 1.3 The inter-processor communication effect.
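    One simple way to quantify this effect (an illustrative model, not the book's) is to add a communication cost Tc for every extra core to the parallelised time, so that the speed-up becomes S(N) = (Ts + Tp) / (Ts + Tp/N + (N − 1) · Tc). Beyond some N, the overhead term grows faster than the parallel term shrinks and the speed-up starts to fall, as the sketch below shows.

    /* Amdahl's law extended with a linear communication-overhead term
     * (illustrative model): S(N) = (Ts + Tp) / (Ts + Tp/N + (N-1)*Tc).
     */
    #include <stdio.h>

    int main(void)
    {
        const double Ts = 10.0;  /* serial time (arbitrary units)          */
        const double Tp = 90.0;  /* parallelisable time                    */
        const double Tc = 0.5;   /* communication cost per additional core */

        for (int n = 1; n <= 64; n *= 2) {
            double speedup = (Ts + Tp) / (Ts + Tp / n + (n - 1) * Tc);
            printf("N = %2d  ->  speed-up = %.2f\n", n, speedup);
        }
        return 0;
    }

    With these arbitrary figures the speed-up peaks at around 16 cores and then degrades, which is exactly the behaviour described above.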

    The second‐generation KeyStone architecture (heterogeneous multicores) provides a better workload balance by distributing specific jobs to specific cores (the right core for the right job!).

    1.2.2 How many cores do I need for my application?

    Figure 1.2 showed that not all applications scale with the number of cores. There are three scenarios that need to be considered:

    Scenario 1. This scenario has been discussed previously and is the case when an algorithm is composed of serial and parallel code. In this case, the number of cores to be used will depend on the parallel code and on the application. For instance, the example shown in Figure 1.4 can run on five cores or on three cores, since core 0 can be reused to process part of the parallel code and one of cores 1, 2 or 3 can be reused to run the final serial code.

    Scenario 2. Some applications require different algorithms running sequentially. Consider the application shown in Figure 1.5. This application captures two videos of a road and performs a disparity calculation using the two videos, then performs surface fitting to extract the surface of the road. Thresholding then removes the outliers (the road surface), and connected component labelling and detection are used to identify the various objects below or above the road. In this application each core can perform one function, and therefore six cores can perform six different jobs. More cores will not increase the performance if a core is not up to the task allocated to it (a sketch of this kind of mapping is given after Figure 1.6).

    Scenario 3. This scenario is a combination of scenarios 1 and 2. If we consider again the example shown in Figure 1.5, and the disparity calculation requires more processing power (as it would in a practical situation), then more cores will be required, as illustrated in Figure 1.6. In this application, eight cores will be required.


    Figure 1.4 Example where three cores can perform the task required by the parallel code.


    Figure 1.5 Example where cores are processing different algorithms.

    Figure 1.6 Example when serial code and parallel code are processed simultaneously (stereo data capture, disparity calculation, surface fitting, thresholding and detection).
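    To make Scenario 2 concrete, the sketch below (not code from the book) assigns one hypothetical stage function per OpenMP thread using the sections construct; the stage functions are empty stubs standing in for the real algorithms, and in a real streaming application the stages would run repeatedly on successive frames with buffers between them.

    /* Sketch of Scenario 2: each core runs a different algorithm stage.
     * The stage functions are stubs; OpenMP 'sections' is used simply to
     * hand each function to a different thread (and hence, typically, core).
     * Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.
     */
    #include <stdio.h>
    #include <omp.h>

    static void stereo_capture(void)  { printf("capture   on thread %d\n", omp_get_thread_num()); }
    static void disparity(void)       { printf("disparity on thread %d\n", omp_get_thread_num()); }
    static void surface_fitting(void) { printf("fitting   on thread %d\n", omp_get_thread_num()); }
    static void thresholding(void)    { printf("threshold on thread %d\n", omp_get_thread_num()); }
    static void labelling(void)       { printf("labelling on thread %d\n", omp_get_thread_num()); }
    static void detection(void)       { printf("detection on thread %d\n", omp_get_thread_num()); }

    int main(void)
    {
        #pragma omp parallel sections num_threads(6)
        {
            #pragma omp section
            stereo_capture();
            #pragma omp section
            disparity();
            #pragma omp section
            surface_fitting();
            #pragma omp section
            thresholding();
            #pragma omp section
            labelling();
            #pragma omp section
            detection();
        }
        return 0;
    }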

    1.3 Key applications of high‐performance multicore devices

    Reducing the operating clock frequency of the multiple processor cores and innovating in inter-core communication on a single chip have led to a myriad of applications, with new ones revealed every day, limited only by our own imagination. These applications range from scientific simulation, seismic wave imaging, avionics and defence, communications and telecommunications, consumer electronics, video and imaging, industrial, medical, security and space to high-performance computing (HPC). In turn, HPC is opening another window of scientific applications, such as advanced manufacturing, earth-system modelling and weather forecasting, life science and big data analytics. Access to such machines is costly. However, the arrival of low-cost, low-power, high-performance multicore processors is providing engineers and scientists with unprecedented low-cost tools.

    HPC requires the floating-point arithmetic that is essential for scientific applications, and therefore performance is measured in floating-point operations per second (FLOPS). For instance, at the time of writing this book, the Sunway TaihuLight was number one according to TOP500.org [8, 9]. Developed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC), it contained 10,649,600 cores with a peak performance of 125.4 petaflops (PFLOPS) and consumed 15.3 MW; see the list of the top ten supercomputers in Table 1.1. To put this in perspective, it has been reported that Google's data centres draw around 260 MW, whereas a nuclear power station generates around 500–4000 MW [10]. Also at the time of writing, the Shoubu supercomputer from RIKEN was the most energy-efficient supercomputer, ranked first on the Green500 list [11]. The KeyStone SoC, with its power efficiency and high performance, is gaining momentum for use in green HPC; for instance, PayPal, a leader in online transaction processing, is using Hewlett-Packard's Moonshot system, which is based on the KeyStone II SoC.

    Table 1.1 Top 10 supercomputers, November 2016 [9]

    The development of an application for an SoC like the KeyStone can be a very long process: an idea is generated, algorithms are developed, selected algorithms are optimised, and they are then normally evaluated in a programming language such as MATLAB or Python, depending on the application. Some algorithms are then developed in Visual Studio or a similar integrated development environment (IDE) to quickly test and debug the application, since the user can, for instance, rely on libraries such as OpenCV for capturing real video or audio signals from a device, something that is unlikely to be supported directly on an SoC. The code is then translated into C/C++ and ported to the SoC. This last step is not trivial and can be daunting even for experienced engineers, as they need to master C/C++, linear assembly/assembly, MPI (Message Passing Interface), OpenMP (Open Multi-Processing), OpenEM (Open Event Machine) and OpenCL (Open Computing Language), in addition to knowing the functionality of the various peripherals and understanding the Linux and SYS/BIOS operating systems and development tools such as Code Composer Studio. These tools are hardware-centric and require a good understanding of the underlying hardware if maximum performance is to be achieved, especially when multicore programming is involved.

    Increasing a multicore’s performance is a twofold process: (1) to make the sequential part of the code run faster and (2) to exploit the parallelism offered by the multicore.
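    As a small, generic illustration of point (2) (a sketch, not code from the book), the loop below uses OpenMP, one of the programming models covered later in this book, to spread a dot-product computation across the available cores.

    /* Exploiting data parallelism with OpenMP: the loop iterations are
     * distributed across the available cores and the partial sums are
     * combined by the reduction clause.
     * Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.
     */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static float a[N], b[N];

    int main(void)
    {
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += (double)a[i] * b[i];

        printf("dot product = %.1f (using up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }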

    1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs

    In the past, FPGAs were the first choice only for applications that were not constrained by size or power consumption, which is why there were not many commercial embedded devices that used them. Recently, however, FPGA SoCs have integrated low-power, software-programmable processing cores with the hardware programmability of an FPGA, like the Zynq-7000 from Xilinx, which targets embedded applications such as small-cell base stations, multi-camera driver-assistance systems and so on. Critics may still argue that their power consumption and size are not comparable to those of multicore DSPs, and despite the further advantages of configurability, reconfigurability and programmability, an FPGA remains unattractive when time to market, maintenance and upgrades are issues. A comparison between FPGAs and multicore SoCs can be found in Ref. [12]; see Table 1.2. FPGAs also contribute to the development of SoCs that were traditionally designed using application-specific integrated circuits (ASICs), which carry a substantial cost and time to market (they cost millions of dollars and take months to develop).
