    Book preview

    Parallel and High Performance Programming with Python - Fabio Nelli

    CHAPTER 1

    Introduction to Parallel Programming

    In this first chapter of the book, we will introduce parallel programming by covering all the fundamental concepts involved, which are necessary to fully understand its features and uses. We will first talk about the hardware components that have made parallel execution possible on modern computers, such as CPUs and cores, and then about the entities of the operating system that are the real actuators of parallelism: processes and threads. Subsequently, the programming models of parallelism will be illustrated in detail, introducing fundamental concepts such as concurrency, synchronicity, and asynchronicity.

    Once these general concepts have been introduced, we will see the peculiarities of Python in this area, especially with threads, talking about the Global Interpreter Lock (GIL) and the problems it introduces. We will mention standard Python library modules such as threading and multiprocessing which we will cover in more depth in the next chapters. Finally, we will close the chapter by talking about the evaluation methods of a parallel program, such as speedup and scaling, and discussing the problems that can be introduced by programming in parallel (race condition, deadlock, and so on).

    By the end of this chapter, you will have understood all the fundamental concepts and terminology behind parallel programming. You will have built a general picture in your mind of all the protagonists of parallel execution and of how they act to achieve it. Then, you will be ready to tackle the practical part of programming covered in the following chapters.

    Structure

    In this chapter, we will discuss the following topics:

    CPU and cores

    Processes and threads

    Parallel and concurrent programming

    GIL and threads with Python

    Speedup and scaling

    Parallel programming

    If you are reading this book, it is certainly because you have already understood the need to increase the potential of your code, having discovered the limits of traditional models that follow, for historical reasons (the limitations of older computers), a serial approach.

    The advent of new hardware technologies has given us the opportunity to be able to run multiple programs simultaneously on our computers. In fact, our computers, even the simplest ones, have a multi-core system that allows programs to run in parallel. Why not take advantage of this architecture then?

    Too often you have found yourself developing a Python program to perform a series of operations. In the scientific field, for example, it is often necessary to implement a series of algorithms to carry out very laborious calculations. But at the end of your work, running the program on your computer, you discover with disappointment that it is not as fast as you hoped, and that the execution times become too long as the size of the problem grows. And it's not just a speed issue. More and more frequently, today, we have to deal with ever larger amounts of data, and with the calculations related to it, programs need ever greater memory resources, which our computers, despite their power, struggle to provide.

    Parallel programming allows you to execute parts of the code of one of our programs simultaneously, significantly increasing performance. Programming in parallel, therefore, means reducing the execution time of a program, using resources more efficiently, and being able to perform more complex operations that previously would have been prohibitive.

    Technological evolution of computers and parallelism

    Today, for many programmers, parallel programming is still unfamiliar, since it is a fairly recent technique. In fact, only a few years ago, all computers available to developers were equipped with a single Arithmetic Logic Unit (ALU), and serial programming was the only conceivable approach. The program instructions were executed one at a time in a sequential manner (see Figure 1.1):

    Figure 1.1: Serial execution

    Many of you will in fact remember the characteristics of a computer being generally indicated by the frequency of the processor in Hz, which gives an idea of how many instructions can be executed per second. The power of a computer was primarily measured by its clock frequency: the higher this value, the faster the programs ran.

    The concept of parallelism emerged gradually with the evolution of the hardware inside computers. Until the 1980s, computers were very limited: they ran one program at a time, instruction after instruction, in a strictly sequential manner. It is clear that in such a technological environment, the concept of parallelism could not even be imagined.

    With the advent of the Intel 80386 processor, the possibility was introduced for the computer to interrupt the execution of one program in order to work on another. Consequently, concepts such as pre-emptive multitasking and time-slicing were born. This technological advance introduced a pseudo-parallelism effect, since the user saw multiple programs working at the same time. With the subsequent Intel 80486 processor, the situation was further improved by introducing a pipeline system based on the subdivision of programs into subtasks, which were performed independently, alternating between the various programs. Furthermore, the internal architecture made it possible, for the first time, to assemble several different instructions (even from different programs) and execute them in an interleaved fashion (though not truly simultaneously). And this is where the real development of concurrent programming took place. The instructions of the different subtasks are interleaved so that each can be completed as soon as possible (see Figure 1.2):

    Figure 1.2: Concurrent execution

    This situation went on for over a decade, with the release of increasingly powerful processor models able to work at higher frequencies than the previous ones. But it soon ran into a series of problems and physical limitations. Increasing the execution frequency also means increasing the heat generated and the consequent energy consumption. It was clear that the race for higher frequencies would soon reach its limits.

    And that's how processors took a leap of innovation, with the introduction of cores into their design. Cores, also known as logical processors, simulate the presence of multiple processors within a single CPU, resulting in multi-core CPUs. In practice, one could have a multiprocessor computer capable of executing instructions from different programs simultaneously, in parallel. It is therefore in the early 2000s that parallel programming developed, giving developers the possibility to simultaneously execute different parts of the same program.

    CPU, cores, threads, and processes

    To understand the concepts that we will cover in this book, it is essential to first know what threads and processes are, and how they are closely related to the way the CPU and its cores execute code.

    These are not abstract concepts, but real entities existing in our operating system. So to get familiar with them we can go and take a look directly at our operating system. For example, if you are working on Windows, open the Task Manager and click on the Performance tab.

    You will get a window very similar to the one shown in Figure 1.3 where it is possible to monitor in real-time the consumption of the various resources, such as the CPU, memory, and Wi-Fi network:

    Figure 1.3: Task manager in Windows

    In addition, various pieces of information, such as the number of processes and currently running threads, are also shown. On the right, some characteristics of the system we are working on, such as the number of cores, are listed.

    If, on the other hand, you work on Linux systems such as Ubuntu, you can obtain equivalent information by running the following command from the terminal:

    $ top

    A screen very similar to the one shown in Figure 1.4 will appear:

    Figure 1.4: top on Ubuntu terminal

    As we can see, at the top all the resources in use are shown with their values, which are updated continuously. In the lower part, there is a list of all active processes in the operating system. As you can see, each process is identified by a unique number, the process identification number (PID).

    Since Linux systems are much more flexible and powerful, especially thanks to the numerous shell commands, we can also monitor all the threads related to every single process. For this purpose, we will use a more specific command to monitor processes: ps.

    $ ps -T -p <pid>

    The -T option indicates that the threads related to the process will be shown. The PID of the process that you want to monitor in detail is passed with the -p option. In my case, choosing for example the process with PID 2176, I will get the result shown in Figure 1.5, in which all the threads of that process are shown with their identification number, SPID:

    Figure 1.5: ps command results on Ubuntu terminal

    The Central Processing Unit (CPU) is the real brain of our computer, and it is basically where our code is processed. The CPU is characterized by cycles, that is, the time units the CPU uses to perform an elementary operation. The power of a CPU is often indicated by the frequency of cycles per second (see the 2.87 GHz speed value in Figure 1.3).

    The CPU can have one (single-core CPU) or multiple cores (multi-core CPU) inside. Cores are the data execution units within the CPU. Each core is capable of running multiple processes. A process is essentially a program that runs on the machine and to which a section of memory is reserved. Furthermore, each process can in turn start other processes (subprocesses), or run one (the MainThread) or more threads within it. A diagram of all this is shown in Figure 1.6:

    Figure 1.6: CPU, core, process, and threads
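
    To get a first feel for these entities directly from Python, the following minimal sketch (using only the standard library; the printed labels are just illustrative) shows the number of cores seen by the operating system, the PID of the current process, and the threads active inside it:

    import os
    import threading

    # Number of CPU cores visible to the operating system
    print("CPU cores:", os.cpu_count())

    # The identification number (PID) of the process running this script
    print("Current PID:", os.getpid())

    # The threads currently alive inside this process:
    # a freshly started script normally shows only the MainThread
    for t in threading.enumerate():
        print("Thread:", t.name)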

    Threads, in turn, can be considered lightweight subprocesses that run concurrently within a single process. Like processes, threads also have a series of mechanisms that manage their synchronization, data exchange, and state transitions during their execution (ready, running, and blocked).

    This is the general framework that we must have in mind to better understand how the processes and threads within our machines operate and consequently model the programming in parallel in the best possible way.

    Concurrent and parallel programming

    There is often confusion between concurrency and parallelism and it is not uncommon for the two terms to be used interchangeably, but this is incorrect. The two concepts, although closely related, are different in the context of parallel programming, and it is very important to understand the differences.

    Let's start with what the two concepts have in common. Both concurrency and parallelism come into play when we have a program that must perform multiple tasks at the same time. And this is precisely the meaning of concurrency.

    Concurrency means managing (and not executing) multiple tasks at the same time, but they won’t necessarily run simultaneously.

    So a program that has to perform several tasks at the same time can do so even by processing only one task at a time. As soon as it has finished executing the instructions relating to a task, or a portion of it (a subtask), the program will move on to the next task, and so on. One task after another, alternating between them, will be concluded until the program has completed its work. If it helps, you can think of the tasks as competing with each other for execution.

    So in this case, even if our computer has a single-core CPU, a concurrent program can easily run (see Figure 1.7):

    Figure 1.7: Concurrency in a single core CPU

    From the outside, the user will see several tasks being performed simultaneously, but internally, only one task at a time will be executed in the CPU.

    But concurrent programming also extends to multi-core CPUs or multi-processor computers. In this case, you could have a concurrency scenario like the following:

    Figure 1.8: Concurrency in a multi-core CPU

    As we can see in Figure 1.8, things get more complicated. Since there are multiple processing units (multiple cores), subtasks can be assigned to each and executed simultaneously. We, therefore, have the phenomenon of parallelism.

    Parallelism means performing multiple tasks at the same time simultaneously.

    Hence parallelism is a special case of concurrent programming.

    Parallelism occurs when a program assigns each task to a CPU core so that each of them can be processed simultaneously, that is, in parallel, as shown in Figure 1.9:

    Figure 1.9: Parallelism in a multi-core CPU

    Hence, parallelism requires hardware with multiple processing units, essentially a multi-core CPU. On a single-core CPU, concurrency can be achieved, but parallelism cannot.

    Threads and processes in Python for concurrent and parallel models

    Having now understood the difference between concurrent programming and parallel programming, let’s take it a step further. In many programming languages, it is common practice to associate threads with concurrency and processes with parallelism. In fact, these two entities of the operating system will incorporate the two different functionalities of concurrency and parallelism.

    As far as Python is concerned, however, it is good to divide these cases into two distinct programming models. In fact, threads in Python don't behave exactly like threads in the operating system. Threads in Python cannot run simultaneously, and therefore cannot operate in parallel. Working with threads in Python is like working with a single-core CPU, even when that is not actually the case.

    Python thread problem: the GIL

    The fact that threads in Python, unlike other programming languages, cannot execute simultaneously on two different cores is closely linked to the Python interpreter itself. The reference interpreter on which Python code has always run, CPython, turned out during its implementation not to be fully thread-safe. That is, when multiple threads tried to access a shared object (memory is shared between the threads), the interpreter could end up in an inconsistent state, due to the phenomenon of the race condition. To avoid this huge problem, the Global Interpreter Lock (GIL) was included within the interpreter. The Python designers therefore made the choice that, within a process, only one thread can be executed at a time, eliminating the parallelism of this type of entity (no multithreading).

    The GIL is acquired by only one thread at a time, while all the other threads wait. As soon as the thread has finished its task, the GIL is released and is then acquired by the next thread. There is, therefore, real concurrent execution. Concurrent programs are generally less costly in terms of resources than parallel programs, as creating new processes is much more expensive than creating threads. It should be borne in mind, however, that the acquisition and release of the lock slow down the execution of the entire program.
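
    To get a feel for the effect of the GIL, consider the following minimal sketch (the function name and the value of N are illustrative, and timings will vary from machine to machine): a purely CPU-bound function is run twice sequentially and then by two threads at once. On CPython, the threaded version offers no real speedup, because only one thread at a time can hold the GIL:

    import threading
    import time

    N = 20_000_000

    def count_down(n):
        # Purely CPU-bound work: no I/O, no waiting
        while n > 0:
            n -= 1

    # Sequential execution of the two tasks
    start = time.perf_counter()
    count_down(N)
    count_down(N)
    print(f"Sequential: {time.perf_counter() - start:.2f} s")

    # The same two tasks assigned to two threads
    t1 = threading.Thread(target=count_down, args=(N,))
    t2 = threading.Thread(target=count_down, args=(N,))
    start = time.perf_counter()
    t1.start(); t2.start()
    t1.join(); t2.join()
    # On CPython the two timings are typically comparable, and the
    # threaded version can even be slower due to lock contention
    print(f"Two threads: {time.perf_counter() - start:.2f} s")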

    But things are not that bad. In fact, later we will see how to adapt this peculiarity of Python threads to parallel programming models. Furthermore, many external libraries do not rely on the GIL, since they have been implemented in other languages such as C and Fortran, and therefore take advantage of internal mechanisms that use multithreading. One of these libraries is NumPy, a fundamental library for numerical computation in Python.

    Elimination of GIL to achieve multithreading

    As for the possibility of removing the GIL from the Python interpreter, it has always been a hot topic. However, this has become increasingly difficult over time, since it would be hard to remove the GIL without breaking many official and third-party packages and modules used in Python.

    Another possibility could be to use Python implementations other than CPython. The most widespread of these, PyPy, famous for its greater performance, has unfortunately also implemented a GIL very similar to that of CPython. Instead, Jython, a version of Python implemented in Java, and IronPython, implemented on .NET, do not have a GIL, can make use of multithreading, and can therefore take advantage of the presence of multiple cores or processors.

    Threads versus processes in Python

    Summarizing then, threads and processes are the tools that Python provides us for the implementation of programs in concurrent and parallel form, respectively.

    In Table 1.1, you can see some characteristics of the two entities compared with each other and which must be taken into account during programming.

    Table 1.1: Threads versus processes in Python

    Concurrency and parallelism in Python

    Therefore, for concurrent programming in Python, taking into account the behavior of the threads in this language, we can correct the definition of concurrency previously given, eliminating the possibility of parallelism.

    Concurrency means managing multiple tasks at the same time, but they won’t necessarily run simultaneously.

    So we can imagine concurrent programming in Python with threads that each perform their tasks independently and in competition with each other for execution. They will alternate with each other in the general flow of execution until they are completed, as shown in Figure 1.10:

    Figure 1.10: Concurrency in Python
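
    Where threads do pay off in Python is with tasks that spend most of their time waiting, since the GIL is released during blocking I/O. The following minimal sketch simulates five I/O-bound tasks with time.sleep (the function name and durations are illustrative); their waiting periods overlap, so the total time is roughly one second instead of five:

    import threading
    import time

    def fetch(task_id):
        # Simulate a blocking I/O operation (e.g. a network call);
        # while a thread waits, the GIL is released and other threads run
        time.sleep(1)
        print(f"task {task_id} done")

    start = time.perf_counter()
    threads = [threading.Thread(target=fetch, args=(i,)) for i in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"Elapsed: {time.perf_counter() - start:.2f} s")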

    For parallel programming in Python, on the other hand, processes are the right tool for executing tasks simultaneously, that is, in parallel. Each of them will be assigned a task, and all together, at the same time, they will execute the instructions inside them until the completion of the program, as shown in Figure 1.11:

    Figure 1.11: Parallelism in Python
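
    By contrast, here is a minimal sketch of parallelism with processes, using the standard multiprocessing module (the worker function and its workload are illustrative). On a machine with at least four cores, the four CPU-bound tasks run at the same time, in parallel, each in its own process with its own interpreter and its own GIL:

    import multiprocessing
    import time

    def heavy_work(n):
        # CPU-bound task executed in a separate process
        total = 0
        for i in range(n):
            total += i * i

    if __name__ == "__main__":
        start = time.perf_counter()
        procs = [multiprocessing.Process(target=heavy_work, args=(5_000_000,))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(f"Elapsed: {time.perf_counter() - start:.2f} s")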

    It is therefore clear that while in other programming languages the terms concurrent and parallel can lead to confusion, in Python, concurrency and parallelism do not share the same aspect of simultaneity and are two clearly distinct concepts.

    The light concurrency with greenlets

    As we have just seen, concurrency finds in threads a valid tool for implementing its programming models.

    But in addition to threads, Python offers another possible alternative: greenlets. From the point of view of concurrency, using greenlets or threads is equivalent, because in Python threads are never executed in parallel, and therefore with this programming language both work perfectly well for concurrent programming. But the creation and management of greenlets are much less expensive in terms of resources than threads. This is why their use in programming is referred to as light concurrency. For this reason, greenlets are often used when you need to manage a large number of simple I/O operations, such as what happens in web servers. We will see how to create and manage greenlets with a few simple examples later in the book.
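
    As a first taste, here is a minimal sketch (assuming the third-party greenlet package is installed, for example with pip install greenlet) in which two greenlets explicitly hand control to each other; the names task_a and task_b are just illustrative:

    # Requires the third-party package: pip install greenlet
    from greenlet import greenlet

    def task_a():
        print("A: step 1")
        gr_b.switch()          # hand control over to the other greenlet
        print("A: step 2")

    def task_b():
        print("B: step 1")
        gr_a.switch()          # hand control back
        print("B: step 2")     # never reached in this example

    gr_a = greenlet(task_a)
    gr_b = greenlet(task_b)
    gr_a.switch()
    # Prints: A: step 1, B: step 1, A: step 2; when task_a finishes,
    # control returns to the main greenlet, so B: step 2 is skipped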

    Parallel programming with Python

    Having understood the role that threads and processes can play in Python, we can now delve into parallel programming as it relates specifically to the Python language.

    In this language, therefore, parallel programming is expressed exclusively on processes. A program is then divided into several parallelizable subtasks which are each assigned to a different process. Within each of them, we can therefore choose whether to perform the various steps synchronously or asynchronously.

    Synchronous and asynchronous programming

    In this book, and in much of the online documentation regarding parallel programming, the terms synchronous and asynchronous, sometimes abbreviated to sync and async, are used frequently. In all these cases, we are referring to two different programming models.

    Instinctively, when we implement a program in parallel or concurrently with multiple processes or threads, it comes naturally to us to structure it synchronously. This is because generally we all come from a serial programming background and tend to think that way. That is, in the presence of two or more processes (but they could be threads as well as simple functions within a program), a process (PROCESS 1 in Figure 1.12) goes on with its execution up to a point where it performs an external call, passing the execution to another process to obtain a service, a calculation, or any other operation. The other process (PROCESS 2 in Figure 1.12) will run to complete its task and will then return the outcome of the service to the initial process, which has been waiting in the meantime. Once the necessary result has been obtained, the initial process will resume its execution:

    Figure 1.12: Synchronous programming

    In reality, asynchronous programming models are much more efficient than synchronous ones, both in terms of the use of computing resources and in terms of the amount of time spent running the program. Let's look at the previous case again, but this time asynchronously, as shown in Figure 1.13:

    Figure 1.13: Asynchronous programming

    As in the synchronous case, the initial process (PROCESS 1 in Figure 1.13) continues its execution until the call that starts a second process (PROCESS 2 in Figure 1.13). But this time, the initial process does not interrupt its execution to wait for the completion of the second process. It continues moving forward with its execution, regardless of when and how it will obtain the outcome of the second process.

    As we can guess, asynchronous programming gives us an advantage in the many cases in which we would otherwise waste a lot of time waiting for operations that require an external response or a long execution time. It is therefore important to know both models well if you want to make the most of the potential of parallel programming.

    As for its practical implementation, although it may not yet feel completely intuitive, it is perfectly possible: all programming languages have internal mechanisms that allow these models to be implemented. We will cover asynchronous programming in depth in Chapter 6, Maximizing Performance with GPU Programming using CUDA.
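
    In Python, the standard asyncio module is the most direct way to express the asynchronous model. The following minimal sketch (the coroutine names and delays are illustrative) launches two coroutines whose waiting times overlap, so the total elapsed time is about two seconds instead of three:

    import asyncio
    import time

    async def worker(name, delay):
        # await suspends this coroutine without blocking the others
        await asyncio.sleep(delay)
        print(f"{name} finished after {delay} s")

    async def main():
        start = time.perf_counter()
        # Run the two coroutines asynchronously and wait for both
        await asyncio.gather(worker("task 1", 2), worker("task 2", 1))
        print(f"Elapsed: {time.perf_counter() - start:.2f} s")

    asyncio.run(main())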

    Map and reduce

    A scheme widely used in parallel programming is that of Map-Reduce, which is mainly based on two phases:

    Mapping

    Reducing

    The first phase, mapping, is based on subdividing the work to be carried out by a program into several parts (tasks) and then assigning them to different processes that execute them simultaneously, that is, in parallel. The execution of each process usually produces a result, so there is a subsequent phase, after the one strictly linked to parallel execution, in which all the results are recombined together: the reducing phase. Figure 1.14 shows a diagram that can help you better understand what has just been said:

    Figure 1.14: Map and reduce pattern in parallel programming
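
    In Python, a simple way to sketch this pattern is with the Pool object from the multiprocessing module: the map phase distributes the chunks of data to the worker processes, and the reduce phase recombines the partial results (the chunking scheme and the sum-of-squares computation are purely illustrative):

    import multiprocessing

    def partial_sum(chunk):
        # Map phase: each process computes a partial result on its own chunk
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Split the data into four chunks, one per worker process
        chunks = [data[i::4] for i in range(4)]
        with multiprocessing.Pool(processes=4) as pool:
            partials = pool.map(partial_sum, chunks)
        # Reduce phase: recombine the partial results into the final one
        total = sum(partials)
        print(total)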

    CPU-bound and I/O-bound operations

    During the design phase of a parallel program, however, attention must be paid to the individual tasks, evaluating whether among them there may be some that require too long an execution time. If this were the case, there would be a significant performance degradation, as all the other processes would be waiting for the mapping phase to complete. In fact, to pass to the reducing phase, all the results obtained from each process are required. Let us consider a case like the one represented in Figure 1.15, where one of the parallel processes requires much more execution time than the others. In this case, all the other processes will be waiting to continue the execution and to pass their results to the reducing phase, and parallel programming no longer delivers a performance benefit:

    Figure 1.15: Parallel programming with low performance

    So in these cases, we have to consider the various operations that are performed in every single process (task). These tasks could include operations such as reading a file or calling an external web service. In this case, the process will have to wait for a response from an external device, and therefore the execution times will be dominated by the waiting rather than by the computation itself.
