Python High Performance - Second Edition
()
About this ebook
- Identify the bottlenecks in your applications and solve them using the best profiling techniques
- Write efficient numerical code in NumPy, Cython, and Pandas
- Adapt your programs to run on multiple processors and machines with parallel programming
The book is aimed at Python developers who want to improve the performance of their application. Basic knowledge of Python is expected
Gabriele Lanaro
Gabriele Lanaro is a PhD student in Chemistry at the University of British Columbia, in the field of Molecular Simulation. He writes high performance Python code to analyze chemical systems in large-scale simulations. He is the creator of Chemlab—a high performance visualization software in Python—and emacs-for-python—a collection of emacs extensions that facilitate working with Python code in the emacs text editor. This book builds on his experience in writing scientific Python code for his research and personal projects.
Related to Python High Performance - Second Edition
Related ebooks
Building Python Real-Time Applications with Storm Rating: 0 out of 5 stars0 ratingsDistributed Computing with Python Rating: 0 out of 5 stars0 ratingsMachine Learning with Spark - Second Edition Rating: 0 out of 5 stars0 ratingsBuilding Web Applications with Flask Rating: 0 out of 5 stars0 ratingsLearning Data Mining with Python Rating: 0 out of 5 stars0 ratingsLearning Python Design Patterns - Second Edition Rating: 0 out of 5 stars0 ratingsFlask Blueprints Rating: 0 out of 5 stars0 ratingsMastering Flask Rating: 0 out of 5 stars0 ratingsTest-Driven Machine Learning Rating: 0 out of 5 stars0 ratingsModular Programming with Python Rating: 0 out of 5 stars0 ratingsLearning Data Mining with Python - Second Edition Rating: 0 out of 5 stars0 ratingsMastering IPython 4.0 Rating: 0 out of 5 stars0 ratingsPython Unlocked Rating: 0 out of 5 stars0 ratingsApache Spark 2.x Cookbook Rating: 0 out of 5 stars0 ratingsMastering Python Design Patterns Rating: 0 out of 5 stars0 ratingsLearning Apache Thrift Rating: 0 out of 5 stars0 ratingsAdvance Core Python Programming: Begin your Journey to Master the World of Python (English Edition) Rating: 4 out of 5 stars4/5Learning Cython Programming - Second Edition Rating: 0 out of 5 stars0 ratingsLearning Apache Mahout Classification Rating: 0 out of 5 stars0 ratingsMastering Scala Machine Learning Rating: 0 out of 5 stars0 ratingsScala for Machine Learning Rating: 0 out of 5 stars0 ratingsNeo4j High Performance Rating: 0 out of 5 stars0 ratingsNeo4j Cookbook Rating: 0 out of 5 stars0 ratingsMastering Python High Performance Rating: 0 out of 5 stars0 ratingsFeature Engineering Bookcamp Rating: 0 out of 5 stars0 ratingsLucene 4 Cookbook Rating: 0 out of 5 stars0 ratingsParallel Programming with Python Rating: 0 out of 5 stars0 ratingsData Analysis with Python and PySpark Rating: 0 out of 5 stars0 ratings
Programming For You
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL Rating: 4 out of 5 stars4/5Python: For Beginners A Crash Course Guide To Learn Python in 1 Week Rating: 4 out of 5 stars4/5Java for Beginners: A Crash Course to Learn Java Programming in 1 Week Rating: 5 out of 5 stars5/5Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer. Rating: 5 out of 5 stars5/5Python Machine Learning By Example Rating: 4 out of 5 stars4/5Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps Rating: 4 out of 5 stars4/5SQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days Rating: 5 out of 5 stars5/5Learn SQL in 24 Hours Rating: 5 out of 5 stars5/5HTML & CSS: Learn the Fundaments in 7 Days Rating: 4 out of 5 stars4/5PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project Rating: 5 out of 5 stars5/5Coding All-in-One For Dummies Rating: 4 out of 5 stars4/5101 Amazing Nintendo NES Facts: Includes facts about the Famicom Rating: 4 out of 5 stars4/5Linux: Learn in 24 Hours Rating: 5 out of 5 stars5/5Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1 Rating: 5 out of 5 stars5/5Modern C++ for Absolute Beginners: A Friendly Introduction to C++ Programming Language and C++11 to C++20 Standards Rating: 0 out of 5 stars0 ratingsPython Projects for Beginners: A Ten-Week Bootcamp Approach to Python Programming Rating: 0 out of 5 stars0 ratingsGrokking Algorithms: An illustrated guide for programmers and other curious people Rating: 4 out of 5 stars4/5Pokemon Go: Guide + 20 Tips and Tricks You Must Read Hints, Tricks, Tips, Secrets, Android, iOS Rating: 5 out of 5 stars5/5Web Designer's Idea Book, Volume 4: Inspiration from the Best Web Design Trends, Themes and Styles Rating: 4 out of 5 stars4/5Beginning Programming with Python For Dummies Rating: 3 out of 5 stars3/5
Reviews for Python High Performance - Second Edition
0 ratings0 reviews
Book preview
Python High Performance - Second Edition - Gabriele Lanaro
Title Page
Python High Performance
Second Edition
Build robust applications by implementing concurrent and distributed processing techniques
Gabriele Lanaro
BIRMINGHAM - MUMBAI
Copyright
Python High Performance
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2013
Second edition: May 2017
Production reference: 2250517
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78728-289-6
www.packtpub.com
Credits
About the Author
Dr. Gabriele Lanaro has been conducting research to study the formation and growth of crystals using medium and large-scale computer simulations. In 2017, he obtained his PhD in theoretical chemistry. His interests span machine learning, numerical computing visualization, and web technologies. He has a sheer passion for good software and is the author of the chemlab and chemview open source packages. In 2013, he authored the first edition of the book "High Performance Python Programming".
I'd like to acknowledge the support from Packt editors, including Vikas Tiwari. I would also like to thank my girlfriend, Harani, who had to tolerate the way-too-long writing nights, and friends who provided company and support throughout. Also, as always, I’d love to thank my parents for giving me the opportunity to pursue my ambitions.
Lastly, I would like to thank Blenz coffee for powering the execution engine of this book through electricity and caffeine.
About the Reviewer
Will Brennan is a C++/Python developer based in London with previous experience in writing molecular dynamics simulations. He is currently working on high-performance image processing and machine learning applications. You can refer to his repositories at https://github.com/WillBrennan.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787282899.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Benchmarking and Profiling
Designing your application
Writing tests and benchmarks
Timing your benchmark
Better tests and benchmarks with pytest-benchmark
Finding bottlenecks with cProfile
Profile line by line with line_profiler
Optimizing our code
The dis module
Profiling memory usage with memory_profiler
Summary
Pure Python Optimizations
Useful algorithms and data structures
Lists and deques
Dictionaries
Building an in-memory search index using a hash map
Sets
Heaps
Tries
Caching and memoization
Joblib
Comprehensions and generators
Summary
Fast Array Operations with NumPy and Pandas
Getting started with NumPy
Creating arrays
Accessing arrays
Broadcasting
Mathematical operations
Calculating the norm
Rewriting the particle simulator in NumPy
Reaching optimal performance with numexpr
Pandas
Pandas fundamentals
Indexing Series and DataFrame objects
Database-style operations with Pandas
Mapping
Grouping, aggregations, and transforms
Joining
Summary
C Performance with Cython
Compiling Cython extensions
Adding static types
Variables
Functions
Classes
Sharing declarations
Working with arrays
C arrays and pointers
NumPy arrays
Typed memoryviews
Particle simulator in Cython
Profiling Cython
Using Cython with Jupyter
Summary
Exploring Compilers
Numba
First steps with Numba
Type specializations
Object mode versus native mode
Numba and NumPy
Universal functions with Numba
Generalized universal functions
JIT classes
Limitations in Numba
The PyPy project
Setting up PyPy
Running a particle simulator in PyPy
Other interesting projects
Summary
Implementing Concurrency
Asynchronous programming
Waiting for I/O
Concurrency
Callbacks
Futures
Event loops
The asyncio framework
Coroutines
Converting blocking code into non-blocking code
Reactive programming
Observables
Useful operators
Hot and cold observables
Building a CPU monitor
Summary
Parallel Processing
Introduction to parallel programming
Graphic processing units
Using multiple processes
The Process and Pool classes
The Executor interface
Monte Carlo approximation of pi
Synchronization and locks
Parallel Cython with OpenMP
Automatic parallelism
Getting started with Theano
Profiling Theano
Tensorflow
Running code on a GPU
Summary
Distributed Processing
Introduction to distributed computing
An introduction to MapReduce
Dask
Directed Acyclic Graphs
Dask arrays
Dask Bag and DataFrame
Dask distributed
Manual cluster setup
Using PySpark
Setting up Spark and PySpark
Spark architecture
Resilient Distributed Datasets
Spark DataFrame
Scientific computing with mpi4py
Summary
Designing for High Performance
Choosing a suitable strategy
Generic applications
Numerical code
Big data
Organizing your source code
Isolation, virtual environments, and containers
Using conda environments
Virtualization and Containers
Creating docker images
Continuous integration
Summary
Preface
The Python programming language has seen a huge surge in popularity in recent years, thanks to its intuitive, fun syntax, and its vast array of top-quality third-party libraries. Python has been the language of choice for many introductory and advanced university courses as well as for numerically intense fields, such as the sciences and engineering. Its primary applications also lies in machine learning, system scripting, and web applications.
The reference Python interpreter, CPython, is generally regarded as inefficient when compared to lower-level languages, such as C, C++, and Fortran. CPython’s poor performance lies in the fact that the program instructions are processed by an interpreter rather than being compiled to efficient machine code. While using an interpreter has several advantages, such as portability and the additional compilation step, it does introduce an extra layer of indirection between the program and the machine, which causes a less efficient execution.
Over the years, many strategies have been developed to overcome CPython's performance shortcomings. This book aims to fill this gap and will teach how to consistently achieve strong performance out of your Python programs.
This book will appeal to a broad audience as it covers both the optimization of numerical and scientific codes as well as strategies to improve the response times of web services and applications.
The book can be read cover-to-cover ; however, chapters are designed to be self-contained so that you can skip to a section of interest if you are already familiar with the previous topics.
What this book covers
Chapter 1, Benchmark and Profiling, will teach you how to assess the performance of Python programs and practical strategies on how to identify and isolate the slow sections of your code.
Chapter 2, Pure Python Optimizations, discusses how to improve your running times by order of magnitudes using the efficient data structures and algorithms available in the Python standard library and pure-Python third-party modules.
Chapter 3, Fast Array Operations with NumPy and Pandas, is a guide to the NumPy and Pandas packages. Mastery of these packages will allow you to implement fast numerical algorithms with an expressive, concise interface.
Chapter 4, C Performance with Cython, is a tutorial on Cython, a language that uses a Python-compatible syntax to generate efficient C code.
Chapter 5, Exploring Compilers, covers tools that can be used to compile Python to efficient machine code. The chapter will teach you how to use Numba, an optimizing compiler for Python functions, and PyPy, an alternative interpreter that can execute and optimize Python programs on the fly.
Chapter 6, Implementing Concurrency, is a guide to asynchronous and reactive programming. We will learn about key terms and concepts, and demonstrate how to write clean, concurrent code using the asyncio and RxPy frameworks.
Chapter 7, Parallel Processing, is an introduction to parallel programming on multi-core processors and GPUs. In this chapter, you will learn to achieve parallelism using the multiprocessing module and by expressing your code using Theano and Tensorflow.
Chapter 8, Distributed Processing, extends the content of the preceding chapter by focusing on running parallel algorithms on distributed systems for large-scale problems and big data. This chapter will cover the Dask, PySpark, and mpi4py libraries.
Chapter 9, Designing for High Performance, discusses general optimization strategies and best practices to develop, test, and deploy your high-performance Python applications.
What you need for this book
The software in this book is tested on Python version 3.5 and on Ubuntu version 16.04. However, majority of the examples can also be run on the Windows and Mac OS X operating systems.
The recommended way to install Python and the associated libraries is through the Anaconda distribution, which can be downloaded from https://www.continuum.io/downloads, for Linux, Windows, and Mac OS X.
Who this book is for
The book is aimed at Python developers who want to improve the performance of their application; basic knowledge of Python is expected.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: To summarize, we will implement a method called ParticleSimulator.evolve_numpy and benchmark it against the pure Python version, renamed as ParticleSimulator.evolve_python
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: On the right, clicking on the tab Callee Map will display a diagram of the function costs.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-High-Performance-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PythonHighPerformanceSecondEdition_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
Benchmarking and Profiling
Recognizing the slow parts of your program is the single most important task when it comes to speeding up your code. Luckily, in most cases, the code that causes the application to slow down is a very small fraction of the program. By locating those critical sections, you can focus on the parts that need improvement without wasting time in micro-optimization.
Profiling is the technique that allows us to pinpoint the most resource-intensive spots in an application. A profiler is a program that runs an application and monitors how long each function takes to execute, thus detecting the functions in which your application spends most of its time.
Python provides several tools to help us find these bottlenecks and measure important performance metrics. In this chapter, we will learn how to use the standard cProfile module and the line_profiler third-party package. We will also learn how to profile an application's memory consumption through the memory_profiler tool. Another useful tool that we will cover is KCachegrind, which can be used to graphically display the data produced by various profilers.
Benchmarks are small scripts used to assess the total execution time of your application. We will learn how to write benchmarks and how to accurately time your programs.
The list of topics we will cover in this chapter is as follows:
General principles of high performance programming
Writing tests and benchmarks
The Unix time command
The Python timeit module
Testing and benchmarking with pytest
Profiling your application
The cProfile standard tool
Interpreting profiling results with KCachegrind
line_profiler and memory_profiler tools
Disassembling Python code through the dis module
Designing your application
When designing a performance-intensive program, the very first step is to write your code without bothering with small optimizations:
Premature optimization is the root of all evil.
- Donald Knuth
In the early development stages, the design of the program can change quickly and may require large rewrites and reorganizations of the code base. By testing different prototypes without the burden of optimization, you are free to devote your time and energy to ensure that the program produces correct results and that the design is flexible. After all, who needs an application that runs fast but gives the wrong answer?
The mantras that you should remember when optimizing your code are as follows:
Make it run: We have to get the software in a working state, and ensure that it produces the correct results. This exploratory phase serves to better understand the application and to spot major design issues in the early stages.
Make it right: We want to ensure that the design of the program is solid. Refactoring should be done before attempting any performance optimization. This really helps separate the application into independent and cohesive units that are easier to maintain.
Make it fast: Once our program is working and is well structured, we can focus on performance optimization. We may also want to optimize memory usage if that constitutes an issue.
In this section, we will write and profile a particle simulator test application. The simulator is