Python Data Science Essentials
By Alberto Boschetti and Luca Massaron
About this ebook
- Quickly get familiar with data science using Python
- Save tons of time through this reference book with all the essential tools illustrated and explained
- Create effective data science projects and avoid common pitfalls with the help of examples and hints dictated by experience
If you are an aspiring data scientist and you have at least a working knowledge of data analysis and Python, this book will get you started in data science. Data analysts with experience of R or MATLAB will also find the book to be a comprehensive reference to enhance their data manipulation and machine learning skills.
Python Data Science Essentials - Boschetti Alberto
Table of Contents
Python Data Science Essentials
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. First Steps
Introducing data science and Python
Installing Python
Python 2 or Python 3?
Step-by-step installation
A glance at the essential Python packages
NumPy
SciPy
pandas
Scikit-learn
IPython
Matplotlib
Statsmodels
Beautiful Soup
NetworkX
NLTK
Gensim
PyPy
The installation of packages
Package upgrades
Scientific distributions
Anaconda
Enthought Canopy
PythonXY
WinPython
Introducing IPython
The IPython Notebook
Datasets and code used in the book
Scikit-learn toy datasets
The MLdata.org public repository
LIBSVM data examples
Loading data directly from CSV or text files
Scikit-learn sample generators
Summary
2. Data Munging
The data science process
Data loading and preprocessing with pandas
Fast and easy data loading
Dealing with problematic data
Dealing with big datasets
Accessing other data formats
Data preprocessing
Data selection
Working with categorical and textual data
A special type of data – text
Data processing with NumPy
NumPy's n-dimensional array
The basics of NumPy ndarray objects
Creating NumPy arrays
From lists to unidimensional arrays
Controlling the memory size
Heterogeneous lists
From lists to multidimensional arrays
Resizing arrays
Arrays derived from NumPy functions
Getting an array directly from a file
Extracting data from pandas
NumPy fast operation and computations
Matrix operations
Slicing and indexing with NumPy arrays
Stacking NumPy arrays
Summary
3. The Data Science Pipeline
Introducing EDA
Feature creation
Dimensionality reduction
The covariance matrix
Principal Component Analysis (PCA)
A variation of PCA for big data – RandomizedPCA
Latent Factor Analysis (LFA)
Linear Discriminant Analysis (LDA)
Latent Semantical Analysis (LSA)
Independent Component Analysis (ICA)
Kernel PCA
Restricted Boltzmann Machine (RBM)
The detection and treatment of outliers
Univariate outlier detection
EllipticEnvelope
OneClassSVM
Scoring functions
Multilabel classification
Binary classification
Regression
Testing and validating
Cross-validation
Using cross-validation iterators
Sampling and bootstrapping
Hyper-parameters' optimization
Building custom scoring functions
Reducing the grid search runtime
Feature selection
Univariate selection
Recursive elimination
Stability and L1-based selection
Summary
4. Machine Learning
Linear and logistic regression
Naive Bayes
The k-Nearest Neighbors
Advanced nonlinear algorithms
SVM for classification
SVM for regression
Tuning SVM
Ensemble strategies
Pasting by random samples
Bagging with weak ensembles
Random Subspaces and Random Patches
Sequences of models – AdaBoost
Gradient tree boosting (GTB)
Dealing with big data
Creating some big datasets as examples
Scalability with volume
Keeping up with velocity
Dealing with variety
A quick overview of Stochastic Gradient Descent (SGD)
A peek into Natural Language Processing (NLP)
Word tokenization
Stemming
Word Tagging
Named Entity Recognition (NER)
Stopwords
A complete data science example – text classification
An overview of unsupervised learning
Summary
5. Social Network Analysis
Introduction to graph theory
Graph algorithms
Graph loading, dumping, and sampling
Summary
6. Visualization
Introducing the basics of matplotlib
Curve plotting
Using panels
Scatterplots
Histograms
Bar graphs
Image visualization
Selected graphical examples with pandas
Boxplots and histograms
Scatterplots
Parallel coordinates
Advanced data learning representation
Learning curves
Validation curves
Feature importance
GBT partial dependence plot
Summary
Index
Python Data Science Essentials
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Production reference: 1240415
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-042-9
www.packtpub.com
Credits
Authors
Alberto Boschetti
Luca Massaron
Reviewers
Robert Dempsey
Daniel Frimer
Kevin Markham
Alberto Gonzalez Paje
Bastiaan Sjardin
Michele Usuelli
Zacharias Voulgaris, PhD
Commissioning Editor
Julian Ursell
Acquisition Editor
Subho Gupta
Content Development Editor
Merwyn D'souza
Technical Editor
Namrata Patil
Copy Editor
Vedangi Narvekar
Project Coordinator
Neha Bhatnagar
Proofreaders
Simran Bhogal
Faye Coulman
Safis Editing
Dan McMahon
Indexer
Priya Sane
Production Coordinator
Komal Ramchandani
Cover Work
Komal Ramchandani
About the Authors
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a PhD in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges involving natural language processing (NLP), machine learning, and probabilistic graph models every day. He is very passionate about his job and always tries to stay updated on the latest developments in data science technologies by attending meetups, conferences, and other events.
I would like to thank my family, my friends, and my colleagues. Also, a big thanks to the open source community.
Luca Massaron is a data scientist and marketing research director who specializes in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience in solving real-world problems and generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of web audience analysis in Italy to achieving the rank of a top 10 Kaggler, he has always been passionate about everything regarding data and analysis and about demonstrating the potentiality of data-driven knowledge discovery to both experts and nonexperts. Favoring simplicity over unnecessary sophistication, he believes that a lot can be achieved in data science by understanding its essentials.
To Yukiko and Amelia, for their loving patience. Roads go ever ever on, under cloud and under star, yet feet that wandering have gone turn at last to home afar.
About the Reviewers
Robert Dempsey is an experienced leader and technology professional specializing in delivering solutions and products to solve tough business challenges. His experience in forming and leading agile teams, combined with more than 14 years of experience in the field of technology, enables him to solve complex problems while always keeping the bottom line in mind.
Robert has founded and built three start-ups in technology and marketing, developed and sold two online applications, consulted for Fortune 500 and Inc. 500 companies, and spoken nationally and internationally on software development and agile project management.
He is currently the head of data operations at ARPC, an econometrics firm based in Washington, DC. In addition, he's the founder of Data Wranglers DC, a group dedicated to improving the craft of data wrangling, as well as a board member of Data Community DC.
In addition to spending time with his growing family, Robert geeks out on Raspberry Pis and Arduinos and automates most of his life with the help of hardware and software.
Daniel Frimer has been an advocate for the Python language for 2 years now. With a degree in applied and computational math sciences from the University of Washington, he has spearheaded various automation projects in the Python language involving natural language processing, data munging, and web scraping. In his side projects, he has dived into a deep analysis of NFL and NBA player statistics for his fantasy sports teams.
Daniel has recently started working in SaaS at a private company for online health insurance shopping called Array Health, in support of day-to-day data analysis and the perfection of the integration between consumers, employers, and insurers. He has also worked with data-centric teams at Amazon, Starbucks, and Atlas International.
Kevin Markham is a computer engineer, a data science instructor for General Assembly in Washington, DC, and the cofounder of Causetown, an online cause marketing platform for small businesses. He is passionate about teaching data science and machine learning and enjoys both Python and R. He founded Data School (http://dataschool.io) in order to provide in-depth educational resources that are accessible to data science novices. He has an active YouTube channel (http://youtube.com/dataschool) and can also be found on Twitter (@justmarkham).
Alberto Gonzalez Paje is an economist specializing in information management systems and data science. Educated in Spain and the Netherlands, he has developed an international career as a data analyst at companies such as Coca Cola, Accenture, Bestiario, and CartoDB. He focuses on business strategy, planning, control, and data analysis. He loves architecture, cartography, the Mediterranean way of life, and sports.
Bastiaan Sjardin is a data scientist and entrepreneur with a background in artificial intelligence, mathematics, and machine learning. He has an MSc degree in cognitive science and mathematical statistics from the University of Leiden. In the past 5 years, he has worked on a wide range of data science projects. He is a frequent Community TA with Coursera for the Social Network Analysis course at the University of Michigan. His programming languages of choice are R and Python. Currently, he is the cofounder of Quandbee (www.quandbee.com), a company specializing in machine learning applications.
Michele Usuelli is a data scientist living in London, specializing in R and Hadoop. He has an MSc in mathematical engineering and statistics, and he has worked in fast-paced, growing environments, such as a big data start-up in Milan, the new pricing and analytics division of a big publishing company, and a leading R-based company. He is the author of R Machine Learning Essentials, Packt Publishing, which is a book that shows how to solve business challenges with data-driven solutions. He has also written articles on R-bloggers and is active on StackOverflow.
Zacharias Voulgaris, PhD, is a data scientist with machine learning expertise. His first degree was in production engineering and management, while his post-graduate studies focused on information systems (MSc) and machine learning (PhD). He has worked as a researcher at Georgia Tech and as a data scientist at Elavon Inc. He currently works for Microsoft as a program manager, and he is involved in a variety of big data projects in the field of web search. He has written several research papers and a number of web articles on data science-related topics and has authored his own book titled Data Scientist: The Definitive Guide to Becoming a Data Scientist.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
Data science is a relatively new knowledge domain that requires the successful integration of linear algebra, statistical modelling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.
The Python programming language, having conquered the scientific community during the last decade, is now an indispensable tool for the data science practitioner and a must-have tool for every aspiring data scientist. Python will offer you a fast, reliable, cross-platform, mature environment for data analysis, machine learning, and algorithmic problem solving. Whatever stopped you before from mastering Python for data science applications will be easily overcome by our easy step-by-step and example-oriented approach that will help you apply the most straightforward and effective Python tools to both demonstrative and real-world datasets.
Leveraging your existing knowledge of Python syntax and constructs (but don't worry, we have some Python tutorials if you need to acquire more knowledge on the language), this book will start by introducing you to the process of setting up your essential data science toolbox. Then, it will guide you through all the data munging and preprocessing phases. A necessary amount of time will be spent in explaining the core activities related to transforming, fixing, exploring, and processing data. Then, we will demonstrate advanced data science operations in order to enhance critical information, set up an experimental pipeline for variable and hypothesis selection, optimize hyper-parameters, and use cross-validation and testing in an effective way.
Finally, we will complete the overview by presenting you with the main machine learning algorithms, graph analysis technicalities, and all the visualization instruments that can make your life easier when it comes to presenting your results.
In this walkthrough, which is structured as a data science project, you will always be accompanied by clear code and simplified examples to help you understand the underlying mechanics and real-world datasets. It will also give you hints dictated by experience to help you immediately operate on your current projects. Are you ready to start? We are sure that you are ready to take the first step towards a long and incredibly rewarding journey.
What this book covers
Chapter 1, First Steps, introduces you to all the basic tools (command shell for interactive computing, libraries, and datasets) necessary to immediately start on data science using Python.
Chapter 2, Data Munging, explains how to upload the data to be analyzed by applying alternative techniques when the data is too big for the computer to handle. It introduces all the key data manipulation and transformation techniques.
Chapter 3, The Data Science Pipeline, offers advanced explorative and manipulative techniques, enabling sophisticated data operations to create and reduce predictive features, spot anomalous cases and apply validation techniques.
Chapter 4, Machine Learning, guides you through the most important learning algorithms that are available in the Scikit-learn library, demonstrating their practical applications and pointing out the key values to be checked and the parameters to be tuned in order to get the best out of each machine learning technique.
Chapter 5, Social Network Analysis, elaborates the practical and effective skills that are required to handle data that represents social relations or interactions.
Chapter 6, Visualization, completes the data science overview with basic and intermediate graphical representations. They are indispensable if you want to visually represent complex data structures and machine learning processes and results.
Chapter 7, Strengthen Your Python Foundations, covers a few Python examples and tutorials focused on the key features of the language that are indispensable to know in order to work on data science projects.
This chapter is not part of the book; it has to be downloaded from the Packt Publishing website at https://www.packtpub.com/sites/default/files/downloads/0429OS_Chapter-07.pdf.
What you need for this book
Python and all the data science tools mentioned in the book, from IPython to Scikit-learn, are free of charge and can be freely downloaded from the Internet. To run the code that accompanies the book, you need a computer that uses Windows, Linux, or Mac OS operating systems. The book will introduce you step-by-step to the process of installing the Python interpreter and all the tools and data that you need to run the examples.
Who this book is for
This book builds on the core skills that you already have, enabling you to become an efficient data science practitioner. Therefore, it assumes that you know the basics of programming and statistics.
The code examples provided in the book won't require you to have a mastery of Python, but we will assume that you know at least the basics of Python scripting, lists and dictionary data structures, and how class objects work. Before starting, you can quickly acquire such skills by spending a few hours on the online courses that we are going to suggest in the first chapter. You can also use the tutorial provided on the Packt Publishing website.
No advanced data science concepts are necessary though, as we will provide you with the information that is essential to understand all the core concepts that are used by the examples in the book.
Summarizing, this book is for the following:
Novice and aspiring data scientists with limited Python experience and a working knowledge of data analysis, but no specific expertise of data science algorithms
Data analysts who are proficient in statistical modeling using R or MATLAB tools and who would like to exploit Python to perform data science operations
Developers and programmers who intend to expand their knowledge and learn about data manipulation and machine learning
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: When inspecting the linear model, first check the coef_ attribute.
A block of code is set as follows:
from sklearn import datasets
iris = datasets.load_iris()
Since we will be using IPython Notebooks for most of the examples, expect to always have an input (marked as In:) and often an output (marked as Out:) from the cell containing the block of code. On your computer, you just have to input the code after the In: and check whether the results correspond to the Out: content:
In: clf.fit(X, y)
Out: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
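The In:/Out: pair above uses clf, X, and y without defining them. A minimal sketch of how they might be set up, assuming the Iris dataset from the earlier block and scikit-learn's SVC classifier with default parameters (both are assumptions made for illustration, not necessarily the setup used later in the book):

```python
from sklearn import datasets
from sklearn.svm import SVC

# Load the Iris toy dataset that ships with scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target  # feature matrix and class labels

# Assumption: a support vector classifier with default settings,
# chosen only because the Out: line above shows an SVC object
clf = SVC()
clf.fit(X, y)

# Predict the class of the first five training samples
predictions = clf.predict(X[:5])
print(predictions)
```

The text that an interactive session prints after clf.fit(X, y) depends on the installed scikit-learn version, so the Out: line may not match character for character; recent releases print a much shorter representation of the estimator.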
When a command should be given in the terminal command line, you'll find the command with the prefix $>; if it's for the Python REPL instead, it will be preceded by >>>:
$> python
>>> import sys
>>> print(sys.version_info)
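The same version check can also be run from a script rather than the REPL. The following is a small sketch, written so that it works under both Python 2 and Python 3; the variable names version_label and is_python3 are illustrative only:

```python
import sys

# sys.version_info behaves like a tuple:
# (major, minor, micro, releaselevel, serial)
major, minor = sys.version_info[0], sys.version_info[1]

# Build a short label such as "2.7" or "3.x" from the first two fields
version_label = "{0}.{1}".format(major, minor)
is_python3 = major >= 3

print("Running Python " + version_label)
```

Checking the major version this way is useful when an example has to behave differently depending on the interpreter it runs on.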
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get