Learning Data Mining with Python - Second Edition
About this ebook
- Use a wide variety of Python libraries for practical data mining purposes.
- Learn how to find, manipulate, analyze, and visualize data using Python.
- Step-by-step instructions on data mining techniques with Python that have real-world applications.
If you are a Python programmer who wants to get started with data mining, then this book is for you. If you are a data analyst who wants to leverage the power of Python to perform data mining efficiently, this book will also help you. No previous experience with data mining is expected.
Robert Layton
Dr. Robert Layton is a Research Fellow at the Internet Commerce Security Laboratory (ICSL) at Federation University Australia. Dr Layton’s research focuses on attribution technologies on the internet, including automating open source intelligence (OSINT) and attack attribution. Dr Layton’s research has led to improvements in authorship analysis methods for unstructured text, providing indirect methods of linking profiles on social media.
Title Page
Learning Data Mining with Python
Second Edition
Use Python to manipulate data and build predictive models
Robert Layton
BIRMINGHAM - MUMBAI
Copyright
Learning Data Mining with Python
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2015
Second edition: April 2017
Production reference: 1250417
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78712-678-7
www.packtpub.com
Credits
About the Author
Robert Layton is a data scientist investigating data-driven applications to businesses across a number of sectors. He received a PhD investigating cybercrime analytics from the Internet Commerce Security Laboratory at Federation University Australia, before moving into industry, starting his own data analytics company dataPipeline (www.datapipeline.com.au). Next, he created Eureaktive (www.eureaktive.com.au), which works with tech-based startups on developing their proof-of-concepts and early-stage prototypes. Robert also runs www.learningtensorflow.com, which is one of the world's premier tutorial websites for Google's TensorFlow library.
Robert is an active member of the Python community, having used Python for more than 8 years. He has presented at PyConAU for the last four years and works with Python Charmers to provide Python-based training for businesses and professionals from a wide range of organisations.
Robert can be best reached via Twitter @robertlayton
Thank you to my family for supporting me on this journey, thanks to all the readers of revision 1 for making it a success, and thanks to Matty for his assistance behind-the-scenes with the book.
About the Reviewer
Asad Ahamad is a data enthusiast who loves working with data to solve challenging problems.
He completed his master's degree in Industrial Mathematics with Computer Application at Jamia Millia Islamia, New Delhi. He admires mathematics and always tries to use it to maximize profit for businesses.
He has solid experience in data mining, machine learning, and data science, and has worked for various multinationals in India. He mainly uses R and Python to perform data wrangling and modeling. He is fond of using open source tools for data analysis.
He is an active social media user. Feel free to connect with him on Twitter @asadtaj88.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787126781.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Getting Started with Data Mining
Introducing data mining
Using Python and the Jupyter Notebook
Installing Python
Installing Jupyter Notebook
Installing scikit-learn
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Downloading the example code
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
Summary
Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset
Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing
Standard pre-processing
Putting it all together
Pipelines
Summary
Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Setting parameters in Random Forests
Applying random forests
Engineering new features
Summary
Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Overall methodology
Dealing with the movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
Understanding the Apriori algorithm and its implementation
Looking into the basics of the Apriori algorithm
Implementing the Apriori algorithm
Extracting association rules
Evaluating the association rules
Summary
Features and scikit-learn Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation
Principal Component Analysis
Creating your own transformer
The transformer API
Implementing a Transformer
Unit testing
Putting it all together
Summary
Social Media Insight using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words models
n-gram features
Other text features
Naive Bayes
Understanding Bayes' theorem
Naive Bayes algorithm
How it works
Applying of Naive Bayes
Extracting word counts
Converting dictionaries to a matrix
Putting it all together
Evaluation using the F1-score
Getting useful features from models
Summary
Follow Recommendations Using Graph Mining
Loading the dataset
Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
Summary
Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Training and classifying
Back-propagation
Predicting words
Improving accuracy using a dictionary
Ranking mechanisms for word similarity
Putting it all together
Summary
Authorship Attribution
Attributing documents to authors
Applications and use cases
Authorship attribution
Getting the data
Using function words
Counting function words
Classifying with function words
Support Vector Machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
The Enron dataset
Accessing the Enron dataset
Creating a dataset loader
Putting it all together
Evaluation
Summary
Clustering News Articles
Trending topic discovery
Using a web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Extracting the content
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
Implementation
Summary
Object Detection in Images using Deep Neural Networks
Object classification
Use cases
Application scenario
Deep neural networks
Intuition
Implementing deep neural networks
An Introduction to TensorFlow
Using Keras
Convolutional Neural Networks
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application
Getting the data
Creating the neural network
Putting it all together
Summary
Working with Big Data
Big data
Applications of big data
MapReduce
The intuition behind MapReduce
A word count example
Hadoop MapReduce
Applying MapReduce
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes
Putting it all together
Training on Amazon's EMR infrastructure
Summary
Next Steps...
Getting Started with Data Mining
Scikit-learn tutorials
Extending the Jupyter Notebook
More datasets
Other Evaluation Metrics
More application ideas
Classifying with scikit-learn Estimators
Scalability with the nearest neighbor
More complex pipelines
Comparing classifiers
Automated Learning
Predicting Sports Winners with Decision Trees
More complex features
Dask
Research
Recommending Movies Using Affinity Analysis
New datasets
The Eclat algorithm
Collaborative Filtering
Extracting Features with Transformers
Adding noise
Vowpal Wabbit
word2vec
Social Media Insight Using Naive Bayes
Spam detection
Natural language processing and part-of-speech tagging
Discovering Accounts to Follow Using Graph Mining
More complex algorithms
NetworkX
Beating CAPTCHAs with Neural Networks
Better (worse?) CAPTCHAs
Deeper networks
Reinforcement learning
Authorship Attribution
Increasing the sample size
Blogs dataset
Local n-grams
Clustering News Articles
Clustering Evaluation
Temporal analysis
Real-time clusterings
Classifying Objects in Images Using Deep Learning
Mahotas
Magenta
Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
W.I.L.L
More resources
Kaggle competitions
Coursera
Preface
The second revision of Learning Data Mining with Python was written with the programmer in mind. It aims to introduce data mining to a wide range of programmers, as I feel that this is critically important to all those in the computer science field. Data mining is quickly becoming the building block of the next generation of Artificial Intelligence systems. Even if you don't find yourself building these systems, you will be using them, interfacing with them, and being guided by them. Understanding the process behind them is important and helps you get the best out of them.
The second revision builds upon the first. Many of the chapters and exercises are similar, although new concepts are introduced and exercises are expanded in scope. Those who have read the first revision should be able to move quickly through the book, pick up new knowledge along the way, and engage with the extra activities proposed. Those new to the book are encouraged to take their time, do the exercises, and experiment. Feel free to break the code to understand it, and reach out if you have any questions.
As this is a book aimed at programmers, we assume that you have some knowledge of programming and of Python itself. For this reason, there is little explanation of what the Python code itself is doing, except in cases where it is ambiguous.
What this book covers
Chapter 1, Getting Started with Data Mining, introduces the technologies we will be using, along with implementing two basic algorithms to get started.
Chapter 2, Classifying with scikit-learn Estimators, covers classification, a key form of data mining. You’ll also learn about some structures for making your data mining experimentation easier to perform.
Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses them to predict sports winners by creating useful features.
Chapter 4, Recommending Movies using Affinity Analysis, looks at the problem of recommending products based on past experience, and introduces the Apriori algorithm.
Chapter 5, Features and scikit-learn Transformers, introduces more types of features you can create, and how to work with different datasets.
Chapter 6, Social Media Insight using Naive Bayes, uses the Naive Bayes algorithm to automatically parse text-based information from the social media website Twitter.
Chapter 7, Follow Recommendations Using Graph Mining, applies cluster analysis and network analysis to find good people to follow on social media.
Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images, and then training neural networks to find words and letters in those images.
Chapter 9, Authorship Attribution, looks at determining who wrote a given document, by extracting text-based features and using Support Vector Machines.
Chapter 10, Clustering News Articles, uses the k-means clustering algorithm to group together news articles based on their content.
Chapter 11, Object Detection in Images using Deep Neural Networks, determines what type of object is being shown in an image, by applying deep neural networks.
Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.
Appendix, Next Steps, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.
What you need for this book
It should come as no surprise that you’ll need a computer, or access to one, to complete the book. The computer should be reasonably modern, but it doesn’t need to be overpowered. Any modern processor (from about 2010 onwards) and 4 gigabytes of RAM will suffice, and you can probably run almost all of the code on a slower system too.
The exception here is with the final two chapters. In these chapters, I step through using Amazon Web Services (AWS) for running the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don’t want to pay for those services, the tools used can all be set up on a local computer, but you will definitely need a modern system to run them. A processor built in 2012 or later and more than 4 GB of RAM are necessary.
I recommend the Ubuntu operating system, but the code should work well on Windows, Macs, or any other Linux variant. You may need to consult the documentation for your system to get some things installed though.
In this book, I use pip for installing code, which is a command line tool for installing Python libraries. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads
I have also tested all code using Python 3. Most of the code examples work on Python 2 with no changes. If you run into any problems and can’t get around them, send an email and we can offer a solution.
Who this book is for
This book is for programmers who want to get started in data mining in an application-focused manner.
If you haven’t programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn’t introduce programming, nor does it give too much time to explaining the actual implementation (in-code) of how to type out the instructions. That said, once you go through the basics, you should be able to come back to this book fairly quickly – there is no need to be an expert programmer first!
I highly recommend that you have some Python programming experience. If you don’t, feel free to jump in, but you might want to take a look at some Python code first, possibly focused on tutorials using the IPython notebook. Writing programs in the IPython notebook works a little differently than other methods, such as writing a Java program in a fully-fledged IDE.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: The next lines of code read the link and assign it to the dataset_filename function.
A block of code is set as follows:
Any command-line input or output is written as follows:
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Data-Mining-with-Python-Second-Edition. The benefit of the GitHub repository is that any issues with the code, including problems relating to software version changes, will be tracked, and the code there will include changes from readers around the world. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
To avoid indentation issues, please use the code bundle to run the code in the IDE instead of copying directly from the PDF.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
Getting Started with Data Mining
We are collecting information about our world on a scale that has never been seen before in the history of humanity. Along with this trend, we are now placing more importance on the use of this information in everyday life. We now expect our computers to translate web pages into other languages, predict the weather with high accuracy, suggest books we would like, and diagnose our health issues. These expectations will grow in the future, both in breadth of application and in efficacy. Data mining is a methodology that we can employ to train computers to make decisions with data, and it forms the backbone of many of today's high-tech systems.
The Python programming language is growing fast in popularity, and for good reason. It gives the programmer flexibility, it has many modules to perform different tasks, and Python code is usually more readable and concise than code in other languages. There is a large and active community of researchers, practitioners, and beginners using Python for data mining.
In this chapter, we will introduce data mining with Python. We will cover the following topics:
What is data mining and where can we use it?
Setting up a Python-based environment to perform data mining
An example of affinity analysis, recommending products based on purchasing habits
An example of a classic classification problem, predicting a plant's species based on its measurements
Introducing data mining
Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.
Data mining is part algorithm design, statistics, engineering, optimization, and computer science. However, in addition to these base skills, we also need to apply domain knowledge (expert knowledge) of the area in which we are applying data mining. Domain knowledge is critical for going from good results to great results. Applying data mining effectively usually requires this domain-specific knowledge to be integrated with the algorithms.
Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.
Data mining applications involve creating datasets and tuning the algorithm, as explained in the following steps:
We start our data mining process by creating a dataset, describing an aspect of the real world. Datasets comprise the following two aspects:
Samples: These are objects in the real world, such as a book, photograph, animal, person, or any other object. Samples are also referred to as observations, records or rows, among other naming conventions.
Features: These are descriptions or measurements of the samples in our dataset. Features could be the length, the frequency of a specific word, the number of legs on an animal, the date it was created, and so on. Features are also referred to as variables, columns, attributes, or covariates, again among other naming conventions.
The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.
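To make the two aspects of a dataset concrete, here is a minimal sketch in plain Python. The animals, the feature names, and every value are invented purely for illustration, and the helper feature_column is a hypothetical name rather than anything from the book. Each row is one sample and each column is one feature.

```python
# Hypothetical toy dataset: each row is one sample (an animal) and each
# column is one feature. All names and values are made up for illustration.
feature_names = ["number_of_legs", "body_length_cm"]
samples = [
    [4, 60.0],   # sample 0
    [2, 25.0],   # sample 1
    [4, 110.0],  # sample 2
]

n_samples = len(samples)
n_features = len(feature_names)


def feature_column(rows, names, name):
    """Return the values of one feature across every sample."""
    index = names.index(name)
    return [row[index] for row in rows]


leg_counts = feature_column(samples, feature_names, "number_of_legs")
print(n_samples, "samples,", n_features, "features")
print("number_of_legs:", leg_counts)
```

Reading along a row describes one sample; reading down a column gives one feature measured across all samples.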
As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:
As explained above, the next step involves tuning the parameters of our algorithm. Consider a simple algorithm: if the height is more than x, the person is tall; otherwise, they are short. Our training algorithm will then look at the data and decide on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm. Anyone else is considered short by this measure. This then lets our algorithm classify new data, such as a person with a height of 167 cm, even though we may have never seen a person with that measurement before.
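The threshold idea can be sketched in a few lines of Python. This is only one simple way to pick x, not necessarily the approach used later in the book. The heights and labels are invented for illustration and are not the book's data, and the names classify and train_threshold are hypothetical.

```python
# Hypothetical training data: (height in cm, label). These values are
# invented for illustration and are not the data from the book.
training_data = [
    (155, "short"), (160, "short"), (165, "short"), (168, "short"),
    (172, "tall"), (178, "tall"), (183, "tall"), (190, "tall"),
]


def classify(height_cm, threshold):
    """Apply the rule: taller than the threshold means 'tall'."""
    return "tall" if height_cm > threshold else "short"


def train_threshold(data):
    """Try the midpoint between each pair of neighbouring heights and
    keep the candidate that misclassifies the fewest training samples."""
    heights = sorted(h for h, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(heights, heights[1:])]

    def errors(threshold):
        return sum(classify(h, threshold) != label for h, label in data)

    return min(candidates, key=errors)


threshold = train_threshold(training_data)
print("learned threshold:", threshold)
print("167 cm ->", classify(167, threshold))
```

On this made-up data, the only candidate with no training errors is the midpoint between 168 and 172, so the learned threshold is 170 cm and a new person of 167 cm is classified as short.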
In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.
In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.
Using Python and the Jupyter Notebook
In this section, we will cover installing Python and the environment that we will use for most of the book, the Jupyter Notebook. Furthermore, we will install the NumPy module, which we will use for the first set of examples.
The Jupyter Notebook was, until very recently, called the IPython Notebook. You'll notice the term in web searches for the project. Jupyter is the new name, representing a broadening of the project beyond using just Python.
Installing Python
The Python programming language is a fantastic, versatile, and easy-to-use language.
For this book, we will be using Python 3.5, which is available for your system from the Python Organization's website https://www.python.org/downloads/. However, I recommend that you use Anaconda to install Python, which you can download from the official website at https://www.continuum.io/downloads.
There will be two major versions to choose from, Python 3.5 and Python 2.7. Remember to download and install Python 3.5, which is the version tested throughout this book. Follow the installation instructions on that website for your system. If you have a strong reason to learn version 2 of Python, then do so by downloading the Python 2.7 version. Keep in mind that some code may not work as in the book, and some workarounds may be needed.
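If you are unsure which interpreter you ended up with, a small check like the following can help. It is a sketch using only the standard library; the function name is_supported and the wording of the warning are assumptions made for this example.

```python
import sys


def is_supported(version_info=sys.version_info, minimum=(3, 5)):
    """Return True when the interpreter is at least the minimum version."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
if not is_supported():
    # Older interpreters (for example 2.7) may need workarounds.
    print("Warning: the examples were tested on Python 3.5; "
          "some code may not work as shown.")
```

Running this at the top of a notebook makes version problems visible early, before an example fails for a less obvious reason.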
In this book, I assume that you have some knowledge of programming and Python itself. You do not need to be an expert with Python to complete this book, although a good level of knowledge will help. I will not be explaining general code structures and syntax in this book, except where it is different from what is considered normal Python coding practice.
If you do not have any experience with programming, I recommend that you pick up the Learning Python book from Packt Publishing, or the book Dive Into Python, available online at www.diveintopython3.net
The Python organization also maintains two lists of online tutorials for those new to Python:
For non-programmers who want to learn to program through the Python language:
https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
For programmers who already know how to program, but need to learn Python specifically:
https://wiki.python.org/moin/BeginnersGuide/Programmers
Windows users will need to set an environment variable to use Python from the command line, whereas on other systems Python will usually be immediately executable. We set it in the following steps:
First, find where you installed Python 3 on your computer; the default location is C:\Python35.
Next, enter this command into the command line (cmd program): set PATH=%PATH%;C:\Python35
Remember to change the C:\Python35 if your installation of Python is in a different folder.
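One way to confirm that the command line can find Python after these steps is to ask Python itself. The sketch below uses only the standard library; shutil.which looks a command up on the system's PATH. The helper name find_on_path is hypothetical, and the command name "python" is an assumption, since on some systems the executable is called python3.

```python
import shutil
import sys


def find_on_path(command="python"):
    """Return the full path of a command found on PATH, or None if missing."""
    return shutil.which(command)


# The interpreter that is running this script right now.
print("Current interpreter:", sys.executable)

location = find_on_path("python")
if location is None:
    print("'python' was not found on PATH; check the environment variable.")
else:
    print("'python' resolves to:", location)
```

If the second line reports a different folder from the one you installed into, another Python installation is probably earlier in the search order.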
Once you have Python running on your system, you should be able to open a command prompt and run