Learning Data Mining with Python - Second Edition

Ebook575 pages5 hours

Learning Data Mining with Python - Second Edition

Name: Learning Data Mining with Python - Second Edition
Author: Robert Layton
ISBN: 9781787129566

By Robert Layton

Rating: 0 out of 5 stars

()

Read preview

About this ebook

About This Book

Use a wide variety of Python libraries for practical data mining purposes.
Learn how to find, manipulate, analyze, and visualize data using Python.
Step-by-step instructions on data mining techniques with Python that have real-world applications.

Who This Book Is For

If you are a Python programmer who wants to get started with data mining, then this book is for you. If you are a data analyst who wants to leverage the power of Python to perform data mining efficiently, this book will also help you. No previous experience with data mining is expected.

Skip carousel

LanguageEnglish

PublisherPackt Publishing

Release dateApr 27, 2017

ISBN9781787129566

Author

Robert Layton

Dr. Robert Layton is a Research Fellow at the Internet Commerce Security Laboratory (ICSL) at Federation University Australia. Dr Layton’s research focuses on attribution technologies on the internet, including automating open source intelligence (OSINT) and attack attribution. Dr Layton’s research has led to improvements in authorship analysis methods for unstructured text, providing indirect methods of linking profiles on social media.

Related to Learning Data Mining with Python - Second Edition

Related ebooks

Skip carousel

Mastering Social Media Mining with Python
Ebook
Mastering Social Media Mining with Python
byMarco Bonzanini
Rating: 5 out of 5 stars
5/5
Python Data Analysis - Second Edition
Ebook
Python Data Analysis - Second Edition
byArmando Fandango
Rating: 0 out of 5 stars
0 ratings
Python Data Science Essentials
Ebook
Python Data Science Essentials
byBoschetti Alberto
Rating: 0 out of 5 stars
0 ratings
Hands-On Data Analysis with Pandas: Efficiently perform data collection, wrangling, analysis, and visualization using Python
Ebook
Hands-On Data Analysis with Pandas: Efficiently perform data collection, wrangling, analysis, and visualization using Python
byStefanie Molin
Rating: 0 out of 5 stars
0 ratings
Mastering Python for Data Science
Ebook
Mastering Python for Data Science
bySamir Madhavan
Rating: 3 out of 5 stars
3/5
Interactive Applications Using Matplotlib
Ebook
Interactive Applications Using Matplotlib
byBenjamin V. Root
Rating: 0 out of 5 stars
0 ratings
Getting Started with Python Data Analysis
Ebook
Getting Started with Python Data Analysis
byVo.T.H Phuong
Rating: 0 out of 5 stars
0 ratings
Learning pandas - Second Edition
Ebook
Learning pandas - Second Edition
byHeydt Michael
Rating: 4 out of 5 stars
4/5
Mastering Python Regular Expressions
Ebook
Mastering Python Regular Expressions
byVictor Romero
Rating: 5 out of 5 stars
5/5
Pandas 1.x Cookbook - Second Edition: Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python, 2nd Edition
Ebook
Pandas 1.x Cookbook - Second Edition: Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python, 2nd Edition
byMatt Harrison
Rating: 5 out of 5 stars
5/5
Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Ebook
Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
byAnish Chapagain
Rating: 0 out of 5 stars
0 ratings
Mastering Python Data Analysis
Ebook
Mastering Python Data Analysis
byMagnus Vilhelm Persson
Rating: 0 out of 5 stars
0 ratings
Python Web Scraping - Second Edition
Ebook
Python Web Scraping - Second Edition
byKatharine Jarmul
Rating: 5 out of 5 stars
5/5
Practical Data Science Cookbook - Second Edition
Ebook
Practical Data Science Cookbook - Second Edition
byTony Ojeda
Rating: 0 out of 5 stars
0 ratings
Regression Analysis with Python
Ebook
Regression Analysis with Python
byBoschetti Alberto
Rating: 0 out of 5 stars
0 ratings
Python Data Analysis
Ebook
Python Data Analysis
byIvan Idris
Rating: 4 out of 5 stars
4/5
Python Data Science Essentials - Second Edition
Ebook
Python Data Science Essentials - Second Edition
byBoschetti Alberto
Rating: 4 out of 5 stars
4/5
Web Scraping with Python
Ebook
Web Scraping with Python
byRichard Lawson
Rating: 4 out of 5 stars
4/5
NumPy Essentials
Ebook
NumPy Essentials
byLeo (Liang-Huan) Chin
Rating: 0 out of 5 stars
0 ratings
Python Unlocked
Ebook
Python Unlocked
byTigeraniya Arun
Rating: 0 out of 5 stars
0 ratings
Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries
Ebook
Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries
byPurna Chander Rao. Kathula
Rating: 5 out of 5 stars
5/5
Data Analysis with Python: Introducing NumPy, Pandas, Matplotlib, and Essential Elements of Python Programming (English Edition)
Ebook
Data Analysis with Python: Introducing NumPy, Pandas, Matplotlib, and Essential Elements of Python Programming (English Edition)
byRituraj Dixit
Rating: 0 out of 5 stars
0 ratings
R High Performance Programming
Ebook
R High Performance Programming
byAloysius Lim
Rating: 4 out of 5 stars
4/5
Python for Secret Agents
Ebook
Python for Secret Agents
bySteven F. Lott
Rating: 0 out of 5 stars
0 ratings
Learning IPython for Interactive Computing and Data Visualization - Second Edition
Ebook
Learning IPython for Interactive Computing and Data Visualization - Second Edition
byRossant Cyrille
Rating: 2 out of 5 stars
2/5
Designing Machine Learning Systems with Python
Ebook
Designing Machine Learning Systems with Python
byDavid Julian
Rating: 0 out of 5 stars
0 ratings
Python Machine Learning - Third Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition
Ebook
Python Machine Learning - Third Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition
bySebastian Raschka
Rating: 5 out of 5 stars
5/5
Python Tools for Visual Studio
Ebook
Python Tools for Visual Studio
byMartino Sabia
Rating: 0 out of 5 stars
0 ratings
Learning Jupyter
Ebook
Learning Jupyter
byDan Toomey
Rating: 5 out of 5 stars
5/5
Modular Programming with Python
Ebook
Modular Programming with Python
byErik Westra
Rating: 0 out of 5 stars
0 ratings

Computers For You

Skip carousel

Slenderman: Online Obsession, Mental Illness, and the Violent Crime of Two Midwestern Girls
Ebook
Slenderman: Online Obsession, Mental Illness, and the Violent Crime of Two Midwestern Girls
byKathleen Hale
Rating: 4 out of 5 stars
4/5
The Invisible Rainbow: A History of Electricity and Life
Ebook
The Invisible Rainbow: A History of Electricity and Life
byArthur Firstenberg
Rating: 4 out of 5 stars
4/5
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Ebook
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
byWalter Shields
Rating: 4 out of 5 stars
4/5
Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics
Ebook
Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics
byGary Smith
Rating: 4 out of 5 stars
4/5
Elon Musk
Ebook
Elon Musk
byWalter Isaacson
Rating: 4 out of 5 stars
4/5
The Simulation Hypothesis: An MIT Computer Scientist Shows Why AI, Quantum Physics and Eastern Mystics All Agree We Are In a Video Game
Ebook
The Simulation Hypothesis: An MIT Computer Scientist Shows Why AI, Quantum Physics and Eastern Mystics All Agree We Are In a Video Game
byRizwan Virk
Rating: 5 out of 5 stars
5/5
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
bySteven Cooper
Rating: 4 out of 5 stars
4/5
CompTIA IT Fundamentals (ITF+) Study Guide: Exam FC0-U61
Ebook
CompTIA IT Fundamentals (ITF+) Study Guide: Exam FC0-U61
byQuentin Docter
Rating: 0 out of 5 stars
0 ratings
Procreate for Beginners: Introduction to Procreate for Drawing and Illustrating on the iPad
Ebook
Procreate for Beginners: Introduction to Procreate for Drawing and Illustrating on the iPad
byAaron Smith
Rating: 0 out of 5 stars
0 ratings
Alan Turing: The Enigma: The Book That Inspired the Film The Imitation Game - Updated Edition
Ebook
Alan Turing: The Enigma: The Book That Inspired the Film The Imitation Game - Updated Edition
byAndrew Hodges
Rating: 4 out of 5 stars
4/5
The ChatGPT Millionaire Handbook: Make Money Online With the Power of AI Technology
Ebook
The ChatGPT Millionaire Handbook: Make Money Online With the Power of AI Technology
byTJ Books
Rating: 0 out of 5 stars
0 ratings
The Hacker Crackdown: Law and Disorder on the Electronic Frontier
Ebook
The Hacker Crackdown: Law and Disorder on the Electronic Frontier
byBruce Sterling
Rating: 4 out of 5 stars
4/5
101 Awesome Builds: Minecraft® Secrets from the World's Greatest Crafters
Ebook
101 Awesome Builds: Minecraft® Secrets from the World's Greatest Crafters
byTriumph Books
Rating: 4 out of 5 stars
4/5
Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
Ebook
Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
byCea West
Rating: 5 out of 5 stars
5/5
CompTIA Security+ Practice Questions
Ebook
CompTIA Security+ Practice Questions
byIP Specialist
Rating: 2 out of 5 stars
2/5
Machine Learning for Beginners: An Introduction for Beginners, Why Machine Learning Matters Today and How Machine Learning Networks, Algorithms, Concepts and Neural Networks Really Work
Ebook
Machine Learning for Beginners: An Introduction for Beginners, Why Machine Learning Matters Today and How Machine Learning Networks, Algorithms, Concepts and Neural Networks Really Work
bySteven Cooper
Rating: 4 out of 5 stars
4/5
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Ebook
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
bySeth Stephens-Davidowitz
Rating: 4 out of 5 stars
4/5
Childhood Unplugged: Practical Advice to Get Kids Off Screens and Find Balance
Ebook
Childhood Unplugged: Practical Advice to Get Kids Off Screens and Find Balance
byKatherine Johnson Martinko
Rating: 0 out of 5 stars
0 ratings
How to Write a Book: An 11-Step Process to Build Habits, Stop Procrastinating, Fuel Self-Motivation, Quiet Your Inner Critic, Bust Through Writer's Block, & Let Your Creative Juices Flow (Short Read)
Ebook
How to Write a Book: An 11-Step Process to Build Habits, Stop Procrastinating, Fuel Self-Motivation, Quiet Your Inner Critic, Bust Through Writer's Block, & Let Your Creative Juices Flow (Short Read)
byDavid Kadavy
Rating: 5 out of 5 stars
5/5
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
Ebook
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
byNigel Tillery
Rating: 0 out of 5 stars
0 ratings
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
byArthur T. Brooks
Rating: 0 out of 5 stars
0 ratings
The Professional Voiceover Handbook: Voiceover training, #1
Ebook
The Professional Voiceover Handbook: Voiceover training, #1
byPeter Baker
Rating: 5 out of 5 stars
5/5
People Skills for Analytical Thinkers
Ebook
People Skills for Analytical Thinkers
byGilbert Eijkelenboom
Rating: 5 out of 5 stars
5/5
Going Text: Mastering the Command Line
Ebook
Going Text: Mastering the Command Line
byBrian Schell
Rating: 4 out of 5 stars
4/5
Dark Aeon: Transhumanism and the War Against Humanity
Ebook
Dark Aeon: Transhumanism and the War Against Humanity
byJoe Allen
Rating: 5 out of 5 stars
5/5
Grokking Algorithms: An illustrated guide for programmers and other curious people
Ebook
Grokking Algorithms: An illustrated guide for programmers and other curious people
byAditya Bhargava
Rating: 4 out of 5 stars
4/5
AP Computer Science Principles Premium, 2024: 6 Practice Tests + Comprehensive Review + Online Practice
Ebook
AP Computer Science Principles Premium, 2024: 6 Practice Tests + Comprehensive Review + Online Practice
bySeth Reichelson
Rating: 0 out of 5 stars
0 ratings
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
Ebook
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
byCea West
Rating: 4 out of 5 stars
4/5
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
Ebook
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
byDavid Mayer
Rating: 0 out of 5 stars
0 ratings
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
Ebook
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
byAlex Parkinson
Rating: 4 out of 5 stars
4/5

Related podcast episodes

Skip carousel

Harnessing Python for Research: Scientific Applications of Python with Michael Kennedy: Still scrabbling with Excel? Consider Python language uses, says programmer and podcaster Michael Kennedy. A general programming language that is easy to use in multiple environments, Python programming is limitless and has numerous open source...
Podcast episode
Harnessing Python for Research: Scientific Applications of Python with Michael Kennedy: Still scrabbling with Excel? Consider Python language uses, says programmer and podcaster Michael Kennedy. A general programming language that is easy to use in multiple environments, Python programming is limitless and has numerous open source...
byFinding Genius Podcast
0 ratings
0% found this document useful
Advantages of Completing Small Python Projects
Podcast episode
Advantages of Completing Small Python Projects
byThe Real Python Podcast
0 ratings
0% found this document useful
Unraveling Python's Syntax to Its Core With Brett Cannon
Podcast episode
Unraveling Python's Syntax to Its Core With Brett Cannon
byThe Real Python Podcast
100%
100% found this document useful
Learning Python Through Errors
Podcast episode
Learning Python Through Errors
byThe Real Python Podcast
0 ratings
0% found this document useful
Improving the Learning Experience on Real Python
Podcast episode
Improving the Learning Experience on Real Python
byThe Real Python Podcast
0 ratings
0% found this document useful
Measuring Your Python Learning Progress
Podcast episode
Measuring Your Python Learning Progress
byThe Real Python Podcast
100%
100% found this document useful
Building a Platform Game With Arcade and Covering Python News Monthly
Podcast episode
Building a Platform Game With Arcade and Covering Python News Monthly
byThe Real Python Podcast
0 ratings
0% found this document useful
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
Podcast episode
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
byDataFramed
0 ratings
0% found this document useful
#059 - 10 Python clean code tips drawn from code reviews
Podcast episode
#059 - 10 Python clean code tips drawn from code reviews
byPybites Podcast
0 ratings
0% found this document useful
Anaconda + Pyston and more: with Peter Wang, CEO of Anaconda
Podcast episode
Anaconda + Pyston and more: with Peter Wang, CEO of Anaconda
byPractical AI: Machine Learning, Data Science
0 ratings
0% found this document useful
Episode 19 (Python for Data Science - Python Files - Scripts and Modules)
Podcast episode
Episode 19 (Python for Data Science - Python Files - Scripts and Modules)
byHow to Data (Joshiverse- Journey of a Budding Data Scientist)
0 ratings
0% found this document useful
Going Beyond the Basic Stuff With Python and Al Sweigart
Podcast episode
Going Beyond the Basic Stuff With Python and Al Sweigart
byThe Real Python Podcast
0 ratings
0% found this document useful
MLA 018 Descript: (Optional episode) just showcasing a cool application using machine learning Dept uses Descript for some of their podcasting. I'm using it like a maniac, I think they're surprised at how into it I am. Check out the transcript & see how it...
Podcast episode
MLA 018 Descript: (Optional episode) just showcasing a cool application using machine learning Dept uses Descript for some of their podcasting. I'm using it like a maniac, I think they're surprised at how into it I am. Check out the transcript & see how it...
byMachine Learning Guide
0 ratings
0% found this document useful
Planning a Faster Future at the Python Language Summit
Podcast episode
Planning a Faster Future at the Python Language Summit
byThe Real Python Podcast
0 ratings
0% found this document useful
Unlocking The Power of Data Lineage In Your Platform with OpenLineage: An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.
Podcast episode
Unlocking The Power of Data Lineage In Your Platform with OpenLineage: An interview with Julien Le Dem about the OpenLineage specification and the opportunity that it offers for simplifying the tracking and analysis of data lineage across your data platform.
byData Engineering Podcast
0 ratings
0% found this document useful
Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks: A cross-over episode from The Machine Learning Podcast with the team from Deepchecks, exploring the challenges of testing and validating machine learning applications and their work to make it easier.
Podcast episode
Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks: A cross-over episode from The Machine Learning Podcast with the team from Deepchecks, exploring the challenges of testing and validating machine learning applications and their work to make it easier.
byThe Python Podcast.__init__
0 ratings
0% found this document useful
Graph Analytic Systems with Zachary Hanif - TWiML Talk #188: In this, the final episode of our Strata Data Conference series, we’re joined by Zachary Hanif, Director of Machine Learning at Capital One’s Center for Machine Learning. Zach led a session at Strata called “Network effects: Working with modern...
Podcast episode
Graph Analytic Systems with Zachary Hanif - TWiML Talk #188: In this, the final episode of our Strata Data Conference series, we’re joined by Zachary Hanif, Director of Machine Learning at Capital One’s Center for Machine Learning. Zach led a session at Strata called “Network effects: Working with modern...
byThe TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
0 ratings
0% found this document useful
Linear Programming, PySimpleGUI, and More
Podcast episode
Linear Programming, PySimpleGUI, and More
byThe Real Python Podcast
0 ratings
0% found this document useful
One Shot and Metric Learning - Quadruplet Loss (Machine Learning Dojo)
Podcast episode
One Shot and Metric Learning - Quadruplet Loss (Machine Learning Dojo)
byMachine Learning Street Talk (MLST)
0 ratings
0% found this document useful
Power Up Your Java Using Python With JPype - Episode 286: An interview with Karl Nelson about using the JPype library for bridging the Java and Python ecosystems for scientific computing
Podcast episode
Power Up Your Java Using Python With JPype - Episode 286: An interview with Karl Nelson about using the JPype library for bridging the Java and Python ecosystems for scientific computing
byThe Python Podcast.__init__
0 ratings
0% found this document useful
120: FastAPI & Typer - Sebastián Ramírez: Sebastián Ramírez is the developer behind FastAPI for Python REST APIs and Typer, for CLI applications. We discuss FastAPI, Typer, Swagger UI, interface design, autocompletion, and more.
Podcast episode
120: FastAPI & Typer - Sebastián Ramírez: Sebastián Ramírez is the developer behind FastAPI for Python REST APIs and Typer, for CLI applications. We discuss FastAPI, Typer, Swagger UI, interface design, autocompletion, and more.
byTest and Code
0 ratings
0% found this document useful
040: Graph Databases: Traditional relational databases like MySQL or Postgres are really good at providing many solutions to the problem of persisting state. But these types of database are really horrible at querying highly connected models in an efficient way. Graph datab...
Podcast episode
040: Graph Databases: Traditional relational databases like MySQL or Postgres are really good at providing many solutions to the problem of persisting state. But these types of database are really horrible at querying highly connected models in an efficient way. Graph datab...
byPHPRoundtable Podcast
0 ratings
0% found this document useful
Let's Talk About Natural Language Processing: This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of...
Podcast episode
Let's Talk About Natural Language Processing: This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of...
byData Skeptic
0 ratings
0% found this document useful
Getting Started in Python Cybersecurity and Forensics
Podcast episode
Getting Started in Python Cybersecurity and Forensics
byThe Real Python Podcast
0 ratings
0% found this document useful
Create Web Applications Using Only Python With Anvil
Podcast episode
Create Web Applications Using Only Python With Anvil
byThe Real Python Podcast
0 ratings
0% found this document useful
Making Automated Machine Learning More Accessible With EvalML: An interview with Angela Lin and Jeremy Shih about the open source EvalML framework for building automated machine learning workflows.
Podcast episode
Making Automated Machine Learning More Accessible With EvalML: An interview with Angela Lin and Jeremy Shih about the open source EvalML framework for building automated machine learning workflows.
byThe Python Podcast.__init__
100%
100% found this document useful
78: Mindset of a Rockstar Data Analyst w/ Trevor Tapscott: Our focus for this inspiring episode of AOF is mindset, especially if you want to be a standout data analyst! I have brought one of my first ever followers and day ones! Trevor Tapscott is a VP and Analytics Consultant at Wells Fargo and has been in...
Podcast episode
78: Mindset of a Rockstar Data Analyst w/ Trevor Tapscott: Our focus for this inspiring episode of AOF is mindset, especially if you want to be a standout data analyst! I have brought one of my first ever followers and day ones! Trevor Tapscott is a VP and Analytics Consultant at Wells Fargo and has been in...
byAnalytics on Fire
0 ratings
0% found this document useful
#1 Data Science, Past, Present and Future: Hilary Mason talks about the past, present, and future of data science with Hugo. Hilary is the VP of Research at Cloudera Fast Forward, a machine intelligence research company, and the data scientist in residence at Accel. If you want to hear about wh...
Podcast episode
#1 Data Science, Past, Present and Future: Hilary Mason talks about the past, present, and future of data science with Hugo. Hilary is the VP of Research at Cloudera Fast Forward, a machine intelligence research company, and the data scientist in residence at Accel. If you want to hear about wh...
byDataFramed
100%
100% found this document useful
Generators, Coroutines, and Learning Python Through Exercises
Podcast episode
Generators, Coroutines, and Learning Python Through Exercises
byThe Real Python Podcast
0 ratings
0% found this document useful
Crafting Interpreters With Bob Nystrom: Bob Nystrom is the author of Crafting Interpreters. I speak with Nystrom about building a programming language and an interpreter implementation for it. We talk about parsing, the difference between compiler and interpreters and a lot more. If you are...
Podcast episode
Crafting Interpreters With Bob Nystrom: Bob Nystrom is the author of Crafting Interpreters. I speak with Nystrom about building a programming language and an interpreter implementation for it. We talk about parsing, the difference between compiler and interpreters and a lot more. If you are...
byCoRecursive: Coding Stories
0 ratings
0% found this document useful

Skip carousel

Scikit-Learn: The Ultimate Python Library
APC
Article
Scikit-Learn: The Ultimate Python Library
Jul 15, 2019
4 min read
DJANGO Create A Database-driven Website
Linux Format
Article
DJANGO Create A Database-driven Website
Jun 4, 2019
The Django web framework was named after the famous guitarist Django Reinhardt and was first created by web developers at a small newspaper in Kansas. The main goals of Django is to enable fast development of complex websites with database needs. It
7 min read
Manipulate Data Like A Pro With Pandas
Linux Format
Article
Manipulate Data Like A Pro With Pandas
Jul 27, 2021
7 min read
Build A Static Analysis Development Pipeline
Linux Format
Article
Build A Static Analysis Development Pipeline
Jul 27, 2021
9 min read
Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Chicago Tribune
Article
Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Jul 10, 2018
3 min read
How Image Recognition Works
APC
Article
How Image Recognition Works
Nov 4, 2019
4 min read
2 The Use of Python in AI and ML
Techfastly
Article
2 The Use of Python in AI and ML
Nov 30, 2020
3 min read
Make AI Work For You
Linux Format
Article
Make AI Work For You
Apr 2, 2024
8 min read
A KiwiGPT?
New Zealand Listener
Article
A KiwiGPT?
Jun 25, 2023
2 min read
How An A.i. Chatbot Works
Muse: The magazine of science, culture, and smart laughs for kids and children
Article
How An A.i. Chatbot Works
Feb 1, 2024
1 min read
PyScript – Bring Python Coding To The Web
APC
Article
PyScript – Bring Python Coding To The Web
Aug 8, 2022
4 min read
Artificial Intelligence Rules Of The Road
Linux Format
Article
Artificial Intelligence Rules Of The Road
Nov 14, 2023
AI FOR ALL! Anyone who works with computers needs to understand that AI will undoubtedly change how work is executed. That said, I don’t think we are anywhere near the much bleated “Everyone will lose their jobs!” IT-related jobs will change but they
2 min read
Seven Questions About Chatgpt Answered
NZBusiness and Management
Article
Seven Questions About Chatgpt Answered
Apr 18, 2023
3 min read
ParrotOS
Linux Format
Article
ParrotOS
Apr 2, 2024
2 min read
The Deep Learning Revolution For Artificial Intelligence
Facility Management
Article
The Deep Learning Revolution For Artificial Intelligence
Mar 28, 2019
3 min read
Readers’ Comments
PC Pro Magazine
Article
Readers’ Comments
Dec 8, 2022
3 min read
Google Answer Box Strategy
Techfastly
Article
Google Answer Box Strategy
Sep 21, 2020
Leveraging the Google PAA (People Also Ask) element on a Search Results Page for Targeted Content Creation with a Python Scraper All businesses that are online today are creating content at a furious pace. According to Technavio, a research firm, con
7 min read
Family History In The AI Era
Family Tree UK
Article
Family History In The AI Era
Apr 12, 2024
7 min read
Smart Answers: GenAI Tool Makes It Easier To Find The Info You Need On PCWorld
PCWorld
Article
Smart Answers: GenAI Tool Makes It Easier To Find The Info You Need On PCWorld
Sep 5, 2023
4 min read
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
TechLife News
Article
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
Apr 29, 2023
4 min read
Getting The edge
The European Business Review
Article
Getting The edge
Feb 25, 2021
7 min read
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
AppleMagazine
Article
Q&A: OPENAI CTO MIRA MURATI ON SHEPHERDING CHATGPT
Apr 28, 2023
4 min read
Inside APC
APC
Article
Inside APC
Oct 4, 2021
2 min read
HOW WE DO IT Inside APC
APC
Article
HOW WE DO IT Inside APC
Jun 13, 2022
2 min read
An Expert Speaks Up on What You Should Know About Programming Languages
Entrepreneur
Article
An Expert Speaks Up on What You Should Know About Programming Languages
Oct 1, 2015
1 min read
Inside APC
APC
Article
Inside APC
Jan 24, 2022
2 min read
Inside APC
APC
Article
Inside APC
Dec 27, 2021
2 min read
Inside APC
APC
Article
Inside APC
Sep 6, 2021
2 min read
Inside APC
APC
Article
Inside APC
Nov 1, 2021
2 min read
Inside APC
APC
Article
Inside APC
May 16, 2022
2 min read

Related categories

Skip carousel

Reviews for Learning Data Mining with Python - Second Edition

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Learning Data Mining with Python - Second Edition - Robert Layton

Title Page

Learning Data Mining with Python

Second Edition

Use Python to manipulate data and build predictive models

Robert Layton

BIRMINGHAM - MUMBAI

Copyright

Learning Data Mining with Python

Second Edition

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2015

Second edition: April 2017

Production reference: 1250417

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78712-678-7

www.packtpub.com

Credits

About the Author

Robert Layton is a data scientist investigating data-driven applications to businesses across a number of sectors. He received a PhD investigating cybercrime analytics from the Internet Commerce Security Laboratory at Federation University Australia, before moving into industry, starting his own data analytics company dataPipeline (www.datapipeline.com.au). Next, he created Eureaktive (www.eureaktive.com.au), which works with tech-based startups on developing their proof-of-concepts and early-stage prototypes. Robert also runs www.learningtensorflow.com, which is one of the world's premier tutorial websites for Google's TensorFlow library.

Robert is an active member of the Python community, having used Python for more than 8 years. He has presented at PyConAU for the last four years and works with Python Charmers to provide Python-based training for businesses and professionals from a wide range of organisations.

Robert can be best reached via Twitter @robertlayton

Thank you to my family for supporting me on this journey, thanks to all the readers of revision 1 for making it a success, and thanks to Matty for his assistance behind-the-scenes with the book.

About the Reviewer

Asad Ahamad is a data enthusiast and loves to work on data to solve challenging problems.

He did his masters in Industrial Mathematics with Computer Application from Jamia Millia Islamia, New Delhi. He admires Mathematics a lot and always tries to use it to gain maximum profit for business.

He has good experience working on data mining, machine learning and data science and worked for various multinationals in India. He mainly uses R and Python to perform data wrangling and modeling. He is fond of using open source tools for data analysis.

He is active social media user. Feel free to connect him on twitter @asadtaj88

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787126781.

If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Getting Started with Data Mining

Introducing data mining

Using Python and the Jupyter Notebook

Installing Python

Installing Jupyter Notebook

Installing scikit-learn

A simple affinity analysis example

What is affinity analysis?

Product recommendations

Loading the dataset with NumPy

Downloading the example code

Implementing a simple ranking of rules

Ranking to find the best rules

A simple classification example

What is classification?

Loading and preparing the dataset

Implementing the OneR algorithm

Testing the algorithm

Summary

Classifying with scikit-learn Estimators

scikit-learn estimators

Nearest neighbors

Distance metrics

Loading the dataset

Moving towards a standard workflow

Running the algorithm

Setting parameters

Preprocessing

Standard pre-processing

Putting it all together

Pipelines

Summary

Predicting Sports Winners with Decision Trees

Loading the dataset

Collecting the data

Using pandas to load the dataset

Cleaning up the dataset

Extracting new features

Decision trees

Parameters in decision trees

Using decision trees

Sports outcome prediction

Putting it all together

Random forests

How do ensembles work?

Setting parameters in Random Forests

Applying random forests

Engineering new features

Summary

Recommending Movies Using Affinity Analysis

Affinity analysis

Algorithms for affinity analysis

Overall methodology

Dealing with the movie recommendation problem

Obtaining the dataset

Loading with pandas

Sparse data formats

Understanding the Apriori algorithm and its implementation

Looking into the basics of the Apriori algorithm

Implementing the Apriori algorithm

Extracting association rules

Evaluating the association rules

Summary

Features and scikit-learn Transformers

Feature extraction

Representing reality in models

Common feature patterns

Creating good features

Feature selection

Selecting the best individual features

Feature creation

Principal Component Analysis

Creating your own transformer

The transformer API

Implementing a Transformer

Unit testing

Putting it all together

Summary

Social Media Insight using Naive Bayes

Disambiguation

Downloading data from a social network

Loading and classifying the dataset

Creating a replicable dataset from Twitter

Text transformers

Bag-of-words models

n-gram features

Other text features

Naive Bayes

Understanding Bayes' theorem

Naive Bayes algorithm

How it works

Applying of Naive Bayes

Extracting word counts

Converting dictionaries to a matrix

Putting it all together

Evaluation using the F1-score

Getting useful features from models

Summary

Follow Recommendations Using Graph Mining

Loading the dataset

Classifying with an existing model

Getting follower information from Twitter

Building the network

Creating a graph

Creating a similarity graph

Finding subgraphs

Connected components

Optimizing criteria

Summary

Beating CAPTCHAs with Neural Networks

Artificial neural networks

An introduction to neural networks

Creating the dataset

Drawing basic CAPTCHAs

Splitting the image into individual letters

Creating a training dataset

Training and classifying

Back-propagation

Predicting words

Improving accuracy using a dictionary

Ranking mechanisms for word similarity

Putting it all together

Summary

Authorship Attribution

Attributing documents to authors

Applications and use cases

Authorship attribution

Getting the data

Using function words

Counting function words

Classifying with function words

Support Vector Machines

Classifying with SVMs

Kernels

Character n-grams

Extracting character n-grams

The Enron dataset

Accessing the Enron dataset

Creating a dataset loader

Putting it all together

Evaluation

Summary

Clustering News Articles

Trending topic discovery

Using a web API to get data

Reddit as a data source

Getting the data

Extracting text from arbitrary websites

Finding the stories in arbitrary websites

Extracting the content

Grouping news articles

The k-means algorithm

Evaluating the results

Extracting topic information from clusters

Using clustering algorithms as transformers

Clustering ensembles

Evidence accumulation

How it works

Implementation

Online learning

Implementation

Summary

Object Detection in Images using Deep Neural Networks

Object classification

Use cases

Application scenario

Deep neural networks

Intuition

Implementing deep neural networks

An Introduction to TensorFlow

Using Keras

Convolutional Neural Networks

GPU optimization

When to use GPUs for computation

Running our code on a GPU

Setting up the environment

Application

Getting the data

Creating the neural network

Putting it all together

Summary

Working with Big Data

Big data

Applications of big data

MapReduce

The intuition behind MapReduce

A word count example

Hadoop MapReduce

Applying MapReduce

Getting the data

Naive Bayes prediction

The mrjob package

Extracting the blog posts

Training Naive Bayes

Putting it all together

Training on Amazon's EMR infrastructure

Summary

Next Steps...

Getting Started with Data Mining

Scikit-learn tutorials

Extending the Jupyter Notebook

More datasets

Other Evaluation Metrics

More application ideas

Classifying with scikit-learn Estimators

Scalability with the nearest neighbor

More complex pipelines

Comparing classifiers

Automated Learning

Predicting Sports Winners with Decision Trees

More complex features

Dask

Research

Recommending Movies Using Affinity Analysis

New datasets

The Eclat algorithm

Collaborative Filtering

Extracting Features with Transformers

Adding noise

Vowpal Wabbit

word2vec

Social Media Insight Using Naive Bayes

Spam detection

Natural language processing and part-of-speech tagging

Discovering Accounts to Follow Using Graph Mining

More complex algorithms

NetworkX

Beating CAPTCHAs with Neural Networks

Better (worse?) CAPTCHAs

Deeper networks

Reinforcement learning

Authorship Attribution

Increasing the sample size

Blogs dataset

Local n-grams

Clustering News Articles

Clustering Evaluation

Temporal analysis

Real-time clusterings

Classifying Objects in Images Using Deep Learning

Mahotas

Magenta

Working with Big Data

Courses on Hadoop

Pydoop

Recommendation engine

W.I.L.L

More resources

Kaggle competitions

Coursera

Preface

The second revision of Learning Data Mining with Python was written with the programmer in mind. It aims to introduce data mining to a wide range of programmers, as I feel that this is critically important to all those in the computer science field. Data mining is quickly becoming the building block of the next generation of Artificial Intelligence systems. Even if you don't find yourself building these systems, you will be using them, interfacing with them, and being guided by them. Understand the process behind them is important and helps you get the best out of them.

The second revision builds upon the first. Many of chapters and exercises are similar, although new concepts are introduced and exercises are expanded in scope. Those that had read the first revision should be able to move quickly through the book and pick up new knowledge along the way and engage with the extra activities proposed. Those new to the book are encouraged to take their time, do the exercises and experiment. Feel free to break the code to understand it, and reach out if you have any questions.

As this is a book aimed at programmers, we assume that you have some knowledge of programming and of Python itself. For this reason, there is little explanation of what the Python code itself is doing, except in cases where it is ambiguous.

What this book covers

Chapter 1, Getting started with data mining, introduces the technologies we will be using, along with implementing two basic algorithms to get started.

Chapter 2, Classifying with scikit-learn, covers classification, a key form of data mining. You’ll also learn about some structures for making your data mining experimentation easier to perform..

Chapter 3, Predicting Sports Winners with Decisions Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses it to predict sports winners by creating useful features..

Chapter 4, Recommending Movies using Affinity Analysis, looks at the problem of recommending products based on past experience, and introduces the Apriori algorithm.

Chapter 5, Features and scikit-learn Transformers, introduces more types of features you can create, and how to work with different datasets.

Chapter 6, Social Media Insight using Naive Bayes, uses the Naïve Bayes algorithm to automatically parse text-based information from the social media website Twitter.

Chapter 7, Follow Recommendations Using Graph Mining, applies cluster analysis and network analysis to find good people to follow on social media.

Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images, and then training neural networks to find words and letters in those images.

Chapter 9, Authorship attribution, looks at determining who wrote a given documents, by extracting text-based features and using Support Vector Machines.

Chapter 10, Clustering news articles, uses the k-means clustering algorithm to group together news articles based on their content.

Chapter 11, Object Detection in Images using Deep Neural Networks, determines what type of object is being shown in an image, by applying deep neural networks.

Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.

Appendix, Next step, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.

What you need for this book

It should come as no surprise that you’ll need a computer, or access to one, to complete the book. The computer should be reasonably modern, but it doesn’t need to be overpowered. Any modern processor (from about 2010 onwards) and 4 gigabytes of RAM will suffice, and you can probably run almost all of the code on a slower system too.

The exception here is with the final two chapters. In these chapters, I step through using Amazon’s web services (AWS) for running the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don’t want to pay for those services, the tools used can all be set-up on a local computer, but you will definitely need a modern system to run it. A processor built in at least 2012, and more than 4 GB of RAM are necessary.

I recommend the Ubuntu operating system, but the code should work well on Windows, Macs, or any other Linux variant. You may need to consult the documentation for your system to get some things installed though.

In this book, I use pip for installing code, which is a command line tool for installing Python libraries. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads

I also have tested all code using Python 3. Most of the code examples work on Python 2 with no changes. If you run into any problems, and can’t get around it, send an email and we can offer a solution.

Who this book is for

This book is for programmers that want to get started in data mining in an application-focused manner.

If you haven’t programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn’t introduce programming, nor does it give too much time to explaining the actual implementation (in-code) of how to type out the instructions. That said, once you go through the basics, you should be able to come back to this book fairly quickly – there is no need to be an expert programmer first!

I highly recommend that you have some Python programming experience. If you don’t, feel free to jump in, but you might want to take a look at some Python code first, possibly focused on tutorials using the IPython notebook. Writing programs in the IPython notebook works a little differently than other methods, such as writing a Java program in a fully-fledged IDE.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: The next lines of code read the link and assign it to the to the dataset_filename function.

A block of code is set as follows:

Any command-line input or output is written as follows:

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter.

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Data-Mining-with-Python-Second-Edition. The benefit of the github repository is that any issues with the code, including problems relating to software version changes, will be kept track of and the code there will include changes from readers around the world. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

To avoid indention issues please use the code bundle to run the codes in the IDE instead of copying directly from the PDF.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Getting Started with Data Mining

We are collecting information about our world on a scale that has never been seen before in the history of humanity. Along with this trend, we are now placing more day-to-day importance on the use of this information in everyday life. We now expect our computers to translate web pages into other languages, predict the weather with high accuracy, suggest books we would like, and to diagnose our health issues. These expectations will grow into the future, both in application breadth and efficacy. Data Mining is a methodology that we can employ to train computers to make decisions with data and forms the backbone of many high-tech systems of today.

The Python programming language is fast growing in popularity, for a good reason. It gives the programmer flexibility, it has many modules to perform different tasks, and Python code is usually more readable and concise than in any other languages. There is a large and an active community of researchers, practitioners, and beginners using Python for data mining.

In this chapter, we will introduce data mining with Python. We will cover the following topics

What is data mining and where can we use it?

Setting up a Python-based environment to perform data mining

An example of affinity analysis, recommending products based on purchasing habits

An example of (a classic) classification problem, predicting the plant species based on its measurement

Introducing data mining

Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.

Data mining is part algorithm design, statistics, engineering, optimization, and computer science. However, combined with these base skills in the area, we also need to apply domain knowledge (expert knowledge)of the area we are applying the data mining. Domain knowledge is critical for going from good results to great results. Applying data mining effectively usually requires this domain-specific knowledge to be integrated with the algorithms.

Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.

Data mining applications involve creating data sets and tuning the algorithm as explained in the following steps

We start our data mining process by creating a dataset, describing an aspect of the real world. Datasets comprise of the following two aspects:

Samples: These are objects in the real world, such as a book, photograph, animal, person, or any other object. Samples are also referred to as observations, records or rows, among other naming conventions.

Features: These are descriptions or measurements of the samples in our dataset. Features could be the length, frequency of a specific word, the number of legs on an animal, date it was created, and so on. Features are also referred to as variables, columns, attributes or covariant, again among other naming conventions.

The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.

As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:

As explained above, the next step involves tuning the parameters of our algorithm. As a simple algorithm; if the height is more than x, the person is tall. Otherwise, they are short. Our training algorithms will then look at the data and decide on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm. Anyone else is considered short by this measure. This then lets our algorithm classify new data, such as a person with height 167 cm, even though we may have never seen a person with those measurements before.

In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.

In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.

Using Python and the Jupyter Notebook

In this section, we will cover installing Python and the environment that we will use for most of the book, the Jupyter Notebook. Furthermore, we will install the NumPy module, which we will use for the first set of examples.

The Jupyter Notebook was, until very recently, called the IPython Notebook. You'll notice the term in web searches for the project. Jupyter is the new name, representing a broadening of the project beyond using just Python.

Installing Python

The Python programming language is a fantastic, versatile, and an easy to use language.

For this book, we will be using Python 3.5, which is available for your system from the Python Organization's website https://www.python.org/downloads/. However, I recommend that you use Anaconda to install Python, which you can download from the official website at https://www.continuum.io/downloads.

There will be two major versions to choose from, Python 3.5 and Python 2.7. Remember to download and install Python 3.5, which is the version tested throughout this book. Follow the installation instructions on that website for your system. If you have a strong reason to learn version 2 of Python, then do so by downloading the Python 2.7 version. Keep in mind that some code may not work as in the book, and some workarounds may be needed.

In this book, I assume that you have some knowledge of programming and Python itself. You do not need to be an expert with Python to complete this book, although a good level of knowledge will help. I will not be explaining general code structures and syntax in this book, except where it is different from what is considered normal python coding practice.

If you do not have any experience with programming, I recommend that you pick up the Learning Python book from Packt Publishing, or the book Dive Into Python, available online at www.diveintopython3.net

The Python organization also maintains a list of two online tutorials for those new to Python:

For non-programmers who want to learn to program through the Python language:

https://wiki.python.org/moin/BeginnersGuide/NonProgrammers

For programmers who already know how to program, but need to learn Python specifically:

https://wiki.python.org/moin/BeginnersGuide/Programmers

Windows users will need to set an environment variable to use Python from the command line, where other systems will usually be immediately executable. We set it in the following steps

First, find where you install Python 3 onto your computer; the default location is C:\Python35.

Next, enter this command into the command line (cmd program): set the environment to PYTHONPATH=%PYTHONPATH%;C:\Python35.

Remember to change the C:\Python35 if your installation of Python is in a different folder.

Once you have Python running on your system, you should be able to open a command prompt and can run

Enjoying the preview?

Page 1 of 1

Learning Data Mining with Python - Second Edition

About this ebook

Robert Layton

Read more from Robert Layton

Related authors

Related to Learning Data Mining with Python - Second Edition

Related ebooks

Computers For You

Related podcast episodes

Related articles

Related categories

Reviews for Learning Data Mining with Python - Second Edition

What did you think?

Book preview

Learning Data Mining with Python - Second Edition - Robert Layton

Title Page

Learning Data Mining with Python

Second Edition

Use Python to manipulate data and build predictive models

Robert Layton

Copyright

Learning Data Mining with Python

Second Edition

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Getting Started with Data Mining

Introducing data mining

Using Python and the Jupyter Notebook

Installing Python