Practical Machine Learning for Data Analysis Using Python
Ebook, 760 pages, 4 hours

About this ebook

Practical Machine Learning for Data Analysis Using Python is a problem solver’s guide for creating real-world intelligent systems. It provides a comprehensive approach with concepts, practices, hands-on examples, and sample code. The book teaches readers the vital skills required to understand and solve different problems with machine learning. It teaches machine learning techniques necessary to become a successful practitioner, through the presentation of real-world case studies in Python machine learning ecosystems. The book also focuses on building a foundation of machine learning knowledge to solve different real-world case studies across various fields, including biomedical signal analysis, healthcare, security, economics, and finance. Moreover, it covers a wide range of machine learning models, including regression, classification, and forecasting. The goal of the book is to help a broad range of readers, including IT professionals, analysts, developers, data scientists, engineers, and graduate students, to solve their own real-world problems.

  • Offers a comprehensive overview of the application of machine learning tools in data analysis across a wide range of subject areas
  • Teaches readers how to apply machine learning techniques to biomedical signals, financial data, and healthcare data
  • Explores important classification and regression algorithms as well as other machine learning techniques
  • Explains how to use Python to handle data extraction, manipulation, and exploration techniques, as well as how to visualize data spread across multiple dimensions and extract useful features
Language: English
Release date: Jun 5, 2020
ISBN: 9780128213803
Author

Abdulhamit Subasi

Abdulhamit Subasi is a highly specialized expert in the fields of Artificial Intelligence, Machine Learning, and Biomedical Signal and Image Processing. His extensive expertise in applying machine learning across diverse domains is evident in his numerous contributions, including the authorship of multiple book chapters, as well as the publication of a substantial body of research in esteemed journals and conferences. His career has spanned various prestigious institutions, including the Georgia Institute of Technology in Georgia, USA, where he served as a dedicated researcher. In recognition of his outstanding research contributions, Subasi received the prestigious Queen Effat Award for Excellence in Research in May 2018. His academic journey includes a tenure as a Professor of computer science at Effat University in Jeddah, Saudi Arabia, from 2015 to 2020. Since 2020, he has assumed the role of Professor of medical physics at the Faculty of Medicine, University of Turku in Turku, Finland.

    Practical Machine Learning for Data Analysis Using Python

    Abdulhamit Subasi

    Professor of Information Systems at Effat University, Jeddah, Saudi Arabia

    Contents

    Cover

    Title page

    Copyright

    Dedication

    Preface

    Acknowledgments

    Chapter 1: Introduction

    Abstract

    1.1. What is machine learning?

    1.2. Machine learning framework

    1.3. Performance evaluation

    1.4. The Python machine learning environment

    1.5. Summary

    Chapter 2: Data preprocessing

    Abstract

    2.1. Introduction

    2.2. Feature extraction and transformation

    2.3. Dimension reduction

    2.4. Clustering for feature extraction and dimension reduction

    Chapter 3: Machine learning techniques

    Abstract

    3.1. Introduction

    3.2. What is machine learning?

    3.3. Python libraries

    3.4. Learning scenarios

    3.5. Supervised learning algorithms

    3.6. Unsupervised learning

    3.7. Reinforcement learning

    3.8. Instance-based learning

    3.9. Summary

    Chapter 4: Classification examples for healthcare

    Abstract

    4.1. Introduction

    4.2. EEG signal analysis

    4.3. EMG signal analysis

    4.4. ECG signal analysis

    4.5. Human activity recognition

    4.6. Microarray gene expression data classification for cancer detection

    4.7. Breast cancer detection

    4.8. Classification of the cardiotocogram data for anticipation of fetal risks

    4.9. Diabetes detection

    4.10. Heart disease detection

    4.11. Diagnosis of chronic kidney disease (CKD)

    4.12. Summary

    Chapter 5: Other classification examples

    Abstract

    5.1. Intrusion detection

    5.2. Phishing website detection

    5.3. Spam e-mail detection

    5.4. Credit scoring

    5.5. Credit card fraud detection

    5.6. Handwritten digit recognition using CNN

    5.7. Fashion-MNIST image classification with CNN

    5.8. CIFAR image classification using CNN

    5.9. Text classification

    5.10. Summary

    Chapter 6: Regression examples

    Abstract

    6.1. Introduction

    6.2. Stock market price index return forecasting

    6.3. Inflation forecasting

    6.4. Electrical load forecasting

    6.5. Wind speed forecasting

    6.6. Tourism demand forecasting

    6.7. House prices prediction

    6.8. Bike usage prediction

    6.9. Summary

    Chapter 7: Clustering examples

    Abstract

    7.1. Introduction

    7.2. Clustering

    7.3. The k-means clustering algorithm

    7.4. The k-medoids clustering algorithm

    7.5. Hierarchical clustering

    7.6. The fuzzy c-means clustering algorithm

    7.7. Density-based clustering algorithms

    7.8. The expectation of maximization for Gaussian mixture model clustering

    7.9. Bayesian clustering

    7.10. Silhouette analysis

    7.11. Image segmentation with clustering

    7.12. Feature extraction with clustering

    7.13. Clustering for classification

    7.14. Summary

    Index

    Copyright

    Academic Press is an imprint of Elsevier

    125 London Wall, London EC2Y 5AS, United Kingdom

    525 B Street, Suite 1650, San Diego, CA 92101, United States

    50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

    The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

    Copyright © 2020 Elsevier Inc. All rights reserved.

    No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

    This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    Library of Congress Cataloging-in-Publication Data

    A catalog record for this book is available from the Library of Congress

    British Library Cataloguing-in-Publication Data

    A catalogue record for this book is available from the British Library

    ISBN: 978-0-12-821379-7

    For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

    Publisher: Mara Conner

    Editorial Project Manager: Rafael G. Trombaco

    Production Project Manager: Paul Prasad Chandramohan

    Designer: Christian Bilbow

    Typeset by Thomson Digital

    Dedication

    A huge thank-you to my parents for always expecting me to do my best and for telling me I could accomplish anything, no matter what it was.

    To my wife, Rahime, for her patience and support.

    To my wonderful children, Seyma Nur, Tuba Nur and Muhammed Enes. You are always in my heart and the joys in my life.

    To those who read this book, and appreciate the work that goes into it, thank you. If you have any feedback, please let me know.

    Abdulhamit Subasi

    Preface

    Rapid developments in machine learning solutions and adoption across various sectors of industry enable the learning of complex models of real-world problems from observed (training) data through systemic solutions in different fields. Significant time and effort are required to create effective machine learning models and achieve reliable outcomes. The main project concepts can be grasped by building robust data pipelines and analyzing and visualizing data using feature extraction, selection, and modeling. Therefore, the extensive need for a reliable machine learning solution involves a development framework that not only is suitable for immersive machine learning modeling but also succeeds in preprocessing, visualization, system integration, and robust support for runtime deployment and maintenance setting. Python is an innovative programming language with multipurpose features, simple implementation and integration, an active developer community, and an ever-increasing machine learning ecosystem, contributing to the expanding adoption of machine learning.

    Intelligent structures and data-driven enterprises are becoming a reality, and the developments in techniques and technologies are enabling this to happen. With data being of utmost importance, the market for machine learning and data science practitioners has never been larger than it is now. In fact, the world is facing a shortage of data scientists and machine learning experts. Arguably the most demanding job in the 21st century involves developing some significant expertise in this domain.

    Machine learning techniques are computing algorithms, including artificial neural networks, k-nearest neighbor, support vector machines, decision tree algorithms, and deep learning. Machine learning applications are currently of great interest in economics, security, healthcare, biomedicine, and biomedical engineering. This book describes how to use machine learning techniques to analyze the data in these fields.

    The author of this book has a great deal of practical experience in the implementation of real-world problems utilizing Python and its machine learning ecosystem. Practical Machine Learning for Data Analysis Using Python aims to improve the skill levels of readers and qualify them to create practical machine learning solutions. Moreover, this book is a problem solver’s guide for building intelligent real-world systems. It offers a systematic framework that includes principles, procedures, practical examples, and code. The book also contributes to the critical skills needed by its readers to understand and solve various machine learning problems.

    This book is an excellent reference for readers developing machine learning techniques by using real-world case studies in the Python machine learning environment. It focuses on building a foundation of machine learning knowledge to solve different case studies from different fields in the real world, including biomedical signal analysis, healthcare, security, economy, and finance. In addition, it focuses on a broad variety of models for machine learning, including regression, classification, clustering, and forecasting.

    This book consists of seven chapters. Chapter 1 gives an introduction to data analysis using machine learning techniques. Chapter 2 provides an overview of data preprocessing such as feature extraction, transformation, feature selection, and dimension reduction. Chapter 3 offers an overview of machine learning techniques such as naïve Bayes, k-nearest neighbor, artificial neural networks, support vector machines, decision tree, random forest, bagging, boosting, stacking, voting, deep neural network, recurrent neural network, and convolutional neural networks, for forecasting, prediction, and classification. Chapter 4 presents classification examples for healthcare. It includes electrocardiogram (ECG), electroencephalogram (EEG), and electromyogram (EMG) signal-processing techniques commonly used in the analysis and recognition of biomedical signals. In addition, it presents several medical data classifications, such as human activity recognition, microarray gene expression data classification for cancer detection, breast cancer detection, diabetes detection, and heart disease detection. Chapter 5 considers several applications, including intrusion detection, phishing website detection, spam e-mail detection, credit scoring, credit card fraud detection, handwritten digit recognition, image classification, and text classification. Chapter 6 provides regression examples, such as stock market analysis, economic variable forecasting, electrical load forecasting, wind speed forecasting, tourism demand forecasting, and house prices prediction. Chapter 7 includes several examples related to unsupervised learning (clustering).

    The main intent of this book is to help a wide range of readers to solve their own real-world problems, including IT professionals, analysts, developers, data scientists, and engineers. Furthermore, this book is intended to be a useful textbook for postgraduate and research students working in the areas of data science and machine learning. It also formulates a basis for researchers who are interested in applying machine learning methods to data analysis. In addition, this book will help a broad readership, including researchers, professionals, academics, and graduate students from a wide range of disciplines, who are beginning to look for applications in biomedical signal analysis, healthcare data analysis, financial and economic data forecasting, computer security, and more.

    Executing the code examples provided in this book requires Python 3.x to be installed on macOS, Linux, or Microsoft Windows. The examples throughout the book frequently use essential Python libraries for scientific computing, such as SciPy, NumPy, scikit-learn, matplotlib, pandas, OpenCV, TensorFlow, and Keras.

    Acknowledgments

    First of all, I would like to thank my publisher Elsevier and its team of dedicated professionals who have made this book-writing journey very simple and effortless, as well as all those who have worked in the background to make this book a success.

    I would like to thank Sara Pianavilla and Rafael Trombaco for their great support. Also, I would like to thank Paul Prasad Chandramohan for being patient in getting everything necessary to complete this book.

    Abdulhamit Subasi

    Chapter 1

    Introduction

    Abstract

    Recently many achievements in artificial intelligence, machine learning, and data mining have been realized. Data collected from several sources contain valuable information that cannot be observed directly but is hidden in the data structure. This hidden information can be transformed into useful information by using various machine learning techniques. Moreover, data analysis using diverse machine learning algorithms can become a vital instrument in extracting significant information hidden in the data. Data analysis is used in many areas for decision support. Typically, collected data is evaluated by experts in the field, but this might result in unreliable or inefficient decision support. Consequently the aim of automated data analysis is to reduce the subjectivity of the expert assessment in business intelligence, or decision support. The artificial intelligence–based data analysis used for assessing different data characteristics helps to make objective decisions by improving accuracy.

    Keywords

    machine learning techniques

    artificial intelligence

    data mining

    business intelligence

    decision support

    1.1. What is machine learning?

    With the improved computational power and storage capacity of computers, our era has become the age of information, or the age of data. We must therefore analyze big data and create intelligent systems by using concepts and techniques from artificial intelligence, data science, data mining, and machine learning. Most of us have heard these terms and have been told that data is the new oil. Over the last decade, the most important task for organizations and businesses has been to understand their data and use it to make better-informed decisions. In fact, with major developments in technology, a thriving ecosystem has grown around fields such as machine learning, artificial intelligence, and deep learning. Researchers, engineers, and data scientists have created frameworks, tools, techniques, algorithms, and methodologies to build intelligent systems and models that can automate tasks, detect anomalies, perform complex analyses, and predict events (Sarkar, Bali, & Sharma, 2018).

    Machine learning is defined as a set of computational techniques that use experience to improve performance or to make accurate predictions. Here, experience denotes the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data might be in the form of digitized human-labeled training sets or other kinds of information collected by interacting with the environment. In all cases, the size and quality of the data are critical to the success of the predictions made by the learner. Machine learning consists of designing efficient and accurate prediction algorithms. As in other fields of computer science, crucial measures of the quality of these algorithms are their space and time complexity. In machine learning, however, the concept of sample complexity is also needed to assess the sample size required for an algorithm to learn a family of concepts. In general, the theoretical learning guarantees for a method depend on the complexity of the model classes considered and the size of the training sample. Because the performance of a learning technique depends on the data and features used, machine learning is inherently related to statistics and data analysis. Typically, learning algorithms are data-driven techniques that combine fundamental concepts in computer science with ideas from probability, statistics, and optimization. These applications correspond to broad categories of learning problems. The main types of learning problems are classification, regression, ranking, clustering, and dimension reduction (Mohri, Rostamizadeh, & Talwalkar, 2018).

    In classification, a category is assigned to every item. The number of categories can be small or large depending on the type of problem. In regression, a real value is predicted for every item. Stock value prediction and the prediction of variations in economic variables are regression problems. In regression, the penalty for an incorrect prediction depends on the magnitude of the difference between the predicted and true values, whereas in classification there is typically no notion of closeness between different categories. In ranking, items are ordered according to some criterion. In clustering, items are partitioned into homogeneous groups. Clustering is generally used to analyze very large data sets. For instance, in social network analysis, clustering algorithms are used to identify communities within large groups of people. Manifold learning, or dimensionality reduction, transforms an initial representation of items into a lower-dimensional representation while preserving some properties of the initial representation. The aims of machine learning are to make accurate predictions for unseen data and to design robust and efficient algorithms that produce these predictions, even for large-scale problems (Mohri et al., 2018).
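
The difference between the two penalties can be made concrete with a small sketch. The function names `zero_one_loss` and `squared_error` and the sample values are illustrative assumptions, not code from the book; squared error is only one common choice of regression penalty.

```python
def zero_one_loss(true_label, predicted_label):
    """Classification penalty: 1 for any mistake, 0 for a correct label."""
    return 0 if true_label == predicted_label else 1


def squared_error(true_value, predicted_value):
    """Regression penalty: grows with the gap between prediction and truth."""
    return (true_value - predicted_value) ** 2


# Hypothetical values, chosen only for illustration.
print(zero_one_loss("cat", "dog"))    # a wrong category costs 1 ...
print(zero_one_loss("cat", "truck"))  # ... whichever wrong category it is
print(squared_error(100.0, 101.0))    # a near miss is cheap
print(squared_error(100.0, 110.0))    # a far miss is expensive
```

In this sketch every wrong category receives the same penalty, while the regression penalty increases as the prediction moves away from the true value.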

    Machine learning uses the right features to create accurate models, which accomplish the right tasks. Features define the relevant objects in our domain. A task is an abstract representation of a problem to be solved concerning those domain objects. The most common form of task is classifying the objects into two or more classes. Most tasks can be characterized as a mapping from data points to outputs. This mapping, or model, is itself produced as the output of a machine learning method applied to training data (Flach, 2012). We will discuss tasks and problems that can be solved with machine learning. No matter what types of machine learning models are encountered, they are designed to solve only a small number of tasks and use only a few types of features.

    Most of the time, the knowledge or insight we are trying to extract from raw data will not be understandable by looking at the data. Machine learning converts data into information. Machine learning sits at the intersection of statistics, engineering, and computer science and is frequently seen in other fields. It can be used in a variety of fields, such as finance, economy, politics, geosciences, and medicine. It is a tool to solve different problems. Any field that requires understanding and working with data can benefit from machine learning methods. There are many problems in which the solution is not deterministic. Hence, we need statistics for these problems (Harrington, 2012).

    This book takes an example-based approach to cover different practices, concepts, and problems related to machine learning. The main idea is to give readers enough knowledge to solve machine learning problems and to use the main building blocks of machine learning in data analysis. This will enable the reader to learn how machine learning can be used to analyze data.

    1.1.1. Why is machine learning needed?

    Human beings are the most intelligent creatures in this world. They can define, create, assess, and solve complex problems. The human brain is still not completely understood, and therefore artificial intelligence has still not surpassed human intelligence in many respects. In view of what you have studied so far, although the conventional programming model is rather good, and domain expertise and human intelligence are absolutely vital components in making data-driven decisions, machine learning is needed to produce more accurate and quicker decisions. A machine learning technique takes data and expected outputs or results, if any, and uses the computer to create the program, which is known as a model. This model can then be used in the future to make the required decisions and produce expected outputs from new data. The machine uses the input data and expected outputs to learn characteristic patterns from the data; these patterns eventually yield a model, similar to a computer program, that helps make data-driven decisions in the future (classify or predict) for new input data points by using the knowledge learned from past experience. This becomes clearer when we consider a real-world problem, such as managing infrastructure for a decision support company. To solve such a problem with machine learning, we should implement the following steps.

    • Utilize device data and logs to obtain sufficient historical data in a certain data warehouse.

    • Determine key data attributes, which might be beneficial for creating a model.

    • Monitor and record device attributes and their behavior for long time intervals, which contain normal device behavior and anomalous device behavior or outliers.

    • Use these input and output pairs with any particular machine learning method to create a model that learns characteristic device patterns and the corresponding output.

    • Deploy this model on unseen values of device attributes to predict whether a specific device is working normally or whether it may cause a potential problem. Hence, once a machine learning model is developed, it can easily be deployed to create an intelligent framework around it, such that devices can not only be monitored reactively, but potential problems can also be proactively detected and even fixed before any issue appears.
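
The steps above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the device attributes (temperature and errors per hour), the readings, the labels, and the choice of a nearest-centroid rule as the learning method. The source does not prescribe a particular algorithm.

```python
from math import dist
from statistics import fmean

# Hypothetical historical records:
# (temperature in degrees C, errors per hour) -> observed behavior.
history = [
    ((41.0, 0.0), "normal"),
    ((43.0, 1.0), "normal"),
    ((45.0, 0.0), "normal"),
    ((78.0, 9.0), "anomalous"),
    ((82.0, 12.0), "anomalous"),
    ((80.0, 15.0), "anomalous"),
]


def train(records):
    """Learn one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new reading."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(history)
print(predict(model, (44.0, 1.0)))    # unseen reading near the normal group
print(predict(model, (79.0, 10.0)))   # unseen reading near the anomalous group
```

The model here is just a dictionary of centroids, and the features are not rescaled. A practical system would normally standardize the attributes and use a more capable learner.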

    In fact, the workflow discussed above, with its series of stages for creating a machine learning model, is considerably more complicated in practice than depicted here. It is meant only to highlight the main ideas, to help you think about machine learning processes conceptually rather than technically, and to show that you need to shift your thinking from conventional approaches toward a more data-driven one. The beauty of machine learning is that it is never domain constrained, and you can use its algorithms to solve problems spanning many areas, industries, and businesses. Likewise, output data points are not always needed to build a model; sometimes input data alone is sufficient, as in unsupervised learning (Sarkar et al., 2018).
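
As a minimal sketch of learning from input data alone, the following groups unlabeled readings with a basic k-means loop for two clusters. The function name `two_means`, the readings, and the simple min/max initialization are assumptions chosen for illustration.

```python
from statistics import fmean


def two_means(values, iterations=10):
    """Split 1-D values into two groups with a basic k-means loop (k = 2)."""
    centers = [min(values), max(values)]  # simple, deterministic starting points
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer of the two current centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [fmean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical, unlabeled device temperatures: no expected outputs are given.
readings = [40.0, 42.0, 44.0, 79.0, 80.0, 81.0]
centers, groups = two_means(readings)
print(centers)
print(groups)
```

No labels are supplied in this sketch; the two groups emerge only from the structure of the inputs.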

    1.1.2. Making data-driven decisions

    Extracting crucial insights or information from data is the main goal of companies and business organizations that invest heavily in a skilled workforce in areas such as artificial intelligence and machine learning. The concept of data-driven decisions is not new; it has been used for decades in statistics, management information systems, and operations research to improve the effectiveness of decisions. Obviously, this is easier said than done, since raw data can rarely be used directly to make insightful decisions. Another important aspect of this problem is that we generally rely on intuition or reasoning to make decisions based on what we have experienced in the past. Our brain is a powerful organ that helps us recognize people in images, understand what our colleagues or friends are saying, decide whether to accept or refuse a business transaction, and so on. Our brain does most of the thinking for us. This is precisely why it is hard for machines to learn and solve problems such as computing tax rebates or loan interests. A remedy for these problems is to use approaches such as data-driven machine learning techniques to improve decisions. Although data-driven decision making is of vital importance, it also needs to be implemented at scale and with efficiency. The main idea of using artificial intelligence or machine learning techniques is to automate tasks or procedures by learning specific patterns from the data (Sarkar et al., 2018).

    Nowadays, the majority of the workforce in developed countries is moving from manual labor to knowledge work. Things are much more ambiguous now; job assignments such as minimizing risk, maximizing profits, and finding the best marketing strategy are all too common. The flood of information available on the World Wide Web makes the job of knowledge workers even harder. Making sense of all the data with our job in mind is becoming a more essential skill. With so many economic activities dependent on information, we cannot afford to be lost in the data. Machine learning helps to analyze all the data and extract valuable information (Harrington, 2012).

    1.1.3. Definitions and key terminology

    It is common practice to measure many things and sort out the significant ones later. The items that are measured are called features or attributes, and together they form an instance (Harrington, 2012). One of the crucial steps in machine learning is feature extraction. The data to be processed are composed of many points, and characteristic and informative features can be extracted from them by employing different feature extraction techniques. These informative and characteristic parameters describe the behavior of the data, which may specify a precise achievement. Highlighting informative and characteristic features describes the data better. These features can be extracted with diverse feature extraction algorithms, which form another step in data analysis that simplifies the subsequent classification stage (Graimann, Allison, & Pfurtscheller, 2009). To achieve better performance, it is crucial to work with a smaller number of values that express the relevant features of the data. Features are generally collected into a feature vector, and transforming data into such a feature vector is known as feature extraction. A data classification system examines the characteristic features of the data and, based on those distinctive features, decides the class of the data (Subasi, 2019a).
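
A minimal sketch of turning raw samples into feature vectors follows. The chosen statistics (mean, standard deviation, minimum, maximum), the window size, and the signal values are illustrative assumptions rather than the feature set used in the book.

```python
from statistics import fmean, pstdev


def extract_features(window):
    """Summarize one window of raw samples as a fixed-length feature vector."""
    return [
        fmean(window),   # average level
        pstdev(window),  # spread around the mean (population standard deviation)
        min(window),     # smallest sample
        max(window),     # largest sample
    ]


# Hypothetical raw signal, split into two windows of four samples each.
signal = [1.0, 3.0, 1.0, 3.0, 10.0, 10.0, 10.0, 10.0]
window_size = 4
windows = [signal[i:i + window_size] for i in range(0, len(signal), window_size)]
feature_vectors = [extract_features(w) for w in windows]
print(feature_vectors)
```

Each window of four samples is reduced to four descriptive values. On longer windows the same approach gives a real reduction in size, and the resulting vectors are what a classifier would receive as input.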

    In many cases, the extracted features are not enough to completely explain the nature of the data. In particular, this kind of problem arises when a suboptimal or redundant feature set is used to describe the problem. Instead of seeking better features, it is better to assume that there is a nonlinear relation between the input and output of the given system. For instance, an automatic diagnostic system for disease detection that uses biosignal waveforms employs the processed data as input. The aim is to learn the relationship between the information given to the system and the associated disease. After training, when new data is given to the system, it will identify the correct disease. These kinds of tasks can be accomplished easily by machine learning techniques (Begg, Lai, & Palaniswami, 2008; Subasi, 2019b).

    There are several computational intelligence techniques, such as supervised learning, unsupervised learning, reinforcement learning, and deep learning. Among these learning paradigms, the most studied is supervised learning, which is based on function estimation. An external supervisor provides the supervised learning algorithm with a set of examples along with their class labels. The system identifies the hidden relationship between the sample set and the desired output. After this training phase, it is easy to predict the output for unknown examples. Reinforcement learning, stochastic learning, and risk minimization are some paradigms in supervised learning (Begg et al., 2008; Subasi, 2019b).

    Classification is one of the main tasks in machine learning. For instance, suppose we want to differentiate epileptic EEG signals from normal EEG signals. We must use EEG equipment and then hire a neurologist (an EEG expert) to analyze the EEG signal taken from a subject. This might be expensive and cumbersome, and the expert neurologist can only be in one place at a time. We can automate this process by attaching the EEG equipment to a computer to identify the epileptic patient. How do we then decide whether a subject has epilepsy? This task is termed classification, and there are numerous machine learning techniques that are good at it. The class in this example is either epileptic or normal. Once we have decided on a machine learning technique to use for classification, the next step is to train the algorithm, that is, to allow it to learn. To train the algorithm, it must be fed quality data known as a training set. A training set is the set of training examples used to train the machine learning algorithm. Each training instance has numerous features and one target variable (class). The target variable is what the machine learning technique tries to predict. In classification the target variable takes on a nominal value, and in regression its value can be continuous. The target variable (class) is known in the training set. The machine learns by discovering some relationship between the features and the target variable. The target variable represents the types or classes, so it can be restricted to nominal values. In the classification problem the target variables (classes) are assumed to be a
