Fun Q: A Functional Introduction to Machine Learning in Q

Ebook, 828 pages

About this ebook

Bring the power of machine learning to the fastest time-series database. Fun Q uses the powerful q programming language to implement many of the most famous machine-learning algorithms. Using a meticulously factored machine-learning library, each algorithm is broken into its basic building blocks and then rebuilt from scratch.

Language: English
Publisher: Vector Sigma
Release date: Dec 21, 2020
ISBN: 9781734467512


    Fun Q

    A Functional Introduction to Machine Learning in Q

    Nick Psaris

    Copyright © 2020 Nick Psaris

    All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher.

    Kx and kdb+ are registered trademarks of Kx Systems, Inc.

    Books may be purchased in quantity and/or for special sales by contacting the publisher, Vector Sigma, at sales@vector-sigma.com.


    Table of Contents

    Preface
    1. Introduction
        1.1. Who is This Book For?
        1.2. How to Read This Book
        1.3. Getting Started
        1.4. What is in This Book?
        1.5. Naming Conventions
    2. K-Nearest Neighbors
        2.1. K-Nearest Neighbors Example
        2.2. Iris Data Set
        2.3. Downloading Data Sets
        2.4. Loading Data Sets
        2.5. Partitioning Data
        2.6. Distance Metrics
        2.7. Norm Metrics
        2.8. Minimize Flipping
        2.9. Finding K Neighbors
        2.10. Weighted Average or Mode
        2.11. Weighting Schemes
        2.12. Euclidean Distance Squared
        2.13. Pairwise Euclidean Distance Squared
        2.14. K-Fold Cross Validation
    3. K-Means
        3.1. K-Means Example
        3.2. Centroid Initialization
        3.3. K-Means++ Initialization
        3.4. Lloyd’s Algorithm
        3.5. Ungrouping Clusters
        3.6. The Elbow Method
        3.7. Partitional Clustering
    4. Hierarchical Agglomerative Clustering
        4.1. Hierarchical Agglomerative Clustering Example
        4.2. Cluster Benchmark Data Sets
        4.3. Dissimilarity Matrix
        4.4. Lance-Williams Algorithm
        4.5. Linkage Functions
        4.6. Building Clusters
        4.7. The Silhouette Method
    5. Expectation-Maximization
        5.1. Bernoulli Distribution
        5.2. Expectation-Maximization Algorithm
        5.3. Binomial Distribution
        5.4. Multinomial Distribution
        5.5. Bernoulli Mixture Models
        5.6. Gaussian Mixture Model
    6. Naive Bayes
        6.1. Fit Naive Bayes
        6.2. Gaussian Naive Bayes
        6.3. Multinomial Naive Bayes
        6.4. Binary Classification
        6.5. Text Classification
    7. Decision Trees
        7.1. Impurity Functions
        7.2. Iterative Dichotomizer 3
        7.3. Visualizing Decision Trees
        7.4. Decision Tree Classification
        7.5. Building the Decision Tree
        7.6. Gain Functions
        7.7. Cost Complexity Pruning
        7.8. K-Fold Cross Validation
        7.9. Regression Trees
        7.10. One Hot
    8. Decision Tree Ensembles
        8.1. Ensemble Example
        8.2. Wisconsin Diagnosis Breast Cancer Data Set
        8.3. Bootstrap Aggregation
        8.4. Random Forest
        8.5. Adaptive Boosting
        8.6. Discrete Adaboost
        8.7. AdaBoost Cross Validation
    9. Linear Regression
        9.1. Matrices
        9.2. Matrix Operators
        9.3. Least Squares Regression
        9.4. Multi-Dimensional Least Squares
        9.5. Custom Matrix Operators
        9.6. Normal Equations
        9.7. Gradient Descent
        9.8. Linear Regression Cost
        9.9. Linear Regression Gradient
        9.10. Feature Scaling
        9.11. FMINCG
        9.12. Stopping Conditions
        9.13. Stochastic Gradient Descent
        9.14. Regularization
        9.15. Optimal Regularization
    10. Logistic Regression
        10.1. Logistic Regression Example
        10.2. Wisconsin Diagnosis Breast Cancer Data Set
        10.3. Sigmoid
        10.4. Log-Loss Cost
        10.5. Logistic Regression Cost
        10.6. Logistic Regression Gradient
        10.7. Optimal Regularization
        10.8. Receiver Operating Characteristic
        10.9. Area Under the Curve
        10.10. Multi-Class Classification
        10.11. One-vs.-All Classifier
        10.12. Fitting One vs. All
        10.13. Predicting One vs. All
    11. Neural Networks
        11.1. Perceptron
        11.2. Neural Network
        11.3. Neural Network Example
        11.4. Neural Network Matrices Initialization
        11.5. Neural Network Topology
        11.6. Neural Network Weight Initialization
        11.7. Neural Network Cost
        11.8. Neural Network Gradient
        11.9. Neural Network Activation Functions
        11.10. Neural Network Cost and Gradient
        11.11. Neural Network Multinomial Classification
        11.12. Neural Network Regression
    12. Recommender Systems
        12.1. MovieLens Data Set
        12.2. Content-Based Filtering
        12.3. Rating Data Summary
        12.4. Null-Aware Aggregations
        12.5. User-User Collaborative Filtering
        12.6. Item-Item Collaborative Filtering
        12.7. Collaborative Filtering
        12.8. Collaborative Filtering Cost
        12.9. Collaborative Filtering Gradient
        12.10. Stochastic Gradient Descent
        12.11. Alternating Least Squares
    13. PageRank
        13.1. Adjacency Matrix
        13.2. Markov Matrix
        13.3. Google Matrix
        13.4. Power Method
        13.5. Algebraic PageRank
        13.6. Iterative PageRank
        13.7. Sparse Matrices
        13.8. Sparse Matrix Iterative PageRank
    14. Graphics
        14.1. ASCII plots
        14.2. Mandelbrot Set
        14.3. Portable Image Formats
        14.4. Sparklines
        14.5. Charting Stock Prices
    A. Source Code
        A.1. Utilities
        A.2. Machine-Learning Algorithms
        A.3. External Libraries Ported to Q
        A.4. Data Libraries
    Index

    List of Figures

    7.1. Weather Decision Tree
    11.1. Perceptron with Heaviside Activation
    11.2. Neural Network with Two Hidden Layers
    13.1. Example Web Page Links
    14.1. Black and White Mandelbrot Set
    14.2. Grayscale Mandelbrot Set

    List of Examples

    2.1. Distance Metrics
    3.1. K-Means
    3.2. K-Means Initial Clusters
    3.3. K-Means Centroids After One Round
    3.4. K-Means Clusters After One Round
    3.5. K-Means Centroids After Two Rounds
    3.6. K-Means Centroids After Three Rounds
    3.7. K-Means Clusters After Convergence
    3.8. Contrived K-Means
    3.9. Sum of Squared Errors Within
    3.10. Percentage of Total Variance
    4.1. Large HAC Clusters
    4.2. Silhouette I
    4.3. Small HAC Clusters
    4.4. Silhouette II
    5.1. Binomial EM Confusion Matrix
    5.2. Multinomial EM Confusion Matrix
    5.3. MNIST Digits
    5.4. Prototypical Digits with Random Noise
    5.5. Prototypical Digits
    5.6. Bernoulli Mixture Model EM Confusion Matrix
    5.7. Gaussian Mixture Model EM Confusion Matrix
    6.1. Naive Bayes Mistakes
    6.2. Binary Classification Confusion Matrix
    7.1. Mean Squared Errors
    7.2. Misclassification
    7.3. Entropy
    7.4. Gini
    8.1. Bootstrap Aggregation Accuracy
    8.2. Random Forest Regression Training Error
    8.3. Random Forest Regression Test Error
    8.4. Random Forest Classification Training Error
    8.5. Random Forest Classification Test Error
    8.6. AdaBoost Training Accuracy
    8.7. AdaBoost Test Accuracy
    9.1. Sine
    9.2. Linear Regression Cost Convergence
    9.3. Optimal Linear Regression Regularization
    10.1. Sigmoid
    10.2. Log-Loss
    10.3. Optimal Logistic Regression Regularization
    10.4. Logistic Regression Binary Classification Confusion Matrix
    10.5. Low Threshold Binary Classification Confusion Matrix
    10.6. High Threshold Binary Classification Confusion Matrix
    10.7. Discriminating Receiver Operating Characteristic
    10.8. Non-discriminating Receiver Operating Characteristic
    10.9. Negative Discriminating Receiver Operating Characteristic
    10.10. MNIST Digits
    10.11. Logistic Regression One-vs.-All Confusion Matrix
    10.12. Incorrectly Classified MNIST Digits
    11.1. Neural Network Multinomial Classification Confusion Matrix
    11.2. Row-Normalized Confusion Matrix
    11.3. Column-Normalized Confusion Matrix
    14.1. Heat Map
    14.2. Axis Labels
    14.3. Black and White Mandelbrot Set
    14.4. Gray Scale Mandelbrot Set

    List of Equations

    2.1. Manhattan Distance
    2.2. Euclidean Distance
    2.3. Minkowski Distance
    2.4. Manhattan Norm
    2.5. Euclidean Norm
    2.6. Parameterized Norm
    2.7. Cosine Similarity
    2.8. Pairwise Euclidean Distance Squared
    4.1. Lance-Williams Recursive Update
    5.1. Binomial Likelihood
    5.2. Binomial Maximum Likelihood Estimator
    5.3. Multinomial Likelihood
    5.4. Multinomial Mixture Model Maximum Likelihood Estimator
    5.5. Bernoulli Mixture Model Likelihood
    5.6. Binomial Mixture Model Maximum Likelihood Estimator
    5.7. Multivariate Gaussian Likelihood
    5.8. Multivariate Gaussian Maximum Likelihood Estimator
    7.1. Decision Tree Cost Complexity
    8.1. AdaBoost Strong Classifier
    8.2. Exponential Loss
    9.1. Normal Equations
    9.2. Regularized Linear Regression Cost
    9.3. Regularized Linear Regression Gradient
    9.4. Ridge Regression
    10.1. Sigmoid
    10.2. Log-Loss
    10.3. Regularized Logistic Regression Cost
    10.4. Regularized Logistic Regression Gradient
    11.1. Heaviside Step
    11.2. Neural Network Cost
    11.3. Sigmoid
    11.4. Sigmoid Gradient
    11.5. Hyperbolic Tangent
    11.6. Hyperbolic Tangent Gradient
    11.7. ReLU
    11.8. ReLU Gradient
    11.9. LReLU
    11.10. LReLU Gradient
    11.11. Softmax
    11.12. Cross Entropy
    12.1. Collaborative Filtering Cost
    12.2. Collaborative Filtering Gradient
    12.3. Matrix-Factorization Optimization
    12.4. Matrix-Factorization Updates
    14.1. Complex Multiplication


    Preface

    I love data. Let me clarify that. I love structured data. My first job was to administer a Sybase database and the Perl scripts that loaded client positions. I then transitioned to a trading desk where we used Visual Basic for Applications (VBA) to enter trades into a Microsoft Access database. From there, I moved to an automated trading team that stored all data in proprietary self-describing text messages.

    Lucky for me, I switched to a team that used kdb+ as the data warehouse and therefore q as the programming language. Over the ensuing years, I continued to use q as my development language of choice, and spent many days—if not weeks—writing functions and C-bindings to import data from (and export data to) different formats including fixed-width binary files and proprietary middleware APIs. The goal in each case was to bring more data to the q ecosystem.

    Convinced that q was the language of choice for big data analysis, I turned my attention to bringing analytics to the language as well. Data scientists often treat kdb+ as nothing more than a source of data. To accommodate the limited capacity for large data sets in languages such as R and Python, they are required to aggregate the data and, in the process, lose information. Instead of bringing summary statistics into languages that were not designed for scale, wouldn’t it be better to perform the analysis directly on the data stored in kdb+?

    I was first introduced to machine-learning algorithms in 2015 when a friend suggested I take the Coursera Machine Learning course[1] taught by Coursera’s co-founder, Andrew Ng. After completing all the homework assignments in Octave (an open source alternative to Matlab), I decided to re-implement the whole course in q.

    My goal was to push q to its limits, find its Achilles heel and keep pushing until it begged for mercy. In the process, I found the functional vector implementation of many algorithms quite elegant. Others, however, were downright embarrassing. I found that q’s internal data structures were not designed for fast matrix multiplication. With carefully factored code, however, and the use of an open-source math library, the performance could be improved dramatically. I found the lack of complex data types frustrating and wished q had native sparse matrix operators. I found that transposing large matrices severely impacts performance and algorithms must be designed to prevent this. And finally, I found that q’s limits on function, argument and branch length make translating code from other languages very difficult without refactoring.[2]

    I became addicted to implementing machine-learning algorithms and continued to read the original papers for algorithms such as decision trees, adaptive boosting and recommender systems. As I added each new algorithm, I refactored the library to maximize code performance and reuse. I wrote each algorithm on the free 32-bit version of kdb+ and limited all interaction (including visualization) to the q terminal. To make the code suitable for inclusion in this book, all functions and comments were required to be less than 77 characters wide. I believe this constraint improved the code clarity by forcing algorithms to be refactored into functions with descriptive names.

    I learned that there exists a problem domain where q implementations of machine-learning algorithms, such as k-means, perform well. Others, such as support vector machines, are best delegated to the canonical implementation written in C/C++. This book is dedicated to machine-learning algorithms implemented in pure q.

    I enjoy coding in q. In fact, I truly believe coding in q is fun. My goal is to share this passion with you while simultaneously teaching you machine-learning algorithms and concepts as well as many q coding paradigms. Along the way, I hope you have fun too.


    [1] Machine Learning, Coursera. [Online]. Available: https://www.coursera.org/learn/machine-learning. [Accessed: 12-Mar-2020].

    [2] Some of these limits have been relaxed with the release of kdb+ 3.6. Instructions embedded within if, do, and while branching control structures were previously limited to 255 bytecodes. This has now been extended to 65,025 bytecodes. The number of permitted constants, as well as the number of local and global variables, has also been increased. The limit of 8 function arguments, however, still remains.

    Chapter 1. Introduction

    Data comes in many shapes: bytes, text, integral and floating. Data comes in many sizes: boolean flags, text config files, binary images and partitioned databases. Stored and ignored, data is expensive to maintain and back up. The quality of the data is only revealed once it is analyzed. And the value of the data is only determined once it is used to make and test predictions.

    Analysis can take the form of visual inspection or statistical summary. Predictions are routinely performed and tested in academia and business. The goal of accurate prediction varies slightly between these two fields, however. While academia strives for truth, business searches for profit. And where there is incentive in academia to share data and analysis, business has historically held it captive.

    Recent advances in computational analysis and computing power, and the reduction in storage costs, have ignited a data revolution. Whether we call it data science, machine learning, deep learning or artificial intelligence, the combined power of data and computational analysis reconfirms that we live in the information age.

    Data is worthless if you can’t extract information from it. As Clifford Stoll and Gary Schubert are quoted as saying, data is not information, information is not knowledge, knowledge is not understanding, and understanding is not wisdom.

    Each step along the path from data to wisdom requires the expenditure of effort and results in an increase in organization and structure. Recent developments in machine-learning techniques have enabled computers to automate the first step from data to information. The success of these algorithms can be attributed to a 10- to 100-fold improvement in algorithm design, an increase in computer speeds by at least 100-fold and an explosion in the amount of data (up to 1,000-fold).[3]

    Given the universal availability of algorithms and hardware, a remaining problem is accessing the volumes of data. Kdb+ is known not only for its database performance, but also for its flexible programming language q. Would it not make sense, therefore, to bring the machine-learning algorithms to the data stored in a kdb+ database?

    1.1. Who is This Book For?

    This book assumes an introductory knowledge of kdb+. It also assumes you have q installed (or know how to install it). Most q developers use the language to build tickerplants, databases or CEP engines. Few use q as an analytics platform, however. This is beginning to change. This book is for the community of developers who, like me, have fallen in love with programming in q and want to extend their computing knowledge into the world of big data and machine-learning algorithms while simultaneously learning how to write better q. It is also for the data scientists who find that the data they need—or want—to analyze is increasingly being stored in a kdb+ database. Using their existing machine-learning knowledge, they will be able to learn q programming idioms that will improve their ability to write efficient code.

    1.2. How to Read This Book

    This book was written to be read linearly from beginning to end. All code required to run the examples in these chapters is explained when first introduced. You may, of course, jump to future chapters if you are comfortable using functions without understanding their implementation. The last chapter, Chapter 14, is an exception and can be read at any time you are looking for some fun.

    I am excited that you purchased this book and are interested in both machine learning and q. Having the book on your desk is a great first step, but the knowledge embedded within won’t magically jump to your brain. I suggest that you run each of the examples yourself. Play with the inputs, examine the outputs and perhaps step through the algorithms by declaring variables with the same name as the function parameters. This is made easier by the consistent naming conventions used in the funq project and listed in Section 1.5. I wrote this book for you and hope you make it through each chapter.

    1.3. Getting Started

    The source code for this book is included in Appendix A for reference, but can—and should—be downloaded from https://www.fun-q.net. To ensure compatibility with the examples in this book, download release v1.0.0. Once downloaded, you should start your q process from the funq project folder. In this example, I’ve downloaded the 32-bit version of kdb+ 4.0. I always use the -s command line option to start q with multiple secondary threads to enable parallel code execution.

    bash-3.2$ cd funq

    bash-3.2$ q -s 4

    KDB+ 4.0 2020.06.18 Copyright (C) 1993-2020 Kx Systems

    m32/ 4(4)core 8192MB nick nicks-macbookpro.local 192.168.1.3 NONEXPIRE

     

    q)

    There are four main libraries that include the utilities and machine-learning algorithms: ut.q, ml.q, fmincg.q and porter.q. The funq.q library loads each of these files.[4]

    q)\l funq.q

    q)

    In addition to the directories distributed with q, the ut, ml, fmincg and porter directories should now be visible.

    q)key `

    `q`Q`h`j`o`ut`ml`fmincg`porter

    The funq project also includes many other libraries dedicated to downloading data from the internet and loading them into memory. We can, for example, load the full text of Pride and Prejudice by Jane Austen.

    q)\l pandp.q

    [down]loading pride and prejudice text

    http://www.gutenberg.org/files/1342/1342_0.txt

    The text, with each chapter represented by a vector of characters, can then be found in pandp.s.

    q)pandp.s

    "It is a truth universally acknowledged, that a single man in\npossession..

    "Mr. Bennet was among the earliest of those who waited on Mr.\nBingley. H..

    "Not all that Mrs. Bennet, however, with the assistance of her\nfive daug..

    "When Jane and Elizabeth were alone, the former, who had been\ncautious i..

    "Within a short walk of Longbourn lived a family with whom the\nBennets w..

    "The ladies of Longbourn soon waited on those of Netherfield. The\nvisit ..

    "Mr. Bennet's property consisted almost entirely in an estate of\ntwo tho..

    ..

    Other data sets automatically load their contents as tables, matrices and vectors—depending on the appropriate abstraction.

    Writing a book that requires data is a brittle process. The examples will break if the data sets move location or change format. All data sets are, therefore, loaded from a dedicated q source file. If future modifications are required, the funq project will be updated to ensure the examples remain valid. Also, if you find that a data set required in the examples below has changed location, please raise an issue in the funq repository.

    Installing 7-Zip

    The funq project was written with the expectation that it can be run on any platform on which kdb+ is supported: Windows, Linux, Solaris and Mac operating systems. Although Linux, Solaris, and Mac come with utilities to uncompress data files in various formats such as .zip, .gz and .tar.gz, Windows does not. To uncompress files on a machine running the Windows operating system, the ut.q library uses the 7-Zip utility (7z.exe). This utility can be downloaded from https://www.7-zip.org.

    1.4. What is in This Book?

    We begin our journey through machine-learning algorithms in Chapter 2 with the most intuitive supervised learning algorithm: k-nearest neighbors (k-NN). The chapter starts by introducing the funq project, which includes all the code required to run the examples in this book. It then demonstrates the project’s ability to download external data sets by loading, and performing k-NN analysis on, the ever-popular Iris data set. It then takes a step back to explain different distance metrics that are used in machine learning and their relationship with norm metrics. The k-NN implementation is then discussed and examples of using it with different distance metrics are presented. A few common optimizations are then reviewed and the chapter concludes with a presentation of the confusion matrix and a discussion on using cross validation to pick the best value for k (the number of neighbors).
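
    To give a flavor of what is coming, each of these distance metrics can be written in a single line of q. The definitions below are illustrative sketches only (the names and the funq implementations presented in Chapter 2 may differ), and they operate on two numeric vectors of equal length.

    q)mdist:{sum abs x-y}                      / Manhattan (L1) distance
    q)edist:{sqrt sum d*d:x-y}                 / Euclidean (L2) distance
    q)mkdist:{(sum abs[x-y] xexp z) xexp 1%z}  / Minkowski distance of order z
    q)edist[1 2 3f;4 6 3f]
    5f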

    We next turn to a simple, yet elegant, unsupervised clustering algorithm in Chapter 3 called k-means. To explain the k-means algorithm, we discuss the implementation of the Lloyd method of iterative assignment and update steps, which results in the convergence of k distinct clusters. Three different initialization techniques are covered including random cluster selection, random cluster assignment and an algorithm specifically designed to produce distant clusters: k-means++. The elbow method is then used to pick the optimal number of clusters. To demonstrate the flexibility of the algorithm, different distance metrics are used to implement k-medians and k-harmonic means. The chapter ends with a discussion of k-medoids, which requires that the centroids—the center of each cluster—be actual points within the data set.
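
    The assignment and update steps of the Lloyd method are compact enough to sketch here. The snippet below is an illustration under simplified assumptions (X is a hypothetical list of numeric points stored as rows, C0 a list of k initial centroids, and every centroid is assumed to keep at least one point); it is not the funq implementation.

    q)dist2:{sum d*d:x-y}                        / squared Euclidean distance
    q)nearest:{[C;x]first iasc dist2[x] each C}  / index of the closest centroid
    q)lloyd:{[X;C]{avg x where y=z}[X;nearest[C] each X] each til count C}  / assign, then update
    q)C:lloyd[X] over C0                         / iterate to convergence from the initial centroids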

    We continue our discussion on clustering techniques in Chapter 4 by presenting hierarchical agglomerative clustering. The chapter begins by downloading and describing the University of Eastern Finland’s clustering benchmark data set. After first building a dissimilarity matrix, the iterative Lance-Williams algorithm is dissected and its efficiency and flexibility are demonstrated. Different linkage functions are defined and Ward’s method is explained. Methods for generating and ungrouping clusters are reviewed before the chapter concludes by demonstrating how the silhouette cluster quality method can be used to determine the optimal number of clusters.

    The final clustering algorithm, expectation-maximization (EM), is described in Chapter 5. The expectation-maximization algorithm, which is used to estimate distribution parameters, can be used as a generalization of the k-means clustering algorithm. Where the k-means algorithm assigns each observation to a single cluster, the expectation-maximization algorithm assigns each observation a probability of belonging to each of the specified clusters. Probabilities require distributions and the chapter introduces the binomial, multinomial, Bernoulli mixture model and Gaussian mixture model distributions using coin flips, dice rolls, the MNIST handwritten-digit data set and the Iris data set respectively. The concepts of likelihood and log likelihood estimators are also introduced.

    Chapter 6 returns to supervised learning algorithms and the use of probability distributions with the naive Bayes algorithm. The Iris data set is used again to demonstrate the implementation of Gaussian naive Bayes while the SMS spam data set is analyzed with multinomial naive Bayes. The SMS spam data set is our first introduction to text processing—also referred to as natural language processing (NLP). We then introduce a q implementation of the Porter stemmer, which allows us to combine English words with the same meaning: transformer and transformers, for example. We then review binary classification statistics that allow us to quantify how good our binary predictions are. The chapter concludes by introducing the term-frequency inverse document frequency (TF-IDF) statistic and uses it to predict which Jane Austen book a randomly selected chapter comes from.
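
    The binary classification statistics mentioned above reduce to one-liners in q. The definitions below are rough sketches over 0/1 vectors of actual labels y and predictions p (illustrative names, not the funq definitions).

    q)prec:{[y;p]sum[y&p]%sum p}    / precision: true positives over predicted positives
    q)recall:{[y;p]sum[y&p]%sum y}  / recall: true positives over actual positives
    q)f1:{[pr;rc]2*pr*rc%pr+rc}     / F1 score: harmonic mean of precision and recall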

    Chapter 7 describes a powerful (and perhaps my favorite) machine-learning algorithm—decision trees. By continuously splitting a data set based on the values of each feature, we build a tree that, most importantly, creates an intuitive model of the underlying data. Decision trees have a long history and many competing designs. The funq project provides a general-purpose tree-building algorithm that is parameterized to allow the generation of multiple implementations. After the chapter introduces the impurity functions, which are used to determine how much information is in each of the data set features, the ID3 classification-tree algorithm is examined and applied to the famous weather data set. Before proceeding to review more complex classification and regression trees (CART), the chapter demonstrates two visualization tools: the first produces a terminal-appropriate text tree while the second generates output that can be used to produce an image with the Graphviz tool. Digging into the algorithm, the chapter continues to discuss different information-gain calculations, which are used to identify which feature should be split at each branch in the tree. The Iris data set is once again used to demonstrate the power of decision trees. To prevent overly bushy trees and their tendency to overfit, a pruning technique and cross validation are used to pick the appropriate tree complexity. Regression trees that can be used to predict numeric values are then introduced and the chapter concludes by explaining how one-hot encoding can be used to transform categorical features into multiple features with binary values.
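
    The impurity functions at the core of the chapter are also worth previewing. The following are rough sketches over a vector of class labels, not the funq definitions.

    q)p:{(count each group x)%count x}   / class proportions
    q)entropy:{neg sum p[x]*2 xlog p x}  / Shannon entropy
    q)gini:{1-sum p[x] xexp 2}           / Gini impurity
    q)(entropy;gini)@\:"aabb"
    1 0.5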

    If one decision tree is good, multiple decision trees must be great. Chapter 8 introduces the concept of ensemble learning to distinguish between benign and malignant tumors in the Wisconsin Diagnosis Breast Cancer data set. By combining multiple fully grown trees, we can reduce the tendency of single trees to overfit. This is explained in the section about random forests. The second ensemble method combines multiple decision stumps—decision trees with a single branch—in an algorithm called AdaBoost. By expertly combining multiple weak classifiers, we build a powerful strong classifier. The chapter concludes by using cross validation to determine the optimal number of decision stumps to combine.

    We next take our first step towards the magic of deep learning. The journey starts in Chapter 9 with linear regression. Matrices and matrix operators are reviewed and used with the wine-quality data set to predict the quality of red wine. We then review abstractions built into the funq project that enable faster matrix operations. After presenting the normal equations, gradient descent is introduced so optimization problems without closed-form solutions can be solved. After describing the concept of feature scaling, we present functions to perform conjugate gradient and stochastic gradient descent. L1 and L2 regularization functions are reviewed before their relationship with LASSO, ridge regression and elastic net is explained. The chapter ends with a review of the cross-validation technique used to determine the best regularization parameters.
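
    The gradient-descent update at the heart of the chapter fits in a few lines. The sketch below assumes a hypothetical float matrix X whose rows are observations (with a leading column of ones for the intercept), a float target vector y and a starting coefficient vector th0; it is an illustration, not the funq implementation with its feature scaling, regularization and conjugate-gradient machinery.

    q)grad:{[X;y;th](flip[X] mmu (X mmu th)-y)%count y}  / least-squares cost gradient
    q)step:{[a;X;y;th]th-a*grad[X;y;th]}                 / one update with learning rate a
    q)th:step[0.01;X;y]/[1000;th0]                       / 1000 iterations from the starting guess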

    A popular binary classification algorithm called logistic regression is used with the Wisconsin Diagnosis Breast Cancer data set in Chapter 10 to model tumor malignancy. The chapter begins by introducing the sigmoid function and the log-loss cost function. In order to find optimal model parameters with gradient descent, the logistic regression cost and gradient functions are defined. Multi-class classification is then explained and a one-vs.-all classifier is implemented. The chapter concludes by testing the accuracy of the algorithm on the MNIST handwritten-digit data set.
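
    Both building blocks are one-liners in q. The definitions below are hedged sketches (illustrative names, not necessarily the funq definitions), where y is a vector of 0/1 labels and p the predicted probabilities.

    q)sigmoid:{1%1+exp neg x}                         / squashes any real value into (0;1)
    q)logloss:{[p;y]neg avg (y*log p)+(1-y)*log 1-p}  / penalizes confident wrong predictions
    q)sigmoid 0 -2 2f
    0.5 0.1192029 0.8807971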

    Having warmed up with a lap through logistic regression, Chapter 11 sprints across the neural network finish line. The chapter begins with a brief history of the perceptron before describing a fully-connected neural network with a single hidden layer. To prevent the common vanishing-gradient problem, two different parameter-initialization techniques are reviewed. The cost and gradient functions for a neural network with a sigmoid activation function and log-loss cost function are introduced along with the concepts of feed-forward loss calculation and backward propagation. To demonstrate the flexibility of the functional neural network implementation, different activation functions—such as the hyperbolic tangent and rectified linear unit (ReLU)—and cost functions—such as the cross entropy and softmax functions—are used to analyze the MNIST handwritten-digit data set. The chapter concludes by demonstrating how, in addition to classification problems, neural networks can be used to model regression problems as well.

    With the concepts of linear regression, k-nearest neighbors, gradient descent and regularization firmly in our tool belt, Chapter 12 leverages these techniques to implement movie recommender systems. The chapter begins by reviewing the MovieLens data set and uses regularized linear regression to create a content-based recommender system. By using genre labels such as Romance and Comedy, the model is able to recommend movies that match the genres that we have ranked the highest. We then implement a user-user collaborative filtering recommender system that uses k-nearest neighbors to recommend movies other people with similar tastes also liked. We then introduce two matrix-factorization techniques that decompose the rating matrix into user and item matrices that reflect the user’s interest in, and the movie’s exposure to, hidden features. The first technique uses gradient descent and the second uses a method called alternating least squares.

    We finish our discussion of machine learning with an algorithm that made it to the top 10 machine-learning algorithms in 2006. It has since dropped off the list but remains a powerful algorithm that affects our everyday lives. Chapter 13 explains the PageRank algorithm, which was developed by Larry Page and Sergey Brin to rank search results. The chapter begins with the construction of an adjacency matrix to model the links between web pages. This is transformed into a transition matrix, also called a stochastic or Markov matrix. By adding random surfing, a Google matrix is created. Three implementations of the PageRank algorithm are then introduced. The power method, which iteratively produces better rankings, is explained first. This is followed by an algebraic solution that uses matrix inversion. The final algorithm uses another iterative approach that allows new observations to adjust a previously stored page rank vector. The chapter concludes with the introduction of sparse matrices and a sparse matrix implementation of the iterative PageRank algorithm.
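
    The power method itself is nothing more than a repeated matrix-vector product. The sketch below is illustrative (M is a hypothetical column-stochastic link matrix of floats and the names are my own), not the funq implementation.

    q)google:{[d;M](d*M)+(1-d)%count M}            / Google matrix with damping factor d
    q)power:{[G;r]G mmu r}                         / one power iteration: r <- G r
    q)r:power[google[0.85;M]]/[50;n#1f%n:count M]  / 50 iterations from a uniform rank vector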

    Admittedly, implementing this many machine-learning algorithms in q may be more fun than you can handle. But if you are ready for more, the last chapter is for you. Chapter 14 is dedicated to using q to visualize data. The chapter begins by reviewing the ASCII plotting tool used throughout the book. The next section uses complex numbers to generate a black and white, as well as grayscale, image of the Mandelbrot set. We then use utility functions to store the Mandelbrot matrices in each of the three Netpbm portable bitmap file formats: black and white, grayscale and color. This allows us to view the images in their full glory with most standard graphics programs. The chapter, book and fun come to an end when we use q to implement an algorithm to generate sparklines and plot financial time-series data.

    1.5. Naming Conventions

    The funq project uses single-letter variables, where possible, for function parameters and local variables. In an attempt at consistency, I list the most common variables and their meanings below. Atoms, vectors, tables and functions use lowercase letters. Matrices, however, are represented with uppercase letters.

    Table 1.1. Naming Conventions


    [3] E. Brynjolfsson and A. McAfee, What’s Driving the Machine Learning Explosion?, Harvard Business Review, 18 Jul 2017.

    [4] The funq.q library also attempts to load a few other libraries, but these require extra packages to be installed and are not used in this book.

    Chapter 2. K-Nearest Neighbors

    Machine-learning algorithms can be divided into two types: supervised and unsupervised. Supervised learning algorithms are trained by giving them a set of inputs and explicitly telling them what the output should be. Unsupervised learning algorithms, however, are only given input data sets and are expected to find models that describe the relationship between data points. Where unsupervised learning algorithms only model the provided data, supervised learning algorithms build models on training data sets so the models can then be used to make predictions on as-yet unseen test data sets.

    Building a model to describe the training data has two benefits:

    The costly process of building a model can be performed offline, while classification and regression can be quickly performed online.

    Large amounts of training data can be used to build the model and then left behind during prediction.

    These algorithms are known as eager learning algorithms because they consume their training data immediately. While the remainder of this book will focus on eager learning examples, this chapter is about the lazy learning algorithm called k-nearest neighbors. It is lazy because all computations on the training data are deferred until the predictions are required. The k-nearest neighbors algorithm is also known as an instance-based algorithm because the training instances are kept with the model and used to make predictions for new observations. Algorithms that build models, and leave the training data behind, are intuitively called model-based algorithms.

    The k-nearest neighbors algorithm is the simplest and most intuitive supervised learning algorithm. It begins by building a matrix of distances between the training and test data sets. The algorithm then attempts to make a classification (regression) prediction by using the labels (values) of the k nearest training data points to classify (predict) the label (value) of the new observations.
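
    Before turning to the funq example, here is a self-contained sketch of the idea. It is not .ml.knn and uses a simplified convention: the hypothetical X and Xt hold training and test observations as rows, y holds the training labels, and ties between labels are broken arbitrarily.

    q)dist2:{sum d*d:x-y}                             / squared Euclidean distance
    q)mode:{first key desc count each group x}        / most frequent label
    q)knn:{[k;X;y;xt]mode y k#iasc dist2[xt] each X}  / label xt by its k nearest rows of X
    q)knn[3;X;y] each Xt                              / classify every test observation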

    2.1. K-Nearest Neighbors Example

    Let’s start with a quick demonstration of how the funq project can use the k-NN algorithm to model the iris data set. This may seem like a cursory treatment of the material. Have no fear, we will dissect every line of code as the chapter progresses.

    The following code loads the funq.q library and the iris data set from the iris.q library.[5] It then splits the data into training and test partitions and extracts the data into matrices X and Xt, representing the training and test observations, and vectors y and yt representing the training and test species labels.

    q)\l funq.q

    q)\l iris.q

    [down]loading iris data set

    q)d:.ut.part[`train`test!7 3;0N?] iris.t

    q)y:first get first `Y`X set' 0 1 cut value flip d`train

    q)yt:first get first `Yt`Xt set' 0 1 cut value flip d`test

    Using the training partition to classify each unseen Iris observation in the test partition with the k-nearest neighbors algorithm obtains an accuracy of roughly 98% when only the nearest neighbor is used.

    q)avg yt=.ml.knn[1f%;1;y] .ml.edist[X] each flip Xt

    0.9777778

    Using three neighbors manages to make a perfect prediction.

    q)avg yt=.ml.knn[1f%;3;y] .ml.edist[X] each flip Xt

    1f

    And finally, the performance degrades when we start using more distant neighbors.

    q)avg yt=.ml.knn[1f%;7;y] .ml.edist[X] each flip Xt

    0.9555556

    This is just one example of how k-nearest neighbors can be parameterized. The algorithm requires us to make three choices:

    Which distance metric should we use?

    How many neighbors should we include?

    How much weight should we give each neighbor?

    Before addressing each
