Fun Q: A Functional Introduction to Machine Learning in Q
By Nick Psaris
()
About this ebook
Bring the power of machine learning to the fastest time-series database. Fun Q uses the powerful q programming language to implement many of the most famous machine-learning algorithms. Using a meticulously factored machine-learning library, each algorithm is broken into its basic building blocks and then rebuilt from scratch. &nb
Related to Fun Q
Related ebooks
Deep Learning with R Rating: 0 out of 5 stars0 ratingsTime Series Forecasting in Python Rating: 0 out of 5 stars0 ratingsD3.js in Action: Data visualization with JavaScript Rating: 0 out of 5 stars0 ratingsMachine Learning in Action Rating: 0 out of 5 stars0 ratingsAlgorithms and Data Structures for Massive Datasets Rating: 0 out of 5 stars0 ratingsInside Deep Learning: Math, Algorithms, Models Rating: 0 out of 5 stars0 ratingsMastering Large Datasets with Python: Parallelize and Distribute Your Python Code Rating: 0 out of 5 stars0 ratingsDeep Learning with Python Rating: 5 out of 5 stars5/5Machine Learning: A Bayesian and Optimization Perspective Rating: 3 out of 5 stars3/5Bayesian Inference: With Ecological Applications Rating: 1 out of 5 stars1/5Q Tips: Fast, Scalable, and Maintainable Kdb+ Rating: 0 out of 5 stars0 ratingsTensorFlow in Action Rating: 0 out of 5 stars0 ratingsWebsite Scraping with Python: Using BeautifulSoup and Scrapy Rating: 0 out of 5 stars0 ratingsProbabilistic Deep Learning: With Python, Keras and TensorFlow Probability Rating: 0 out of 5 stars0 ratingsPractical C++20 Financial Programming: Problem Solving for Quantitative Finance, Financial Engineering, Business, and Economics Rating: 0 out of 5 stars0 ratingsTheory of Markov Processes Rating: 0 out of 5 stars0 ratingsComputational Finance Using C and C# Rating: 0 out of 5 stars0 ratingsGANs in Action: Deep learning with Generative Adversarial Networks Rating: 0 out of 5 stars0 ratingsTensorflow Machine Learning Complete Self-Assessment Guide Rating: 0 out of 5 stars0 ratingsAngularJS in Action Rating: 0 out of 5 stars0 ratingsMachine Learning Bookcamp: Build a portfolio of real-life projects Rating: 4 out of 5 stars4/5Practical Recommender Systems Rating: 5 out of 5 stars5/5Handbook of Advanced Mathematics Rating: 0 out of 5 stars0 ratingsScala in Action Rating: 0 out of 5 stars0 ratingsThe Well-Grounded Java Developer: Vital techniques of Java 7 and polyglot programming Rating: 4 out of 5 stars4/5Apache Solr Search Patterns Rating: 0 out of 5 stars0 ratingsSpark GraphX in Action Rating: 0 out of 5 stars0 ratingsDeep Learning with Structured Data Rating: 0 out of 5 stars0 ratingsPython: Deeper Insights into Machine Learning Rating: 0 out of 5 stars0 ratings
Intelligence (AI) & Semantics For You
ChatGPT For Fiction Writing: AI for Authors Rating: 5 out of 5 stars5/5Artificial Intelligence: A Guide for Thinking Humans Rating: 4 out of 5 stars4/52084: Artificial Intelligence and the Future of Humanity Rating: 4 out of 5 stars4/5101 Midjourney Prompt Secrets Rating: 3 out of 5 stars3/5Summary of Super-Intelligence From Nick Bostrom Rating: 5 out of 5 stars5/5Our Final Invention: Artificial Intelligence and the End of the Human Era Rating: 4 out of 5 stars4/5The Secrets of ChatGPT Prompt Engineering for Non-Developers Rating: 5 out of 5 stars5/5Mastering ChatGPT: 21 Prompts Templates for Effortless Writing Rating: 5 out of 5 stars5/5Chat-GPT Income Ideas: Pioneering Monetization Concepts Utilizing Conversational AI for Profitable Ventures Rating: 4 out of 5 stars4/5Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates Rating: 4 out of 5 stars4/5Dark Aeon: Transhumanism and the War Against Humanity Rating: 5 out of 5 stars5/5Discovery Writing with ChatGPT: AI-Powered Storytelling: Three Story Method, #6 Rating: 0 out of 5 stars0 ratingsImpromptu: Amplifying Our Humanity Through AI Rating: 5 out of 5 stars5/5ChatGPT For Dummies Rating: 0 out of 5 stars0 ratingsMidjourney Mastery - The Ultimate Handbook of Prompts Rating: 5 out of 5 stars5/5Ways of Being: Animals, Plants, Machines: The Search for a Planetary Intelligence Rating: 4 out of 5 stars4/5What Makes Us Human: An Artificial Intelligence Answers Life's Biggest Questions Rating: 5 out of 5 stars5/5The Algorithm of the Universe (A New Perspective to Cognitive AI) Rating: 5 out of 5 stars5/5THE CHATGPT MILLIONAIRE'S HANDBOOK: UNLOCKING WEALTH THROUGH AI AUTOMATION Rating: 5 out of 5 stars5/5AI for Educators: AI for Educators Rating: 5 out of 5 stars5/5ChatGPT Ultimate User Guide - How to Make Money Online Faster and More Precise Using AI Technology Rating: 0 out of 5 stars0 ratingsThe Business Case for AI: A Leader's Guide to AI Strategies, Best Practices & Real-World Applications Rating: 0 out of 5 stars0 ratingsHumans Need Not Apply: A Guide to Wealth & Work in the Age of Artificial Intelligence Rating: 4 out of 5 stars4/5
Reviews for Fun Q
0 ratings0 reviews
Book preview
Fun Q - Nick Psaris
Fun Q
A Functional Introduction to Machine Learning in Q
Nick Psaris
Copyright © 2020 Nick Psaris
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher.
Kx and kdb+ are registered trademarks of Kx Systems, Inc.
Books may be purchased in quantity and/or special sales by contacting the publisher, Vector Sigma at sales@vector-sigma.com.
Preface1. Introduction1.1. Who is This Book For?1.2. How to Read This Book1.3. Getting Started1.4. What is in This Book?1.5. Naming Conventions2. K-Nearest Neighbors2.1. K-Nearest Neighbors Example2.2. Iris Data Set2.3. Downloading Data Sets2.4. Loading Data Sets2.5. Partitioning Data2.6. Distance Metrics2.7. Norm Metrics2.8. Minimize Flipping2.9. Finding K Neighbors2.10. Weighted Average or Mode2.11. Weighting Schemes2.12. Euclidean Distance Squared2.13. Pairwise Euclidean Distance Squared2.14. K-Fold Cross Validation3. K-Means3.1. K-Means Example3.2. Centroid Initialization3.3. K-Means++ Initialization3.4. Lloyd’s Algorithm3.5. Ungrouping Clusters3.6. The Elbow Method3.7. Partitional Clustering4. Hierarchical Agglomerative Clustering4.1. Hierarchical Agglomerative Clustering Example4.2. Cluster Benchmark Data Sets4.3. Dissimilarity Matrix4.4. Lance-Williams Algorithm4.5. Linkage Functions4.6. Building Clusters4.7. The Silhouette Method5. Expectation-Maximization5.1. Bernoulli Distribution5.2. Expectation-Maximization Algorithm5.3. Binomial Distribution5.4. Multinomial Distribution5.5. Bernoulli Mixture Models5.6. Gaussian Mixture Model6. Naive Bayes6.1. Fit Naive Bayes6.2. Gaussian Naive Bayes6.3. Multinomial Naive Bayes6.4. Binary Classification6.5. Text Classification7. Decision Trees7.1. Impurity Functions7.2. Iterative Dichotomizer 37.3. Visualizing Decision Trees7.4. Decision Tree Classification7.5. Building the Decision Tree7.6. Gain Functions7.7. Cost Complexity Pruning7.8. K-Fold Cross Validation7.9. Regression Trees7.10. One Hot8. Decision Tree Ensembles8.1. Ensemble Example8.2. Wisconsin Diagnosis Breast Cancer Data Set8.3. Bootstrap Aggregation8.4. Random Forest8.5. Adaptive Boosting8.6. Discrete Adaboost8.7. AdaBoost Cross Validation9. Linear Regression9.1. Matrices9.2. Matrix Operators9.3. Least Squares Regression9.4. Multi-Dimensional Least Squares9.5. Custom Matrix Operators9.6. Normal Equations9.7. Gradient Descent9.8. Linear Regression Cost9.9. Linear Regression Gradient9.10. Feature Scaling9.11. FMINCG9.12. Stopping Conditions9.13. Stochastic Gradient Descent9.14. Regularization9.15. Optimal Regularization10. Logistic Regression10.1. Logistic Regression Example10.2. Wisconsin Diagnosis Breast Cancer Data Set10.3. Sigmoid10.4. Log-Loss Cost10.5. Logistic Regression Cost10.6. Logistic Regression Gradient10.7. Optimal Regularization10.8. Receiver Operating Characteristic10.9. Area Under the Curve10.10. Multi-Class Classification10.11. One-vs.-All Classifier10.12. Fitting One vs. All10.13. Predicting One vs. All11. Neural Networks11.1. Perceptron11.2. Neural Network11.3. Neural Network Example11.4. Neural Network Matrices Initialization11.5. Neural Network Topology11.6. Neural Network Weight Initialization11.7. Neural Network Cost11.8. Neural Network Gradient11.9. Neural Network Activation Functions11.10. Neural Network Cost and Gradient11.11. Neural Network Multinomial Classification11.12. Neural Network Regression12. Recommender Systems12.1. MovieLens Data Set12.2. Content-Based Filtering12.3. Rating Data Summary12.4. Null-Aware Aggregations12.5. User-User Collaborative Filtering12.6. Item-Item Collaborative Filtering12.7. Collaborative Filtering12.8. Collaborative Filtering Cost12.9. Collaborative Filtering Gradient12.10. Stochastic Gradient Descent12.11. Alternating Least Squares13. PageRank13.1. Adjacency Matrix13.2. Markov Matrix13.3. Google Matrix13.4. Power Method13.5. Algebraic PageRank13.6. Iterative PageRank13.7. Sparse Matrices13.8. Sparse Matrix Iterative PageRank14. Graphics14.1. ASCII plots14.2. Mandelbrot Set14.3. Portable Image Formats14.4. Sparklines14.5. Charting Stock PricesA. Source CodeA.1. UtilitiesA.2. Machine-Learning AlgorithmsA.3. External Libraries Ported to QA.4. Data LibrariesIndex
List of Figures
7.1. Weather Decision Tree11.1. Perceptron with Heaviside Activation11.2. Neural Network with Two Hidden Layers13.1. Example Web Page Links14.1. Black and White Mandelbrot Set14.2. Grayscale Mandelbrot Set
List of Examples
2.1. Distance Metrics3.1. K-Means3.2. K-Means Initial Clusters3.3. K-Means Centroids After One Round3.4. K-Means Clusters After One Round3.5. K-Means Centroids After Two Rounds3.6. K-Means Centroids After Three Rounds3.7. K-Means Clusters After Convergence3.8. Contrived K-Means3.9. Sum of Squared Errors Within3.10. Percentage of Total Variance4.1. Large HAC Clusters4.2. Silhouette I4.3. Small HAS Clusters4.4. Silhouette II5.1. Binomial EM Confusion Matrix5.2. Multinomial EM Confusion Matrix5.3. MNIST Digits5.4. Prototypical Digits with Random Noise5.5. Prototypical Digits5.6. Bernoulli Mixture Model EM Confusion Matrix5.7. Gaussian Mixture Model EM Confusion Matrix6.1. Naive Bayes Mistakes6.2. Binary Classification Confusion Matrix7.1. Mean Squared Errors7.2. Misclassification7.3. Entropy7.4. Gini8.1. Bootstrap Aggregation Accuracy8.2. Random Forest Regression Training Error8.3. Random Forest Regression Test Error8.4. Random Forest Classification Training Error8.5. Random Forest Classification Test Error8.6. AdaBoost Training Accuracy8.7. AdaBoost Test Accuracy9.1. Sine9.2. Linear Regression Cost Convergence9.3. Optimal Linear Regression Regularization10.1. Sigmoid10.2. Log-Loss10.3. Optimal Logistic Regression Regularization10.4. Logistic Regression Binary Classification Confusion Matrix10.5. Low Threshold Binary Classification Confusion Matrix10.6. High Threshold Binary Classification Confusion Matrix10.7. Discriminating Receiver Operating Characteristic10.8. Non-discriminating Receiver Operating Characteristic10.9. Negative Discriminating Receiver Operating Characteristic10.10. MNIST Digits10.11. Logistic Regression One-vs.-All Confusion Matrix10.12. Incorrectly Classified MNIST Digits11.1. Neural Network Multinomial Classification Confusion Matrix11.2. Row-Normalized Confusion Matrix11.3. Column-Normalized Confusion Matrix14.1. Heat Map14.2. Axis Labels14.3. Black and White Mandelbrot Set14.4. Gray Scale Mandelbrot Set
List of Equations
2.1. Manhattan Distance2.2. Euclidean Distance2.3. Minkowski Distance2.4. Manhattan Norm2.5. Euclidean Norm2.6. Parameterized Norm2.7. Cosine Similarity2.8. Pairwise Euclidean Distance Squared4.1. Lance-Williams Recursive Update5.1. Binomial Likelihood5.2. Binomial Maximum Likelihood Estimator5.3. Multinomial Likelihood5.4. Multinomial Mixture Model Maximum Likelihood Estimator5.5. Bernoulli Mixture Model Likelihood5.6. Binomial Mixture Model Maximum Likelihood Estimator5.7. Multivariate Gaussian Likelihood5.8. Multivariate Gaussian Maximum Likelihood Estimator7.1. Decision Tree Cost Complexity8.1. AdaBoost Strong Classifier8.2. Exponential Loss9.1. Normal Equations9.2. Regularized Linear Regression Cost9.3. Regularized Linear Regression Gradient9.4. Ridge Regression10.1. Sigmoid10.2. Log-Loss10.3. Regularized Logistic Regression Cost10.4. Regularized Logistic Regression Gradient11.1. Heaviside Step11.2. Neural Network Cost11.3. Sigmoid11.4. Sigmoid Gradient11.5. Hyperbolic Tangent11.6. Hyperbolic Tangent Gradient11.7. ReLU11.8. ReLU Gradient11.9. LReLU11.10. LReLU Gradient11.11. Softmax11.12. Cross Entropy12.1. Collaborative Filtering Cost12.2. Collaborative Filtering Gradient12.3. Matrix-Factorization Optimization12.4. Matrix-Factorization Updates14.1. Complex Multiplication
List of Figures
7.1. Weather Decision Tree11.1. Perceptron with Heaviside Activation11.2. Neural Network with Two Hidden Layers13.1. Example Web Page Links14.1. Black and White Mandelbrot Set14.2. Grayscale Mandelbrot Set
List of Examples
2.1. Distance Metrics3.1. K-Means3.2. K-Means Initial Clusters3.3. K-Means Centroids After One Round3.4. K-Means Clusters After One Round3.5. K-Means Centroids After Two Rounds3.6. K-Means Centroids After Three Rounds3.7. K-Means Clusters After Convergence3.8. Contrived K-Means3.9. Sum of Squared Errors Within3.10. Percentage of Total Variance4.1. Large HAC Clusters4.2. Silhouette I4.3. Small HAS Clusters4.4. Silhouette II5.1. Binomial EM Confusion Matrix5.2. Multinomial EM Confusion Matrix5.3. MNIST Digits5.4. Prototypical Digits with Random Noise5.5. Prototypical Digits5.6. Bernoulli Mixture Model EM Confusion Matrix5.7. Gaussian Mixture Model EM Confusion Matrix6.1. Naive Bayes Mistakes6.2. Binary Classification Confusion Matrix7.1. Mean Squared Errors7.2. Misclassification7.3. Entropy7.4. Gini8.1. Bootstrap Aggregation Accuracy8.2. Random Forest Regression Training Error8.3. Random Forest Regression Test Error8.4. Random Forest Classification Training Error8.5. Random Forest Classification Test Error8.6. AdaBoost Training Accuracy8.7. AdaBoost Test Accuracy9.1. Sine9.2. Linear Regression Cost Convergence9.3. Optimal Linear Regression Regularization10.1. Sigmoid10.2. Log-Loss10.3. Optimal Logistic Regression Regularization10.4. Logistic Regression Binary Classification Confusion Matrix10.5. Low Threshold Binary Classification Confusion Matrix10.6. High Threshold Binary Classification Confusion Matrix10.7. Discriminating Receiver Operating Characteristic10.8. Non-discriminating Receiver Operating Characteristic10.9. Negative Discriminating Receiver Operating Characteristic10.10. MNIST Digits10.11. Logistic Regression One-vs.-All Confusion Matrix10.12. Incorrectly Classified MNIST Digits11.1. Neural Network Multinomial Classification Confusion Matrix11.2. Row-Normalized Confusion Matrix11.3. Column-Normalized Confusion Matrix14.1. Heat Map14.2. Axis Labels14.3. Black and White Mandelbrot Set14.4. Gray Scale Mandelbrot Set
List of Equations
2.1. Manhattan Distance2.2. Euclidean Distance2.3. Minkowski Distance2.4. Manhattan Norm2.5. Euclidean Norm2.6. Parameterized Norm2.7. Cosine Similarity2.8. Pairwise Euclidean Distance Squared4.1. Lance-Williams Recursive Update5.1. Binomial Likelihood5.2. Binomial Maximum Likelihood Estimator5.3. Multinomial Likelihood5.4. Multinomial Mixture Model Maximum Likelihood Estimator5.5. Bernoulli Mixture Model Likelihood5.6. Binomial Mixture Model Maximum Likelihood Estimator5.7. Multivariate Gaussian Likelihood5.8. Multivariate Gaussian Maximum Likelihood Estimator7.1. Decision Tree Cost Complexity8.1. AdaBoost Strong Classifier8.2. Exponential Loss9.1. Normal Equations9.2. Regularized Linear Regression Cost9.3. Regularized Linear Regression Gradient9.4. Ridge Regression10.1. Sigmoid10.2. Log-Loss10.3. Regularized Logistic Regression Cost10.4. Regularized Logistic Regression Gradient11.1. Heaviside Step11.2. Neural Network Cost11.3. Sigmoid11.4. Sigmoid Gradient11.5. Hyperbolic Tangent11.6. Hyperbolic Tangent Gradient11.7. ReLU11.8. ReLU Gradient11.9. LReLU11.10. LReLU Gradient11.11. Softmax11.12. Cross Entropy12.1. Collaborative Filtering Cost12.2. Collaborative Filtering Gradient12.3. Matrix-Factorization Optimization12.4. Matrix-Factorization Updates14.1. Complex Multiplication
Preface
I love data. Let me clarify that. I love structured data. My first job was to administer a Sybase database and the Perl scripts that loaded client positions. I then transitioned to a trading desk where we used Visual Basic for Applications (VBA) to enter trades into a Microsoft Access database. From there, I moved to an automated trading team that stored all data in proprietary self-describing text messages.
Lucky for me, I switched to a team that used kdb+ as the data warehouse and therefore q as the programming language. Over the ensuing years, I continued to use q as my development language of choice, and spent many days—if not weeks—writing functions and C-bindings to import data from (and export data to) different formats including fixed-width binary files and proprietary middleware APIs. The goal in each case was to bring more data to the q ecosystem.
Convinced that q was the language of choice for big data analysis, I turned my attention to bringing analytics to the language as well. Data scientists often treat kdb+ as nothing more than a source of data. To accommodate the limited capacity for large data sets in languages such as R and Python, they are required to aggregate the data and, in the process, lose information. Instead of bringing summary statistics into languages that were not designed for scale, wouldn’t it be better to perform the analysis directly on the data stored in kdb+?
I was first introduced to machine-learning algorithms in 2015 when a friend suggested I take the Coursera Machine Learning course[1] taught by Coursera’s co-founder, Andrew Ng. After completing all the homework assignments in Octave (an open source alternative to Matlab), I decided to re-implement the whole course in q.
My goal was to push q to its limits, find its Achilles heel and keep pushing until it begged for mercy. In the process, I found the functional vector implementation of many algorithms quite elegant. Others, however, were down-right embarrassing. I found that q’s internal data structures were not designed for fast matrix multiplication. With carefully factored code, however, and the use of an open-source math library, the performance could be improved dramatically. I found the lack of complex data types frustrating and wished q had native sparse matrix operators. I found that transposing large matrices severely impacts performance and algorithms must be designed to prevent this. And finally, I found that q’s limits on function, argument and branch length make translating code from other languages very difficult without refactoring.[2]
I became addicted to implementing machine-learning algorithms and continued to read the original papers for algorithms such as decision trees, adaptive boosting and recommender systems. As I added each new algorithm, I refactored the library to maximize code performance and reuse. I wrote each algorithm on the free 32-bit version of kdb+ and limited all interaction (including visualization) to the q terminal. To make the code suitable for inclusion in this book, all functions and comments were required to be less than 77 characters wide. I believe this constraint improved the code clarity by forcing algorithms to be refactored into functions with descriptive names.
I learned that there exists a problem domain where q implementations of machine-learning algorithms, such as k-means, perform well. Others, such as support vector machines, are best delegated to the canonical implementation written in C/C++. This book is dedicated to machine-learning algorithms implemented in pure q.
I enjoy coding in q. In fact, I truly believe coding in q is fun. My goal is to share this passion with you while simultaneously teaching you machine-learning algorithms and concepts as well as many q coding paradigms. Along the way, I hope you have fun too.
[1] Machine Learning,
Coursera. [Online]. Available: https://www.coursera.org/learn/machine-learning. [Accessed: 12-Mar-2020].
[2] Some of these limits have been relaxed with the release of kdb+ 3.6. Instructions embedded within if, do, and while branching control structures were previously limited to 255 bytecodes. This has now been extended to 65,025 bytecodes. The number of permitted constants, as well as the number local and global variables, has also been increased. The limit of 8 function arguments, however, still remains.
Chapter 1. Introduction
Data comes in many shapes: bytes, text, integral and floating. Data comes in many sizes: boolean flags, text config files, binary images and partitioned databases. Stored and ignored, data is expensive to maintain and backup. The quality of the data is only revealed once it is analyzed. And the value of the data is only determined once it is used to make and test predictions.
Analysis can take the form of visual inspection or statistical summary. Predictions are routinely performed and tested in academia and business. The goal of accurate predicting varies slightly between these two fields, however. While academia strives for truth, business searches for profit. And where there is incentive in academia to share data and analysis, business has historically held it captive.
Recent advances in computational analysis and computing power, and the reduction in storage costs, have ignited a data revolution. Whether we call it data science
, machine learning
, deep learning
or artificial intelligence
, the combined power of data and computational analysis reconfirms that we live in the information age
.
Data is worthless if you can’t extract information from it. As Clifford Stoll and Gary Schubert are quoted with saying:
Each step along the path from data to wisdom requires the expenditure of effort and results in an increase in organization and structure. Recent developments in machine-learning techniques have enabled computers to automate the first step from data to information. The success of these algorithms can be attributed to 10 to 100-fold improvement in algorithm design, an increase in computer speeds by at least 100-fold and an explosion in the amount of data (up to 1,000-fold).[3]
Given the universal availability of algorithms and hardware, a remaining problem is accessing the volumes of data. Kdb+ is known not only for its database performance, but also for its flexible programming language q. Would it not make sense, therefore, to bring the machine-learning algorithms to the data stored in a kdb+ database?
1.1. Who is This Book For?
This book assumes an introductory knowledge of kdb+. It also assumes you have q installed (or know how to install it). Most q developers use the language to build tickerplants, databases or CEP engine. Few use q as an analytics platform, however. This is beginning to change. This book is for the community of developers who, like me, have fallen in love with programming in q and want to extend their computing knowledge into the world of big data and machine-learning algorithms while simultaneously learning how to write better q. It is also for the data scientists who find that the data they need—or want—to analyze is increasingly being stored in a kdb+ database. Using their existing machine-learning knowledge, they will be able to learn q programming idioms that will improve their ability to write efficient code.
1.2. How to Read This Book
This book was written to be read linearly from beginning to end. All code required to run the examples in these chapters is explained when first introduced. You may, of course, jump to future chapters if you are comfortable using functions without understanding their implementation. The last chapter, Chapter 14, is an exception and can be read at any time you are looking for some fun.
I am excited that you purchased this book and are interested in both machine learning and q. Having the book on your desk is a great first step, but the knowledge embedded within won’t magically jump to your brain. I suggest that you run each of the examples yourself. Play with the inputs, examine the outputs and perhaps step through the algorithms by declaring variables with the same name as the function parameters. This is made easier by the consistent naming conventions used in the funq project and listed in Section 1.5. I wrote this book for you and hope you make it through each chapter.
1.3. Getting Started
The source code for this book is included in Appendix A for reference, but can—and should—be downloaded from https://www.fun-q.net. To ensure compatibility with the examples in this book, download release v1.0.0. Once downloaded, you should start your q process from the funq project folder. In this example, I’ve downloaded the 32-bit version of kdb+ 4.0. I always use the -s
bash-3.2$ cd funq
bash-3.2$ q -s 4
KDB+ 4.0 2020.06.18 Copyright (C) 1993-2020 Kx Systems
m32/ 4(4)core 8192MB nick nicks-macbookpro.local 192.168.1.3 NONEXPIRE
q)
There are four main libraries that include the utilities and machine-learning algorithms: ut.q, ml.q, fmincg.q and porter.q. The funq.q library loads each of these files.[4]
q)\l funq.q
q)
In addition to the directories distributed with q, the ut, ml, fmincg and porter directories should now be visible.
q)key `
`q`Q`h`j`o`ut`ml`fmincg`porter
The funq project also includes many other libraries dedicated to downloading data from the internet and loading them into memory. We can, for example, load the full text of Pride and Prejudice by Jane Austen.
q)\l pandp.q
[down]loading pride and prejudice text
http://www.gutenberg.org/files/1342/1342_0.txt
The text, with each chapter represented by a vector of characters, can then be found in pandp.s.
q)pandp.s
"It is a truth universally acknowledged, that a single man in\npossession..
"Mr. Bennet was among the earliest of those who waited on Mr.\nBingley. H..
"Not all that Mrs. Bennet, however, with the assistance of her\nfive daug..
"When Jane and Elizabeth were alone, the former, who had been\ncautious i..
"Within a short walk of Longbourn lived a family with whom the\nBennets w..
"The ladies of Longbourn soon waited on those of Netherfield. The\nvisit ..
"Mr. Bennet's property consisted almost entirely in an estate of\ntwo tho..
..
Other data sets automatically load their contents as tables, matrices and vectors—depending on the appropriate abstraction.
Writing a book that requires data is a brittle process. The examples will break if the data sets move location or change format. All data sets are, therefore, loaded from a dedicated q source file. If future modifications are required, the funq project will be updated to ensure the examples remain valid. Also, if you find that a data set required in the examples below has changed location, please raise an issue in the funq repository.
Installing 7-Zip
The funq project was written with the expectation that it can be run on any platform that kdb+ is supported: Windows, Linux, Solaris and Mac operating systems. Although Linux, Solaris, and Mac come with utilities to uncompress data files in various formats such as .zip, .gz and .tar.gz, Windows does not. To uncompress files on a machine running the Windows operating system, the ut.q library uses the 7-Zip utility (7z.exe). This utility can be downloaded from https://www.7-zip.org.
1.4. What is in This Book?
We begin our journey through machine-learning algorithms in Chapter 2 with the most intuitive supervised learning algorithm: k-nearest neighbors (k-NN). The chapter starts by introducing the funq project, which includes all the code required to run the examples in this book. It then demonstrates the project’s ability to download external data sets by loading, and performing k-NN analysis on, the ever-popular Iris data set. It then takes a step back to explain different distance metrics that are used in machine learning and their relationship with norm metrics. The k-NN implementation is then discussed and examples of using it with different distance metrics are presented. A few common optimizations are then reviewed and the chapter concludes with a presentation of the confusion matrix and a discussion on using cross validation to pick the best value for k (the number of neighbors).
We next turn to a simple, yet elegant, unsupervised clustering algorithm in Chapter 3 called k-means. To explain the k-means algorithm, we discuss the implementation of the Lloyd
method of iterative assignment and update steps, which results in the convergence of k distinct clusters. Three different initialization techniques are covered including random cluster selection, random cluster assignment and an algorithm specifically designed to produce distant clusters: k-means++. The elbow method is then used to pick the optimal number of neighbors. To demonstrate the flexibility of the algorithm, different distance metrics are used to implement k-medians and k-harmonic means. The chapter ends with a discussion of k-medoids, which requires the centroids—the center of each cluster—be actual points within the data set.
We continue our discussion on clustering techniques in Chapter 4 by presenting hierarchical agglomerative clustering. The chapter begins by downloading and describing the University of Eastern Finland’s clustering benchmark data set. After first building a dissimilarity matrix, the iterative Lance-Williams algorithm is dissected and its efficiency and flexibility are demonstrated. Different linkage functions are defined and Ward’s method is explained. Methods for generating and ungrouping clusters are reviewed before the chapter concludes by demonstrating how the silhouette cluster quality method can be used to determine the optimal number of clusters.
The final clustering algorithm, expectation-maximization (EM), is described in Chapter 5. The expectation-maximization algorithm, which is used to estimate distribution parameters, can be used as a generalization of the k-means clustering algorithm. Where the k-means algorithm assigns each observation to a single cluster, the expectation-maximization algorithm assigns each observation a probability of belonging to each of the specified clusters. Probabilities require distributions and the chapter introduces the binomial, multinomial, Bernoulli mixture model and Gaussian mixture model distributions using coin flips, dice rolls, the MNIST handwritten-digit data set and the Iris data set respectively. The concepts of likelihood and log likelihood estimators are also introduced.
Chapter 6 returns to supervised learning algorithms and the use of probability distributions with the naive Bayes algorithm. The Iris data set is used again to demonstrate the implementation of Gaussian naive Bayes while the SMS spam data set is analyzed with multinomial naive Bayes. The SMS spam data set is our first introduction to text processing—also referred to as natural language processing (NLP). We then introduce a q implementation of the Porter stemmer--which allows us to combine English words with the same meaning: transformer and transformers for example. We then review binary classification statistics that allow us to quantify how good our binary predictions are. The chapter concludes by introducing the term-frequency inverse document frequency (TF-IDF) statistic and uses it to predict which Jane Austen book a randomly selected chapter comes from.
Chapter 7 describes a powerful (and perhaps my favorite) machine-learning algorithm—decision trees. By continuously splitting a data set based on the values of each feature, we build a tree that, most importantly, creates an intuitive model of the underlying data. Decision trees have a long history and many competing designs. The funq project provides a general-purpose tree-building algorithm that is parameterized to allow the generation of multiple implementations. After the chapter introduces the impurity functions, which are used to determine how much information is in each of the data set features, the ID3 classification-tree algorithm is examined and applied to the famous weather data set. Before proceeding to review more complex classification and regression trees (CART), the chapter demonstrates two visualization tools: the first produces a terminal-appropriate text tree while the second generates output that can be used to produce an image with the Graphviz tool. Digging into the algorithm, the chapter continues to discuss different information-gain calculations, which are used to identify which feature should be split at each branch in the tree. The Iris data set is once again used to demonstrate the power of decision trees. To prevent overly bushy trees and their tendency to over fit, a pruning technique and cross validation are used to pick the appropriate tree complexity. Regression trees that can be used to predict numeric values are then introduced and the chapter concludes by explaining how one-hot encoding can be used to transform categorical features into multiple features with binary values.
If one decision tree is good, multiple decision trees must be great. Chapter 8 introduces the concept of ensemble learning to distinguish between benign and malignant tumors in the Wisconsin Diagnosis Breast Cancer data set. By combining multiple fully grown trees, we can reduce the tendency of single trees to overfit. This is explained in the section about random forests. The second ensemble method combines multiple decision stumps—decision trees with a single branch—in an algorithm called AdaBoost. By expertly combining multiple weak classifiers, we build a powerful strong classifier. The chapter concludes by using cross validation to determine the optimal number of decision stumps to combine.
We next take our first step towards the magic of deep learning. The journey starts in Chapter 9 with linear regression. Matrices and matrix operators are reviewed and used with the wine-quality data set to predict the quality of red wine. We then review abstractions built into the funq project that enable faster matrix operations. After presenting the normal equations, gradient descent is introduced so optimization problems without closed-form solutions can be solved. After describing the concept of feature scaling, we present functions to perform conjugate gradient and stochastic gradient descent. L1 and L2 regularization functions are reviewed before their relationship with LASSO, ridge regression and elastic net is explained. The chapter ends with a review of the cross-validation technique used to determine the best regularization parameters.
A popular binary classification algorithm called logistic regression is used with the Wisconsin Diagnosis Breast Cancer data set in Chapter 10 to model tumor malignancy. The chapter begins by introducing the sigmoid function and the log-loss cost function. In order to find optimal model parameters with gradient descent, the logistic regression cost and gradient functions are defined. Multi-class classification then explained and a one-vs.-all classifier is implemented. The chapter concludes by testing the accuracy of the algorithm on the MNIST handwritten-digit data set.
Having warmed up with a lap through logistic regression, Chapter 11 sprints through the neural network finish line. The chapter begins with a brief history of the perceptron before describing a fully-connected neural network with a single hidden layer. To prevent the common vanishing-gradient problem, two different parameter-initialization techniques are reviewed. The cost and gradient functions for a neural network with a sigmoid activation function and log-loss cost function are introduced along with the concepts of feed-forward loss calculation and backward propagation. To demonstrate the flexibility of the functional neural network implementation, different activation functions—such as the hyperbolic tangent and rectified linear unit relu—and cost functions—such as the cross entropy and softmax functions—are used to analyze the MNIST handwritten-digit data set. The chapter concludes by demonstrating how, in addition to classification problems, neural networks can be used to model regression problems as well.
With the concepts of linear regression, k-nearest neighbors, gradient descent and regularization firmly in our tool belt, Chapter 12 leverages these techniques to implement movie recommender systems. The chapter begins by reviewing the MovieLens data set and uses linear regularization to create a content-based recommender system. By using genre labels such as Romance
and Comedy
, the model is able to recommend movies that match the genres that we have ranked the highest. We then implement a user-user collaborative filtering recommender system that uses k-nearest neighbors to recommend movies other people with similar tastes also liked. We then introduce two matrix-factorization techniques that decompose the rating matrix into user and item matrices that reflect the user’s interest in, and the movie’s exposure to, hidden features. The first technique uses gradient descent and the second uses a method called alternating least squares.
We finish our discussion of machine learning with an algorithm that made it to the top 10 machine-learning algorithms in 2006. It has since dropped off the list but remains a powerful algorithm that affects our everyday lives. Chapter 13 explains the PageRank algorithm, which was developed by Larry Page and Sergei Brin to rank search results. The chapter begins with the construction of an adjacency matrix to model the links between web pages. This is transformed into a transition matrix also called a stochastic or Markov matrix. By adding random surfing, a Google matrix is created. Three implementations of the PageRank algorithm are then introduced. The power method, which iteratively produces better rankings, is explained first. This is followed by an algebraic solution that uses matrix inversion. The final algorithm uses another iterative approach that allows new observations to adjust a previously stored page rank vector. The chapter concludes with the introduction of sparse matrices and a sparse matrix implementation of the iterative PageRank algorithm.
Admittedly, implementing this many machine-learning algorithms in q may be more fun than you can handle. But if you are ready for more, the last chapter is for you. Chapter 14 is dedicated to using q to visualize data. The chapter begins by reviewing the ASCII plotting tool used throughout the book. The next section uses complex numbers to generate a black and white, as well as grayscale, image of the Mandelbrot set. We then use utility functions to store the Mandelbrot matrices in each of the three Netpbm portable bitmap file formats: black and white, grayscale and color. This allows us to view the images in their full glory with most standard graphics programs. The chapter, book and fun comes to an end when we use q to implement an algorithm to generate sparklines and plot financial time-series data.
1.5. Naming Conventions
The funq project uses single-letter variables, where possible, for function parameters and local variables. In an attempt at consistency, I list the most common variables and their meanings below. Atoms, vectors, tables and functions use lowercase letters. Matrices, however, are represented with uppercase letters.
Table 1.1. Naming Conventions
[3] E. Brynjolfsson and A. McAfee, What’s Driving the Machine Learning Explosion?,
Harvard Business Review, 18 Jul 2017.
[4] The funq.q library also attempts to load a few other libraries, but these require extra packages to be installed are not used in this book.
Chapter 2. K-Nearest Neighbors
Machine-learning algorithms can be divided into two types: supervised and unsupervised. Supervised learning algorithms are trained by giving them a set of inputs and told explicitly what the output should be. Unsupervised learning algorithms, however, are only given input data sets and are expected to find models that describe the relationship between data points. Where unsupervised learning algorithms only model the provided data, supervised learning algorithms build models on training data sets so the models can then be used to make predictions on as-yet unseen test data sets.
Building a model to describe the training data has two benefits:
The costly process of building a model can be performed offline, while classification and regression can be quickly performed online.
Large amounts of training data can be used to build the model and then left behind during prediction.
These algorithms are known as eager learning algorithms because they consume their training data immediately. While the remainder of this book will focus on eager learning examples, this chapter is about the lazy learning algorithm called k-nearest neighbors. It is lazy because all computations on the training data are deferred until the predictions are required. The k-nearest neighbors algorithm is also known as an instance-based algorithm because the training instances are kept with the model and used to make predictions for new observations. Algorithms that build models, and leave the training data behind, are intuitively called model-based algorithms.
The k-nearest neighbors algorithm is the simplest and most intuitive supervised learning algorithm. It begins by building a matrix of distances between the training and test data set. The algorithm then attempts to make a classification (regression) prediction by using the labels (values) of the k nearest training data points to classify (predict) the label (value) of the new observations.
2.1. K-Nearest Neighbors Example
Let’s start with a quick demonstration of how the funq project can use the k-NN algorithm to model the iris data set. This may seem like a cursory treatment of the material. Have no fear, we will dissect every line of code as the chapter progresses.
The following code loads the funq.q library and the iris data set from the iris.q library.[5] It then splits the data into training and test partitions and extracts the data into matrices X and Xt, representing the training and test observations, and vectors y and yt representing the training and test species labels.
q)\l funq.q
q)\l iris.q
[down]loading iris data set
q)d:.ut.part[`train`test!7 3;0N?] iris.t
q)y:first get first `Y`X set' 0 1 cut value flip d`train
q)yt:first get first `Yt`Xt set' 0 1 cut value flip d`test
Using the training partition to classify each unseen Iris observation in the test partition with the k-nearest neighbors algorithm obtains an accuracy of 98% when only the nearest neighbor is used.
q)avg yt=.ml.knn[1f%;1;y] .ml.edist[X] each flip Xt
0.9777778
Using three neighbors manages to make a perfect prediction.
q)avg yt=.ml.knn[1f%;3;y] .ml.edist[X] each flip Xt
1f
And finally, the performance degrades when we start using more distant neighbors.
q)avg yt=.ml.knn[1f%;7;y] .ml.edist[X] each flip Xt
0.9555556
This is just one example of how k-nearest neighbors can be parameterized. The algorithm requires us to make three choices:
Which distance metric should we use?
How many neighbors should we include?
How much weight should we give each neighbor?
Before addressing each