Feature Engineering Bookcamp
Ebook · 590 pages · 5 hours

About this ebook

Deliver huge improvements to your machine learning pipelines without spending hours fine-tuning parameters! This book’s practical case-studies reveal feature engineering techniques that upgrade your data wrangling—and your ML results.

In Feature Engineering Bookcamp you will learn how to:

    Identify and implement feature transformations for your data
    Build powerful machine learning pipelines with unstructured data like text and images
    Quantify and minimize bias in machine learning pipelines at the data level
    Use feature stores to build real-time feature engineering pipelines
    Enhance existing machine learning pipelines by manipulating the input data
    Use state-of-the-art deep learning models to extract hidden patterns in data

Feature Engineering Bookcamp guides you through a collection of projects that give you hands-on practice with core feature engineering techniques. You’ll work with feature engineering practices that speed up the time it takes to process data and deliver real improvements in your model’s performance. This instantly-useful book skips the abstract mathematical theory and minutely-detailed formulas; instead you’ll learn through interesting code-driven case studies, including tweet classification, COVID detection, recidivism prediction, stock price movement detection, and more.

About the technology
Get better output from machine learning pipelines by improving your training data! Use feature engineering, a machine learning technique for designing relevant input variables based on your existing data, to simplify training and enhance model performance. While fine-tuning hyperparameters or tweaking models may give you a minor performance bump, feature engineering delivers dramatic improvements by transforming your data pipeline.

About the book
Feature Engineering Bookcamp walks you through six hands-on projects where you’ll learn to upgrade your training data using feature engineering. Each chapter explores a new code-driven case study, taken from real-world industries like finance and healthcare. You’ll practice cleaning and transforming data, mitigating bias, and more. The book is full of performance-enhancing tips for all major ML subdomains—from natural language processing to time-series analysis.

What's inside

    Identify and implement feature transformations
    Build machine learning pipelines with unstructured data
    Quantify and minimize bias in ML pipelines
    Use feature stores to build real-time feature engineering pipelines
    Enhance existing pipelines by manipulating input data

About the reader
For experienced machine learning engineers familiar with Python.

About the author
Sinan Ozdemir is the founder and CTO of Shiba, a former lecturer of Data Science at Johns Hopkins University, and the author of multiple textbooks on data science and machine learning.

Table of Contents
1 Introduction to feature engineering
2 The basics of feature engineering
3 Healthcare: Diagnosing COVID-19
4 Bias and fairness: Modeling recidivism
5 Natural language processing: Classifying social media sentiment
6 Computer vision: Object recognition
7 Time series analysis: Day trading with machine learning
8 Feature stores
9 Putting it all together

 
Language: English
Publisher: Manning
Release date: October 18, 2022
ISBN: 9781638351405
    Book preview

    Feature Engineering Bookcamp - Sinan Ozdemir

    1 Introduction to feature engineering

    This chapter covers

    Understanding the feature engineering and machine learning pipeline

    Examining why feature engineering is important to the machine learning process

    Taking a look at the types of feature engineering

    Understanding how this book is structured and the types of case studies we will focus on

    Much of the current discourse around artificial intelligence (AI) and machine learning (ML) is inherently model-centric, focusing on the latest advancements in ML and deep learning. This model-first approach often comes with, at best, little regard for and, at worst, total disregard of the data being used to train said models. Fields like MLOps are exploding with ways to systematically train and utilize ML models with as little human interference as possible to free up the engineer’s time.

    Many prominent AI figures are urging data scientists to place more focus on a data-centric view of ML that focuses less on the model selection and hyperparameter-tuning process and more on techniques that enhance the data being ingested and used to train our models. Andrew Ng is on record saying that machine learning is basically feature engineering and that we need to be moving more toward a data-centric approach. Adopting a data-centric approach is especially useful when the following are true:

    Datasets have few observations (fewer than 10,000), so we need to extract as much information as possible from each row.

    Datasets have a large number of columns compared to the number of observations. This can lead to what is known as the curse of dimensionality, which describes an extremely sparse universe of data that ML models have difficulty learning from.

    Interpretability of the data and model is key.

    The domain of the data is inherently complex (e.g., accurate financial modeling is virtually impossible without clean and complete data).

    We should be focusing on a part of the ML pipeline that requires arguably the most nuanced and careful deliberation: feature engineering.

    In this book, we will dive into the different algorithms and statistical testing procedures used to identify the strongest features, create new ones, and measure ML model success as it relates to the strength of these features. For our purposes, we will define a feature as an attribute or column of data that is meaningful to an ML model. We will make these dives by way of several case studies, each belonging to a different domain, including healthcare and finance, and will touch on several types of data, including tabular, text, image, and time-series data.

    1.1 What is feature engineering, and why does it matter?

    The term feature engineering conjures different images for different data scientists. For some data scientists, feature engineering is how we narrow down the features needed for supervised models (e.g., trying to predict a response or outcome variable). For others, it is the methodology used to extract numerical representations from unstructured data for an unsupervised model (e.g., trying to extract structure from a previously unstructured dataset). Feature engineering is both of these and much more.

    For the purposes of this book, feature engineering is the art of manipulating and transforming data into a format that optimally represents the underlying problem that an ML algorithm is trying to model and mitigates inherent complexities and biases within the data.

    Data practitioners often rely on ML and deep learning algorithms to extract and learn patterns from data even when the data they are using are poorly formatted and non-optimal. Reasons for this range from the practitioner trusting their ML models too much to simply not knowing the best practices for dealing with messy and inconsistent data and hoping that the ML model will just figure it out for them. This approach never even gives the ML models a chance to learn from proper data and dooms the data scientist from the start.

    It comes down to whether the data scientist is willing and able to use their data as much as possible by engineering the best possible features for their ML task. If we do not engineer proper features and rely on complex and slow ML models to figure it out for us, we will likely be left with poor ML models. If we instead take the time to understand our data and craft features for our ML models to learn from, we can end up with smaller, faster models with on-par, or even superior, performance.

    When it comes down to it, we want our ML models to perform as well as they possibly can on whatever metric we choose to judge them by. To accomplish this, we can manipulate both the data and the model (figure 1.1).


    Figure 1.1 When taking a more data-centric approach to ML, we are not as concerned with improving the ML code; instead, we are concerned with manipulating the input data in such a way that the ML model has an easier time surfacing and using patterns in the data, leading to better overall performance in the pipeline.

    This book focuses not on how to optimize ML models but, rather, on techniques for transforming and manipulating data to make it easier for ML models to process and learn from datasets. We will show that there is a whole world of feature engineering techniques that can help the overall ML pipeline that isn’t just picking a better model with better hyperparameters.

    1.1.1 Who needs feature engineering?

    According to the 2020 State of Data Science survey by Anaconda (see https://www.anaconda.com/state-of-data-science-2020), data wrangling (which we can consider a stand-in term for feature engineering with the added step of data loading) takes up a disproportionate amount of time and, therefore, is on the mind of every data scientist. The survey shows how data management is still taking up a large portion of data scientists’ time. Nearly half of the reported time was spent on data loading and cleansing. The report claims that this was disappointing and that data preparation and cleansing takes valuable time away from real data science work. One thing to note is that data cleansing is a pretty vague term and likely was used as a catchall for exploratory data analysis and all of feature engineering work. We believe that data preparation and feature engineering is a real, vital, and almost always unavoidable part of a data scientist’s work and should be treated with as much respect as the portions of the pipeline that are focused on data modeling.

    This book is dedicated to showcasing powerful feature engineering procedures, including model fairness evaluation (in our fairness case study chapter), deep learning-based representation learning (in both our NLP and image analysis case study chapters), hypothesis testing (in our healthcare case study), and more. These feature engineering techniques can affect model performance as much as the model selection and training process.

    1.1.2 What feature engineering cannot do

    It is important to mention that good feature engineering is not a silver bullet. Feature engineering cannot, for example, solve the problem of too little data for our ML models. While there is no magic threshold for how small is too small, in most cases, when working with datasets of under 1,000 rows, feature engineering can only do so much to squeeze as much information out of those observations as possible. Of course, there are exceptions to this. When we touch on transfer learning in our NLP and image case studies, we will see how pretrained ML models can learn from mere hundreds of observations, but this is only because they’ve been pretrained on hundreds of thousands of observations already.

    Feature engineering also cannot create links between features and responses where none exist. If the features we start with hold no implicit predictive power over our response variable, then no amount of feature engineering will create that link. We may be able to achieve small bumps in performance, but we cannot expect either feature engineering or ML models to magically create relationships between features and responses for us.

    1.1.3 Great data, great models

    Great models cannot exist without great data. It is virtually impossible to guarantee an accurate and fair model without well-structured data that deeply represents the problem at hand.

    I’ve spent the majority of my ML career working with natural language processing (NLP); specifically, I focus on building ML pipelines that can automatically derive and optimize conversational AI architecture from unstructured historical transcripts and knowledge bases. Early on, I spent most of my days focusing on deriving and implementing knowledge graphs and using state-of-the-art transfer learning and sequence-to-sequence models to develop conversational AI pipelines that could learn from raw human-to-human transcripts and be able to update on new topics as new conversations came in.

    It was after my most recent AI startup was acquired that I met a conversational architecture designer and linguist named Lauren Senna, who taught me about the deep structure in conversations that she and her teams used to build bots that could outperform any of my auto-derived bots any day of the week. Lauren told me about the psychology of how people talk to and interact with bots and why it differed from how knowledge base articles are written. It was then that I finally realized I needed to spend more time focusing our ML efforts on preprocessing to bring out these latent patterns and structures, so the predictive systems could grab hold of them and become more accurate than ever. She and I were responsible for, in some cases, up to 50% improvement in bot performance, and I would speak at various conferences about how data scientists could utilize similar techniques to unlock patterns in their own data.

    Without understanding and respecting the data, I could have never brought out the greatness of the models trying their best to capture, learn from, and scale up the patterns locked within the data.

    1.2 The feature engineering pipeline

    Before we dive into the feature engineering pipeline, we need to back up a bit and talk about the overall ML pipeline. This is important because the feature engineering pipeline is itself a part of the greater ML pipeline, so this will give us the perspective we need to understand the feature engineering steps.

    1.2.1 The machine learning pipeline

    The ML pipeline generally consists of five steps (figure 1.2):

    Defining the problem domain—What problem are we trying to solve with ML? This is the time to define any characteristics we want to prioritize, like the speed of model predictions or interpretability. These considerations will be crucial when it comes to model evaluation.

    Obtaining data that accurately represents the problem we are trying to solve—Think about and implement methods of collecting data that are fair, safe, and respectful of the data providers’ privacy. This is also a great time to perform an exploratory data analysis (EDA) to get a good sense of the data we are working with. I will assume you have done your fair share of EDA on data, and I will do my fair share in this book to help you understand our data as much as possible. If this is a supervised problem, are we going to deal with imbalanced classes? If this is an unsupervised problem, do we have a sample of data that will represent the population well enough to draw good enough insights?

    Feature engineering—This is the main focus of this book and the pivotal point in our ML pipeline. This step involves all of the work of creating the optimal representation of data that can be fed into the ML models.

    Model selection and training—This is a huge part of the data scientist’s pipeline and should be done diligently and with care. At this stage, we are choosing models that best fit our data and our considerations from step 1. If model interpretability was highlighted as a priority, perhaps, we will stay in the family of tree-based models over deep learning-driven models.

    Model deployment and evaluation—At this stage, our data have been prepped, our models have been trained, and it’s time to put our models into production. At this point, the data scientist can consider model versioning and prediction speeds as factors in the readiness of their models. For example, will we need some sort of user interface to obtain predictions synchronously, or can we perform predictions offline? Evaluation processes must be deployed to track our models’ performance over time and look out for model decay.


    Figure 1.2 The ML pipeline. From left to right: we must understand the problem domain, obtain and understand data, engineer our features (the main focus of this book), select and train our models, and then deploy models with the understanding that we may need to double back to any of the past steps if evaluations of the models show any kind of data or concept drift that would manifest as model decay—a drop in performance over time for our ML model.

    Tip Speaking of the problem domain, you aren’t required to be an expert in a particular domain to be a data scientist working on problems in that field. That being said, I would strongly encourage you to, at the very least, reach out to experts in a field and do some research to get yourself in a position where you can understand the potential pros and cons of architecting ML pipelines that may affect people.

    In the last step of the ML pipeline, we also need to watch out for concept drift (when our interpretation of the data changes) and data drift (when the underlying distributions of our data change). These are references to how data may change over time. In this book, we will not need to worry about these concepts, but they are worth taking a moment to explore deeper.

    Concept drift is the phenomenon in which the statistical properties of a feature or the response change over time. If we train a model on a dataset at a point in time, we have, by definition, a snapshot of a function that relates our features to our response. As time progresses, the environment that the data represent may evolve, and how we perceive those features and responses may also change. This idea is most often applied to response variables but can also be considered for our features.

    Imagine we are data scientists for a streaming media platform. We are tasked with building a model to predict when we should show a speed bump to the user and ask them whether they are still watching. We can build a basic model to predict this using metrics, such as minutes since they pressed a button or average length of an episode of the show they are currently watching, and our response would be a simple True or False to should we show the speed bump or not? At the time of model creation, our team sat down and, as domain experts, thought of all the ways we may want to show this speed bump. Maybe they fell asleep. Maybe they had to run out for an errand and left it on by accident. So we build a model and deploy it. Two months later, we start to receive requests to increase the time it takes to show the speed bump, and our team gets back together to read the requests. As it turns out, a large group of people (including this author) use streaming media apps to play soothing documentaries for their dogs and cats to help them with their separation anxiety when they leave for long stretches of time. This is a concept that our model was not trying to account for. We now have to add observations and features like Is the show about animals? to help account for this new concept.

    Data drift refers to the phenomenon that our data’s underlying distribution has shifted for some reason, but our interpretation of that feature remains unchanged. This is common when there are behavior changes that our models have not accounted for. Imagine we’re back at the streaming media platform. We built a model in late 2019 to predict the number of hours someone would watch a show, given variables such as their past watching habits, types of shows they enjoy, and more, and it was going well. Suddenly, a global pandemic arises, and some of us (no judgment) start watching media online more often, maybe even while we are working to make it sound like people are still around us even while we are home alone. Our response variable’s distribution (which is measured in hours of watch time) will dramatically shift to the right, and our model may not be able to keep up its past performance, given this distribution shift. This is data drift. The concept of hours watched hasn’t changed, but it is our underlying distribution of that response that has changed.

    This idea can be applied just as easily to a feature. If hours watched was a feature to a new response variable of Will this person watch the next episode if we offer it to them? the same principles apply, and that dramatic shift in the distribution is something our model hasn’t seen before.
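    A common way to detect this kind of distribution shift (a general technique, not code from this book) is a two-sample Kolmogorov-Smirnov test comparing a feature’s training-time distribution against recent data. The "hours watched" numbers below are synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "hours watched": training-time distribution vs. a shifted recent one
train_hours = rng.normal(loc=2.0, scale=0.5, size=1000)
recent_hours = rng.normal(loc=3.5, scale=0.8, size=1000)  # shifted right

# Two-sample KS test: a small p-value suggests the two samples
# were drawn from different distributions
statistic, p_value = stats.ks_2samp(train_hours, recent_hours)
drift_detected = p_value < 0.01
print(drift_detected)  # True for this synthetic shift
```

    The threshold and the test itself (population stability index, KL divergence, etc.) are design choices that depend on how sensitive to drift the pipeline needs to be.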

    If we zoom in around the middle portion of the ML pipeline, we see feature engineering. Feature engineering, as it is a part of the larger ML pipeline, can be thought of as its own pipeline with its own steps. If we were to double-click and open up the feature engineering box in the ML pipeline, we would see the following steps:

    Feature understanding—Recognizing the levels of data we are working with is crucial and will impact which types of feature engineering are available to us. It is at this stage that we will have to, for example, ascertain what level our data belong to. Don’t worry; we will get into the levels of data in the next chapter.

    Feature structuring—If any of our data are unstructured (e.g., text, image, video, etc.; see figure 1.3), we must convert them to a structured format, so our ML models can understand them. An example would be converting pieces of text into a vector representation or transforming images into a matrix form. We can use feature extraction or learning to accomplish this.


    Figure 1.3 Raw data, such as text, audio, images, and videos, must be transformed into numerical vector representations to be processed by any ML algorithm. This process, which we will refer to as feature structuring, can be done through extraction techniques, such as applying a bag-of-words algorithm or using a nonparametric feature learning approach, like autoencoders (both bag-of-words and autoencoders are covered in our NLP case study). We will see both of these methods used in the fourth case study, on natural language processing.

    Feature optimization—Once we have a structured representation for our data, we can apply optimizations, such as feature improvement, extraction, construction, and selection, to obtain the best data possible for our models. The majority of day-to-day feature engineering work falls in this category, and most of the code examples in this book revolve around feature optimization. Every case study will have some instances of feature optimization, in which we will have to either create new features or take existing ones and make them more powerful for our ML model.

    Feature evaluation—As we alter our feature engineering pipelines to try different scenarios, we will want to see just how effective the feature engineering techniques we’ve applied are. We can achieve this by choosing a single learning algorithm and, perhaps, a few parameter options for quick tuning. We can then compare different feature engineering pipelines against that constant model to rank how each pipeline performs with and without a given technique applied. If we are not seeing the performance we need, we will go back to previous optimization and structuring steps to attempt to get a better data representation (figure 1.4).


    Figure 1.4 Zooming in on the feature engineering phase of our ML pipeline, we can see the steps it takes to develop proper and successful feature engineering pipelines.
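    The evaluation step can be sketched with scikit-learn: hold the model constant and vary only the feature engineering step. The dataset and pipeline choices here are illustrative, not from the book:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Constant model: the same learning algorithm for every comparison
raw_score = cross_val_score(
    LogisticRegression(max_iter=5000), X, y, cv=5
).mean()

# Same model, but with one feature engineering step (standardization) added
scaled_score = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)), X, y, cv=5
).mean()

print(f"raw features:    {raw_score:.3f}")
print(f"scaled features: {scaled_score:.3f}")
```

    Because the model is held constant, any difference in cross-validated score can be attributed to the feature engineering step rather than to model choice.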

    1.3 How this book is organized

    A book consisting of many case studies can be hard to organize. On one hand, we want to provide ample context and intuition behind the techniques we are going to use to engineer our features. On the other hand, we recognize the value of examples and code samples to help solidify the concepts.

    To that end, we will put both hands together for a high five as we build a narrative around each case study to show end-to-end code that solves a domain-specific problem, while breaking up segments of the code with written sections to explain why we did what we just did and what we are about to do next. I hope this will offer up the best of both worlds, showing the reader both hands-on code and high-level thinking about the problem at hand.

    1.3.1 The five types of feature engineering

    The main focus of this book is on five main categories of feature engineering. We will touch on each of these five categories in the next chapter, and we will continually refer back to them throughout the entire book:

    Feature improvement—Making existing features more usable through mathematical transformations

    Example—Imputing (filling in) missing temperatures on a weather dataset by inferring them from the other columns
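    A minimal sketch of that kind of imputation on a hypothetical weather table (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly weather readings with one missing temperature
weather = pd.DataFrame({
    "hour": [0, 1, 2, 3],
    "temperature": [15.0, np.nan, 17.0, 18.0],
    "humidity": [0.80, 0.78, 0.75, 0.70],
})

# Improve the existing feature: fill the gap from neighboring readings
weather["temperature"] = weather["temperature"].interpolate()
print(weather["temperature"].tolist())  # [15.0, 16.0, 17.0, 18.0]
```

    Linear interpolation is only one option; chapter 2 style improvements could just as well use column means or values inferred from correlated columns like humidity.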

    Feature construction—Augmenting the dataset by creating new interpretable features from existing interpretable features

    Example—Dividing the total price of home feature by the square foot of home feature to create a price per square foot feature in a home-valuation dataset
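    That construction is a one-liner in pandas; the home prices below are invented for illustration:

```python
import pandas as pd

# Hypothetical home-valuation data
homes = pd.DataFrame({
    "total_price": [300_000, 450_000, 150_000],
    "square_feet": [1_500, 3_000, 1_000],
})

# Construct a new interpretable feature from two existing interpretable ones
homes["price_per_sqft"] = homes["total_price"] / homes["square_feet"]
print(homes["price_per_sqft"].tolist())  # [200.0, 150.0, 150.0]
```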

    Feature selection—Choosing the best subset of features from an existing set of features

    Example—After creating the price per square foot feature, possibly removing the previous two features if they don’t add any value to the ML model anymore
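    One simple selection heuristic (a hypothetical sketch, not the book’s method) is to rank features by their absolute correlation with the response and keep only the top ones; real pipelines typically use more robust approaches, such as those in scikit-learn’s feature_selection module:

```python
import pandas as pd

# Hypothetical features and a made-up binary response
homes = pd.DataFrame({
    "total_price": [300_000, 450_000, 150_000, 600_000],
    "square_feet": [1_500, 3_000, 1_000, 4_000],
    "price_per_sqft": [200.0, 150.0, 150.0, 150.0],
})
response = pd.Series([1, 0, 0, 0])

# Rank features by absolute correlation with the response, keep the best one
correlations = homes.corrwith(response).abs().sort_values(ascending=False)
selected = correlations.head(1).index.tolist()
print(selected)  # ['price_per_sqft']
```

    In this toy data, the constructed price_per_sqft feature carries the signal, so the two original features could be dropped, exactly the scenario the example above describes.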

    Feature extraction—Relying on algorithms to automatically create new, sometimes uninterpretable, features, usually based on making parametric assumptions about the data

    Example—Relying on pretrained transfer learning models, like Google’s BERT, to map unstructured text to a structured and generally uninterpretable vector space
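    A full BERT example is beyond a short sketch, but the same idea, algorithmically mapping data into a lower-dimensional, generally uninterpretable space under parametric assumptions, can be illustrated with principal component analysis (PCA); the data here are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 100 observations of 10 correlated features built from 3 hidden factors
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + rng.normal(scale=0.1, size=(100, 10))

# Extract 3 new, uninterpretable features that capture most of the variance
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (100, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```

    Like BERT embeddings, the extracted components have no direct human-readable meaning; their value is that they compactly represent structure in the original features.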

    Feature learning—Automatically generating a brand new set of features, usually by extracting structure and learning representations from raw unstructured data, such as text, images, and videos, often using deep learning

    Example—Training generative adversarial networks (GANs) to deconstruct and reconstruct images for the purposes of learning the optimal representation for a given task

    At this point, it is worth noting two things. First, it doesn’t matter if we are working with an ML model that is supervised or unsupervised. This is because features, as we’ve defined them, are attributes that are meaningful to our ML model. So whether our goal is to cluster observations together or predict the price movement of a stock in a few hours, how we engineer our features will make all the difference. Secondly, oftentimes people will perform operations on data that are consistent with feature engineering without the intention of feeding the data into an ML model. For example, someone may want to vectorize text into a bag-of-words representation for the purpose of creating a word cloud visualization, or perhaps, a company needs to impute missing values on customer data to highlight churn statistics. This is, of course, valid, but it will not fit our relatively strict definition of feature engineering as it relates to ML.

    If we were to look at the four steps of feature engineering and how our five types of feature engineering fit in, we would end up with an end-to-end pipeline showing how to ingest and manipulate data for the purpose of engineering features that best help the ML model solve the task at hand. That pipeline would look something like figure 1.5.


    Figure 1.5 Our final zoom-in on the ML feature engineering pipeline. The feature engineering pipeline consists
