Recent Advances in Ensembles for Feature Selection
About this ebook

This book offers a comprehensive overview of ensemble learning in the field of feature selection (FS), which consists of combining the output of multiple methods to obtain better results than any single method. It reviews various techniques for combining partial results, measuring diversity and evaluating ensemble performance.

With the advent of Big Data, feature selection (FS) has become more necessary than ever to achieve dimensionality reduction. With so many methods available, it is difficult to choose the most appropriate one for a given setting, thus making the ensemble paradigm an interesting alternative.

The authors first focus on the foundations of ensemble learning and classical approaches, before diving into the specific aspects of ensembles for FS, such as combining partial results, measuring diversity and evaluating ensemble performance. Lastly, the book shows examples of successful applications of ensembles for FS and introduces the new challenges that researchers now face. As such, the book offers a valuable guide for all practitioners, researchers and graduate students in the areas of machine learning and data mining.

Language: English
Publisher: Springer
Release date: Apr 30, 2018
ISBN: 9783319900803
    Book preview

    Recent Advances in Ensembles for Feature Selection - Verónica Bolón-Canedo

    © Springer International Publishing AG, part of Springer Nature 2018

    Verónica Bolón-Canedo and Amparo Alonso-Betanzos, Recent Advances in Ensembles for Feature Selection, Intelligent Systems Reference Library 147, https://doi.org/10.1007/978-3-319-90080-3_1

    1. Basic Concepts

    Verónica Bolón-Canedo¹   and Amparo Alonso-Betanzos¹

    (1)

    Facultad de Informática, Universidade da Coruña, A Coruña, Spain

    Verónica Bolón-Canedo

    Email: vbolon@udc.es

    Abstract

    In the new era of Big Data, the analysis of data is more important than ever in order to extract useful information. Feature selection is one of the most popular preprocessing techniques used by machine learning researchers, aiming to find the relevant features of a problem. Since no single best feature selection method exists, a possible approach is to use an ensemble of feature selection methods, which is the focus of this book. Before diving into the specific aspects to consider when building an ensemble of feature selectors, this chapter goes back to the basics and provides the reader with fundamental concepts such as the definition of a dataset, feature and class (Sect. 1.1). Then, Sect. 1.2 comments on measures to evaluate the performance of a classifier, whilst Sect. 1.3 discusses different approaches to divide the training set. Finally, Sect. 1.4 gives some recommendations on statistical tests that are adequate for comparing several models, and Sect. 1.5 points the reader to some dataset repositories.

    This book is devoted to exploring recent advances in ensemble feature selection. Feature selection is the process of selecting the relevant features and discarding the irrelevant ones but, since no single best feature selection method exists, a possible solution is to use an ensemble of multiple methods. Before entering into the specific details of ensemble feature selection, this chapter starts by defining the basic concepts needed to understand the more advanced issues discussed throughout the book.

    1.1 What Is a Dataset, Feature and Class?

    This introductory chapter starts by defining a cornerstone of the field of Data Analysis: the data itself. In the last few years, human society has been collecting and storing vast amounts of information about every subject imaginable, leading to the appearance of the term Big Data. More than ever, data scientists are now in demand, aiming at extracting useful information from vast piles of raw data. But let's start from the beginning... What is data?

    Data is usually collected by researchers in the form of a dataset. A dataset can be defined as a collection of individual data, often called samples, instances or patterns. A sample can be seen as the information about a particular case, for example a medical patient. The information about this particular case is given in the form of features or attributes. A feature might be the sex of the patient, his/her blood pressure or the color of his/her eyes. A feature can be relevant or not, or even redundant with others, but this issue will be explored in depth in Chap. 2.

    A specific task in Data Analysis is called classification, which consists of assigning each sample to a specific class or category. Typically, samples belonging to the same class have similar features and samples belonging to different classes are dissimilar. A simple example can be seen in Table 1.1, which represents the popular play tennis dataset [1].

    Table 1.1 Play tennis dataset

    As can be seen, this toy example contains a total of 15 records or samples, and each sample has four different features (outlook, temperature, humidity and windy) which give information that can be useful to decide whether it is possible to play tennis or not (given that tennis is a sport played outside). The last column represents the prediction variable or class (play), which is the desired output of this dataset in a typical classification scenario. A feature can be discrete (if it takes a finite set of possible values), continuous (if it takes a numerical value) or boolean (if it takes one of two possible values, for example 0 or 1), and in some cases it is necessary to discretize the continuous values, since some machine learning algorithms can only work with discrete data. In the play tennis dataset, the features outlook and temperature are discrete, whilst the features humidity and windy and the class play are boolean (notice that a boolean feature is a particular case of a discrete feature).

    More formally, we can represent a dataset as $$\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_d\} \in \mathbb{R}^{N \times d}$$, and the class labels as $$\mathbf{Y} = \{y_1, \dots, y_N\}$$. A typical dataset is organized as a matrix of N rows (samples) by d columns (features), plus an extra column with the class labels:

    $$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1d} \\ x_{21} & x_{22} & \dots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{Nd} \end{bmatrix} \qquad \mathbf{Y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} $$

    Notice that the element $$x_{ji}$$ contains the value for the ith feature of the jth sample.

    One of the most popular datasets that can be found in the Pattern Recognition literature is the Iris dataset [2]. This dataset has been used in thousands of publications over the years, and consists of distinguishing among three classes of iris plant (setosa, virginica and versicolor). The dataset has four features (petal width and length, and sepal width and length) and 50 samples of each of the three classes. As can be seen in Fig. 1.1, one of the classes (setosa) is clearly linearly separable from the other two, while the classes virginica and versicolor are not linearly separable from each other. Notice that Fig. 1.1 displays petal width versus petal length, but the same separability pattern occurs for every pairwise combination of features.

    Fig. 1.1 Scatter plot of the Iris dataset
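    As a minimal illustration (a sketch, not the book's own code; it assumes scikit-learn's bundled copy of the Iris data and matplotlib for plotting), the matrix representation above and a scatter plot along the lines of Fig. 1.1 can be reproduced as follows:

    ```python
    # Minimal sketch: load Iris, inspect the N x d matrix, and plot petal width vs. length.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    X, Y = iris.data, iris.target          # X: 150 x 4 matrix, Y: 150 class labels

    print(X.shape)                          # (150, 4): N = 150 samples, d = 4 features
    print(X[0, 2])                          # x_{13}: third feature (petal length) of the first sample

    # Scatter plot similar in spirit to Fig. 1.1 (petal length vs. petal width).
    for label, name in enumerate(iris.target_names):
        mask = Y == label
        plt.scatter(X[mask, 2], X[mask, 3], label=name)
    plt.xlabel(iris.feature_names[2])       # petal length (cm)
    plt.ylabel(iris.feature_names[3])       # petal width (cm)
    plt.legend()
    plt.show()
    ```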

    Classes that are linearly separable can be classified with perfect accuracy, whereas when the classes are not separable the classifiers are bound to make some mistakes. This issue will be discussed in detail in the next section.

    1.2 Classification Error/Accuracy

    Although this book is focused on feature selection, a typical way to evaluate the effectiveness of the features selected by a feature selection algorithm is to apply a classifier afterwards and check whether the classification error/accuracy remains acceptable.

    Just to recall, the task of a classifier is to predict the class to which a particular sample belongs. Therefore, we need measures to evaluate how good this prediction is. A very popular performance measure is the classification error, which is the number of incorrectly classified instances divided by the total number of instances. Analogously, the classification accuracy is the number of correctly classified instances divided by the total number of instances.

    However, looking only at the classification error or accuracy is not a good practice. Suppose that we have a dataset with 100 samples, 95 of them belonging to class A and only five of them belonging to class B. Imagine now that we have two classifiers, $$C_1$$ and $$C_2$$. The first classifier, $$C_1$$, simply assigns all the samples to class A, achieving $$95\%$$ accuracy, which sounds fairly high. The second classifier, $$C_2$$, misclassifies four samples belonging to class A and two samples belonging to class B, obtaining $$94\%$$ accuracy. Which classifier is better? The answer depends on the nature of the dataset but, in general, it is better to achieve a trade-off between the performance on the two classes, and so it is necessary to check the classification rate of each class. In a typical binary classification scenario, there are other measures that we can use to evaluate the performance of a classifier, which are listed below. Notice that accuracy and error can be redefined in terms of these new measures.

    True positive (TP): percentage of positive examples correctly classified as positive.

    False positive (FP): percentage of negative examples incorrectly classified as positive.

    True negative (TN): percentage of negative examples correctly classified as negative.

    False negative (FN): percentage of positive examples incorrectly classified as negative.

    $$Sensitivity = \frac{TP}{TP + FN}$$

    $$Specificity = \frac{TN}{TN + FP}$$

    $$Accuracy = \frac{TN + TP}{TN + TP + FN + FP}$$

    $$Error = \frac{FN + FP}{TN + TP + FN + FP}$$
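    As a small sketch (not from the book) that makes the imbalanced example above concrete, the counts below follow the 100-sample scenario with classifiers $$C_1$$ and $$C_2$$, taking the minority class B as the positive class:

    ```python
    # Minimal sketch: sensitivity, specificity, accuracy and error from raw counts.
    # Class B (5 samples) is treated as positive, class A (95 samples) as negative.

    def rates(tp, fp, tn, fn):
        total = tp + fp + tn + fn
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
            "error": (fp + fn) / total,
        }

    # C1 labels everything as class A: 95% accuracy but sensitivity 0.
    print(rates(tp=0, fp=0, tn=95, fn=5))
    # C2 misclassifies 4 class-A and 2 class-B samples: 94% accuracy, sensitivity 0.6.
    print(rates(tp=3, fp=4, tn=91, fn=2))
    ```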

    Another way to check how the errors are distributed across the classes (particularly interesting if the dataset has more than two classes) is to construct a confusion matrix. An entry $$a_{ij}$$ of this matrix represents the number of samples that have been assigned to class $$c_j$$ while their true class was $$c_i$$. To calculate the classification accuracy from this matrix, it is necessary to divide the sum of the elements on the main diagonal by the total number of examples. The confusion matrix is very useful because it gives additional information on where the errors have occurred.

    For example, suppose that we have classified the Iris dataset with a linear discriminant [2]. In Fig. 1.2 we can see that, as expected, the class setosa is correctly classified but there are some errors in the classification of the other two classes. In particular, the confusion matrix presented in Table 1.2 gives us more explicit information about the errors.

    Fig. 1.2 Scatter plot of the Iris dataset classified with a linear discriminant

    Table 1.2 Confusion matrix for the Iris dataset classified with a linear discriminant
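    As a hedged sketch of how a confusion matrix such as Table 1.2 can be obtained with scikit-learn (the exact counts depend on the training protocol and need not match the table; here, purely for illustration, the discriminant is fitted and evaluated on the full Iris data):

    ```python
    # Minimal sketch: fit a linear discriminant on Iris and inspect the confusion matrix.
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix

    iris = load_iris()
    # For illustration only, the model is evaluated on the data it was trained on
    # (see Sect. 1.3 for why a separate test set should be used in practice).
    clf = LinearDiscriminantAnalysis().fit(iris.data, iris.target)
    pred = clf.predict(iris.data)

    # Entry a_ij: number of samples of true class c_i assigned to class c_j.
    cm = confusion_matrix(iris.target, pred)
    print(cm)
    # Accuracy = sum of the main diagonal divided by the total number of samples.
    print(cm.trace() / cm.sum())
    ```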

    1.3 Training and Testing

    In the previous section we have seen, as an example, how the Iris data can be classified. But what happens when a new sample arrives? This is the essence of classification: being able to classify new examples for which the class label is not known, and in this way test our classification model.

    Ideally, one would use all the available labeled examples to train a classifier, so that it can learn the particularities of the data and the relationship between the feature values and the corresponding class. Then, as new unlabeled examples arrive, our trained classifier makes a prediction. But how can we know whether our classifier was trained with data representative enough of the full population? With new unlabeled examples from the real world, it is impossible to answer this question. So, a common practice is to set aside part of the labeled data to act as the test set. Notice that it is very important that testing is done on unseen data. An important aspect we need to take into account is overfitting, which occurs when the learning is so adjusted to the training data that the model is incapable of generalizing to unseen test data. Therefore, in practice, it is common to use some technique to lessen the amount of overfitting, such as cross-validation (which will be discussed later in this section), regularization, early stopping, pruning, etc. All these techniques are based on either explicitly penalizing overly complex models or testing the ability of the model to generalize by evaluating its performance on unseen data.

    All the parameters involved in learning should be tuned on the training data, and this includes feature selection. A common mistake found in the specialized literature is to perform feature selection on all the available data, discard the irrelevant features, and only then divide the data into training and test sets to train the classifier and evaluate the accuracy of the selected features. This is incorrect, since feature selection (and any other type of learning or parameter tuning performed on the data) should be done only on the training set, leaving a test set to evaluate the performance.
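    A minimal sketch of the correct protocol using scikit-learn (the particular selector and classifier below are arbitrary illustrative choices, not the book's): feature selection is fitted on the training split only, and the held-out test split is used exclusively for evaluation.

    ```python
    # Minimal sketch: perform feature selection on the training set only.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

    # The pipeline fits the selector (and the classifier) on the training data only;
    # the test data is seen for the first time when scoring.
    model = make_pipeline(SelectKBest(f_classif, k=2), KNeighborsClassifier())
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
    ```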

    There are some benchmark datasets that come originally divided into training and test sets. For example, the KDD (Knowledge Discovery and Data Mining Tools Conference) Cup 99 dataset is a benchmark for intrusion detection systems. Separate training and test sets were released, with the particularities that the percentage of the different classes (normal connections and several types of attacks) varies significantly from training to test, and that the test set contains new attacks that are not present in the training set [3].

    In other cases, researchers need to keep part of the available data as a test set. There are several training/testing protocols; the most popular ones are described below:

    k-Fold Cross-validation. This is one of the most famous validation techniques [4]. The data D is partitioned into k nonoverlapping subsets $$D_{1},\dots ,D_{k}$$ of roughly equal size. The learner is trained on $$k-1$$ of these subsets combined together and then applied to the remaining subset to obtain an estimate of the prediction error. This process is repeated in turn for each of the k subsets, and the cross-validation error is given by the average of the k estimates of the prediction error thus obtained. In the case of feature selection, note that with this method there will be k subsets of selected features. A common practice is to merge the k different subsets (either by union or by intersection) or to keep the subset obtained in the fold with the best classification result.

    Leave-One-Out Cross-validation. This is a variant of k-fold cross validation where k is the number of samples [4]. A single observation is left out each time.

    Bootstrap. This is a general resampling strategy [5]. A bootstrap sample consists of n samples equally likely to be drawn, with replacement, from the original data. Therefore, some of the samples will appear multiple times, whereas others will not appear at all. The learner is designed on the bootstrap sample and tested on the left-out data points. The error is approximated by a sample mean based on independent replicates (usually between 25 and 200). Some famous variants of this method exist, such as balanced bootstrap or 0.632 bootstrap [6]. As in the previous methods, there will be as many subsets of features as repetitions of the method.

    Holdout Validation. This technique consists of randomly splitting the available data into a disjoint training/test pair [4]. A common partition is to use 2/3 for training and 1/3 for testing. The learner is designed based on the training data and the estimated error rate is the proportion of errors observed in the test data. This approach is usually employed when some of the datasets in a study come originally divided into training and test sets whilst others do not. In contrast to the other validation techniques, a unique set of selected features is obtained.

    The choice of one or another method is not trivial, and it usually depends on the size of the available data. For example, if only a hundred samples are available (as usually happens with microarray data), choosing a 2/3-1/3 holdout validation might not be a good idea, since the training data might not be enough to avoid overfitting [7]. On the contrary, when the data is really large (as happens nowadays with the advent of Big Data), using a 10-fold cross-validation or leave-one-out can result in an excessively time-consuming process, so people tend to go back to the good old holdout method [8]. Moreover, when using a scheme that produces multiple training and testing pairs, there is the open question of which of the models built during the training process should be used in the end. For example, imagine that you have used a 10-fold cross-validation to perform feature selection and evaluated the performance of the selected features in terms of classification accuracy. You end up with ten (possibly different) subsets of features; which one would you use as your final set of relevant features? There is no perfect solution to this problem: some approaches choose the subset that obtains the highest classification accuracy, while others employ the union or intersection of the ten subsets of features.
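    As a hedged illustration of this last point (a sketch in which an arbitrary univariate selector stands in for any feature selection method), a k-fold scheme yields one feature subset per fold, which can then be merged by union or by intersection:

    ```python
    # Minimal sketch: one feature subset per fold, then union / intersection of the subsets.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import KFold

    X, y = load_iris(return_X_y=True)
    subsets = []
    for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        # Fit the selector on the training part of the fold only.
        selector = SelectKBest(f_classif, k=2).fit(X[train_idx], y[train_idx])
        subsets.append(set(np.flatnonzero(selector.get_support())))

    print(subsets)                                    # ten (possibly different) feature subsets
    print(set.union(*subsets), set.intersection(*subsets))
    ```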

    1.4 Comparison of Models: Statistical Tests

    When presenting a new feature selection or classification method, it is necessary to compare it with previous state-of-the-art approaches to demonstrate that the method is sound. For example, if one wants to demonstrate that applying feature selection is useful in a particular domain, a common practice is to compare the classification performance with and without feature selection, expecting that feature selection at least maintains the original performance while using fewer features. Therefore, to know whether the differences between models are important, statistical tests are usually employed.

    When comparing models, there is a set of good practices that it is advisable to follow, based on those given by Kuncheva [8]:

    Choose the training/test procedure carefully (see the previous section) before starting the experiments. When you publish your work, give enough details so that the experiments are clear and can be reproduced.

    Make sure that all the models (whether we are comparing feature selection methods or classifiers) use all the information possible and, of course, that they employ the same data for training and then for testing. For example, it is not fair to perform different 10-fold cross-validations for different models, because the random division of the data may favor one method or another. The correct way is to divide the data into folds and, at each iteration, train the different models on the same training data.

    Make sure that the data reserved for testing was not used before in any training stage.

    When possible, perform statistical tests. It is better for the reader to know if the differences in performance between models are statistically significant or not.

    There are several statistical tests available in the specialized literature; in the following we will describe the most adequate ones for a particular situation based on the recommendations given by Demšar [9].

    1.4.1 Two Models and a Single Dataset

    Suppose that you have a fixed, single dataset and two algorithms (for example, the same classifier with and without feature selection as a previous step). If we want to have some repetitions in order to perform statistical tests, it is necessary to repeatedly split the data into training and test sets and induce our models. A typical choice might be a 10-fold cross-validation. Unfortunately, in this situation it is not possible to apply the classical Student's t-test for paired samples, since this method assumes that the samples are independent, and in a cross-validation they are not (two training sets in a 10-fold cross-validation share 80-90% of the data instances). To solve this, there are several options:

    The corrected t-test presented by Nadeau and Bengio [10], which corrects the bias by presenting a new way to compute the variance (a small sketch of this correction is given after this list).

    McNemar's test.

    Dietterich [11] proposed to perform a 5 $$\times $$ 2 cross-validation. In each 2-fold cross-validation, different data is used for training and testing, so we can assume that the variance estimates are unbiased. Since each estimate is computed from a really small sample (two folds), Dietterich proposed to repeat the process five times.
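    A minimal sketch of the first option above, the corrected resampled t-test (the function name and the example numbers are ours, for illustration only): the naive variance of the k differences is inflated by a term that accounts for the overlap between training sets.

    ```python
    # Minimal sketch: Nadeau-Bengio corrected t-test for k resampled accuracy differences.
    import numpy as np
    from scipy import stats

    def corrected_resampled_ttest(diffs, n_train, n_test):
        """diffs: per-fold performance differences A - B; n_train/n_test: split sizes."""
        diffs = np.asarray(diffs, dtype=float)
        k = len(diffs)
        mean = diffs.mean()
        var = diffs.var(ddof=1)
        # The naive 1/k variance term is corrected with n_test/n_train
        # to account for the overlap between training sets.
        corrected_var = (1.0 / k + n_test / n_train) * var
        t = mean / np.sqrt(corrected_var)
        p = 2 * stats.t.sf(abs(t), df=k - 1)   # two-sided p-value
        return t, p

    # Illustrative differences from a 10-fold CV on 150 samples (135 train / 15 test).
    diffs = [0.02, 0.00, 0.04, -0.01, 0.03, 0.02, 0.01, 0.00, 0.02, 0.03]
    print(corrected_resampled_ttest(diffs, n_train=135, n_test=15))
    ```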

    1.4.2 Two Models and Multiple Datasets

    Given that we have access to data repositories such as the UCI Machine Learning Repository, the tendency is to use several datasets to demonstrate that our new method is better than another; for example, that using feature selection before classifying is better than just classifying. The prevalent approach some years ago to compare two models was to count wins and losses. However, how can we know if an algorithm really wins? If our model A wins on 15 datasets and loses on only two, we can probably say so, but what if the score were 10:7? Notice that our samples, in this case, are the datasets tested, so this is a really small sample size, making it difficult to draw meaningful conclusions.

    Demšar discourages the use of sign tests, as they discard too much information: they only take into account the signs of the differences, but not the margins by which each model wins. So, in this situation, Demšar proposes the use of the Wilcoxon signed rank test [12].
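    As a small sketch (the accuracy values below are made up for illustration), the Wilcoxon signed rank test over the per-dataset scores of two models is available directly in SciPy:

    ```python
    # Minimal sketch: Wilcoxon signed rank test over per-dataset accuracies of two models.
    from scipy.stats import wilcoxon

    acc_with_fs    = [0.91, 0.85, 0.78, 0.88, 0.93, 0.80, 0.84, 0.90, 0.76, 0.89]
    acc_without_fs = [0.89, 0.84, 0.79, 0.85, 0.90, 0.78, 0.83, 0.88, 0.74, 0.87]

    stat, p = wilcoxon(acc_with_fs, acc_without_fs)
    print(stat, p)   # a small p-value suggests the difference between the models is significant
    ```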

    1.4.3 Multiple Models and Multiple Datasets

    Another typical scenario is when you have multiple feature selection algorithms which you want to apply before classifying, and you want to know which one is the best. According to Demšar, repeating the Wilcoxon test for all pairs is not a good idea, since multiple comparisons are something you should avoid in significance testing, especially because your sample size is the number of datasets, which is not large enough. Thus, he suggested the use of the Friedman test [13, 14]. This test only tells you whether the performances of your models differ, so you also need a post-hoc test. There are two possible situations: you either compare multiple algorithms with each other, or you compare your novel method (the control case) with several existing algorithms. In the first case, you have $$k(k-1)/2$$ comparisons (k being the number of models) and Demšar suggested the use of the pairwise Nemenyi test [15]. In the second case, you test $$k-1$$ hypotheses (yours vs. every other) and Demšar suggested the use of the Bonferroni-Dunn test [16].
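    A minimal sketch of the Friedman test with SciPy (the scores are illustrative; as discussed above, a post-hoc Nemenyi or Bonferroni-Dunn test would still be needed to locate the differences):

    ```python
    # Minimal sketch: Friedman test over the per-dataset accuracies of three models.
    from scipy.stats import friedmanchisquare

    model_a = [0.90, 0.82, 0.77, 0.88, 0.93, 0.80]
    model_b = [0.88, 0.80, 0.79, 0.85, 0.90, 0.78]
    model_c = [0.85, 0.78, 0.75, 0.83, 0.87, 0.74]

    stat, p = friedmanchisquare(model_a, model_b, model_c)
    print(stat, p)   # if p is small, the models differ; apply a post-hoc test to find where
    ```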

    1.5 Data Repositories

    Nowadays, there are several data repositories with benchmark datasets in which researchers can find a diverse set of databases to test their novel methods. The most popular ones are listed below:

    The UC Irvine Machine Learning Repository (UCI), from University of California, Irvine:

    http://archive.ics.uci.edu/ml/

    UCI KDD Archive, from University of California, Irvine:

    http://kdd.ics.uci.edu

    LIBSVM Database:

    http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

    Public Data Sets, from Amazon Web Services:

    http://aws.amazon.com/datasets

    The Datahub:

    http://datahub.io/dataset

    Kaggle datasets:

    https://www.kaggle.com/datasets

    There also exist specialized repositories, for example for microarray data (with the particularity of having many more features than samples) or images.

    ImageNet, the most popular collection of public images:

    http://www.image-net.org/

    ArrayExpress, microarray datasets from the European Bioinformatics Institute:

    http://www.ebi.ac.uk/arrayexpress/

    Gene Expression Omnibus, microarray datasets from the National Institutes of Health:

    http://www.ncbi.nlm.nih.gov/geo/

    The Cancer Genome Atlas (TCGA), microarray datasets from both the National Cancer Institute and the National Human Genome Research Institute:

    https://cancergenome.nih.gov/

    Cancer Program Data Sets, microarray datasets from the Broad Institute:

    http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi

    Gene Expression Model Selector, microarray datasets from Vanderbilt University:

    http://www.gems-system.org

    Gene Expression Project, microarray datasets from Princeton University:

    http://genomics-pubs.princeton.edu/oncology/

    1.6 Summary

    Feature selection is one of the most popular preprocessing techniques, and consists of selecting the relevant features and discarding the irrelevant and redundant ones. Researchers agree that no single best feature selection method exists, so a good option might be to combine the outcomes of different selectors, which is known as ensemble feature selection. Before exploring in detail this approach, which will be exhaustively tackled throughout this book, this chapter has reviewed the basic concepts needed to follow the remaining chapters: datasets, features and classes, classification performance measures, training/testing protocols, statistical tests for comparing models, and data repositories.
