Ebook, 565 pages, 5 hours

Text Mining with MATLAB®

Name: Text Mining with MATLAB®
Author: Rafael E. Banchs
ISBN: 9781461441519

By Rafael E. Banchs

About this ebook

Text Mining with MATLAB provides a comprehensive introduction to text mining using MATLAB. It’s designed to help text mining practitioners, as well as those with little-to-no experience with text mining in general, familiarize themselves with MATLAB and its complex applications.

The first part introduces basic procedures for handling and operating on text strings. It then reviews the major mathematical modeling approaches, describing statistical and geometrical models along with the main dimensionality reduction methods. Finally, it presents specific applications such as document clustering, classification, search, and terminology extraction.

All descriptions presented are supported with practical examples that are fully reproducible. Further reading, as well as additional exercises and projects, are proposed at the end of each chapter for those readers interested in conducting further experimentation.

Language: English
Publisher: Springer
Release date: Aug 14, 2012
ISBN: 9781461441519

1 min read
Emulate An Analogue Computer Digitally
Linux Format
Article
Emulate An Analogue Computer Digitally
Feb 6, 2024
11 min read
HotPicks
Linux Format
Article
HotPicks
Nov 19, 2019
12 min read

Related categories

Skip carousel

Reviews for Text Mining with MATLAB®

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Text Mining with MATLAB® - Rafael E. Banchs

Rafael E. Banchs, Text Mining with MATLAB®, 2013. DOI 10.1007/978-1-4614-4151-9_1. © Springer Science+Business Media New York 2013

1. Introduction

Rafael E. Banchs¹

(1) Barcelona

Email: rafael.banchs@gmail.com

Abstract

The universality and ubiquity of the Internet in the current information society has changed human life in many different ways. One important element of this change is the possibility of accessing a virtually infinite amount of information in digital text format. Consequently, the text-oriented derivation of data mining, text mining, has been gaining attention as the available volume of textual information grows at a rate that is by far higher than our human capacity to handle and process such a huge volume of information.

This book introduces some of the fundamental concepts of text mining from an experimental perspective. It presents and illustrates all practical issues and implementations by using MATLAB® technical computing software,¹ a highly specialized programming language for numerical computing. The main contents of the book are presented at an introductory level, which should be useful for readers without any previous experience in using the MATLAB® programming environment or without any previous knowledge of text mining applications and techniques.

This introductory chapter is organized as follows. First, in Sect. 1.1, a brief discussion of text mining and the suitability of the MATLAB® product for text mining applications is presented. Next, in Sect. 1.2, more detailed information is provided about what to expect from this book and how to use it. Then, in Sect. 1.3, a very quick introduction to the MATLAB® programming environment is given.

1.1 About Text Mining and MATLAB®

Data mining, also referred to as knowledge discovery in data, can be defined as the science of extracting useful knowledge from […] huge data repositories.² Accordingly, text mining refers to this knowledge discovery process when the source data under consideration is text.

Strictly speaking, rather than specific areas of knowledge in themselves, data mining and text mining should be regarded as application-oriented interdisciplinary fields. Text mining in particular is closely related to disciplines such as natural language processing, computational linguistics and information retrieval, and relies on important contributions from statistics, machine learning and artificial intelligence in general. Because of this close relationship with and dependence on related disciplines, the scope of text mining and its boundaries with these other disciplines cannot be easily delineated. In this sense, the notion of text mining only becomes clear for a given technique or application in the context of discovering knowledge from a large collection of textual data.

Nowadays, with the increase of computational power and the access to a virtually unlimited amount of information in digital text format, text mining is becoming a very important tool for both providing competitive services to users and extracting valuable knowledge for business intelligence and marketing research applications.

As there are currently several text-oriented computational tools for text mining and data mining in general, you might be wondering why one would use a highly specialized numerical computing language such as the MATLAB® technical computing software for developing and implementing text mining applications. There are actually many reasons for recommending its use for text mining purposes, for instance:

it is a high level application-oriented language which is also relatively easy to learn and use,

it provides a large number of algorithms and methods which are already programmed in the form of functions and toolboxes,

it allows for interfacing with other programming languages such as Fortran, C++ and Java,

it facilitates the creation of user interfaces and the generation of very high quality graphics and plots, and

it allows for debugging and deploying stand-alone applications.

Nevertheless, apart from all these reasons, there is a conceptual and fundamental reason that makes the MATLAB® technical computing software an ideal tool for text mining purposes. Its name derives from MATrix LABoratory, as it was originally conceived as a programming language for manipulating and operating with matrices. On the other hand, as you will see in Chap. 8, one of the most popular ways of modeling and operating with textual data collections is by means of the vector space model, in which a complete collection of documents can be represented by means of a matrix, and most of the basic language processing operations can be conducted by means of matrix and vector operations. In this way, we can actually think about the MATLAB® software as the perfect programming environment for developing, implementing and deploying text mining applications and services.
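As a rough illustration of why a matrix-oriented language suits the vector space model, the following minimal sketch stores a toy collection as a term-document matrix and compares documents with matrix operations only. The counts and the variable names (tdm, norms, unitdocs, cosim) are assumptions made for this sketch and are not taken from the book.

```matlab
% Toy term-document matrix (assumed counts): rows are terms, columns are documents
tdm = [2 0 1;
       1 1 0;
       0 3 1];

% Euclidean length of each document vector (one value per column)
norms = sqrt(sum(tdm.^2, 1));

% Scale every column to unit length; repmat keeps this valid in older releases
unitdocs = tdm ./ repmat(norms, size(tdm, 1), 1);

% Cosine similarity between all pairs of documents as a single matrix product
cosim = unitdocs' * unitdocs;
```

Each entry of cosim compares one pair of documents, so the whole collection is compared without writing an explicit loop.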

Accordingly, the present book is an attempt to simultaneously introduce the unfamiliar reader to the basic concepts of text mining and demonstrate the main advantages of using the MATLAB® technical computing software for implementing text mining applications.

1.2 About this Book

Before getting into technical matters, let us present in more detail what this book is and is not about, and provide some basic but useful tips on how to use it.

The book is structured in three main parts: Fundamentals, Mathematical Models, and Techniques and Applications. The first part, Fundamentals, is devoted to introducing basic procedures and methods for manipulating and operating with text within the MATLAB® programming environment. It comprises Chaps. 2–5, in which text variables, regular expressions, basic string operations and file read/write operations and formats are introduced. More specifically:

Chapter 2 focuses on the different types of variables that can be used for manipulating text, and introduces some basic MATLAB® built-in functions for operating with strings as well as other functions of more general use.

Chapter 3 is completely devoted to the specificities and use of regular expressions in the MATLAB® programming environment.

Chapter 4 focuses on basic operations with strings, such as search, replacement, segmentation, concatenation and some basic set operations that can be applied to string and character sets.

Chapter 5 deals with reading and writing text files and describes some commonly used file formats. Also, in Chap. 5, some basic functions for operating with directories and files are presented and described.

The second part of the book, Mathematical Models, is devoted to motivating, introducing and explaining the two most commonly used paradigms of mathematical models for representing textual data: the statistical approach and the geometric approach. It comprises Chaps. 6–9. More specifically:

Chapter 6 introduces the main concepts related to corpus statistics. First, it presents and discusses some fundamental properties of language such as Zipf’s law of frequencies and the phenomenon of burstiness. Then, it introduces the notion of word co-occurrences and the incidence of word order information.

Chapter 7 is devoted to the statistical modeling approach. It introduces the basic n-gram model, and the fundamental concepts of discounting and model interpolation. Additionally, a brief introduction to statistical bag-of-words models is also presented.

Chapter 8 focuses on the geometrical modeling approach. It starts by presenting the concept of the term-document matrix and then extends it to the vector space model representation. The notions of distance and similarity are also presented and the most commonly used association scores are introduced.

Chapter 9 is devoted to the specific problem of dimensionality reduction. It introduces the ideas of vocabulary pruning and merging, as well as some fundamental linear and non-linear projection methods.

The third part of the book, Techniques and Applications, is devoted to some general problems in text mining and natural language processing applications. More specifically, the problems of document categorization, document search and content analysis are addressed in Chaps. 10, 11 and 12, respectively.

Chapter 10 focuses on the problem of document categorization. It presents and illustrates basic techniques for unsupervised clustering and supervised classification. The case of supervised classification is addressed from both the vector space and the statistical modeling approaches. Also, in this chapter, basic methods for extracting terminology that is relevant to a given document category are illustrated.

Chapter 11 focuses on the problem of document search. More specifically, the binary search and the vector-based search approaches are described and illustrated. This chapter also introduces the basic metrics of precision and recall, as well as some other fundamental concepts of Information Retrieval, such as query expansion and relevance ranking. Finally, the problem of cross-language search is introduced.

Chapter 12 deals with the problem of content analysis. Although this is indeed a very broad concept, this chapter focuses on two specific types of content analysis: polarity estimation and property extraction. In the first case, the problems of detecting polarity and estimating its intensity within the context of opinionated texts are presented and discussed. In the second case, the problem of extracting properties and other specific informational elements by means of text-pattern matching is introduced and illustrated.

This book was conceived mainly for readers with little or no previous knowledge of text mining techniques and applications who are also not familiar with the MATLAB® programming environment. If you belong to this group you should be able to benefit from, as well as enjoy, all the chapters in this book.

However, this book should also be useful for experienced text mining practitioners who are not familiar with the MATLAB® technical computing software. In this case, you should focus your attention on the chapters contained in the first and second parts of the book: Fundamentals and Mathematical Models.

Similarly, this book should also be useful for experienced MATLAB® users without any previous experience in text mining applications. In this case, you should focus your attention on the chapters contained in the second and third parts of the book: Mathematical Models, and Techniques and Applications. You might also need to review Chap. 3, which introduces regular expressions.

Otherwise, if you are both familiar with the MATLAB® programming environment and experienced in text mining: this book is not for you! You should already be acquainted with most of the materials presented throughout the book.

In addition to the main technical sections, each chapter in the book also contains three additional sections: further reading, proposed exercises and references. All these sections provide complementary materials aimed at reinforcing the main concepts covered by the technical sections, and at further exploring some related concepts. The chapters in the second and third parts of the book also include an additional section entitled short projects, which proposes broader and more challenging exercises related to the problems described within the chapter.

All the examples illustrated in this book are fully reproducible from the MATLAB® command window. In this sense, you should be able to follow the explanations in each section of the book and reproduce the very same results presented therein on your own computer. Most of the required data files and functions you will need to reproduce the examples throughout the book are available from the companion website www.textmininglab.net. For some specific examples, in which you will need to get the data by yourself, the pointers to the corresponding sources are provided on the companion website.

The specific MATLAB® version that was used in the preparation of all examples in this book is 7.12.0.635 (R2011a). So, you might expect small differences or inconsistencies in some examples if you are using a different version, especially if you are using an older one. For more information about possible clashes among different MATLAB® versions you should consult the corresponding updates on the companion website of the book.

All the code presented in this book has been created with two specific objectives in mind: intelligibility and demonstrativeness. In this way, example codes are meant to be understandable, as well as to demonstrate the different possibilities offered by the MATLAB® product, but they are not meant to be efficient! Efficiency has not been considered a major criterion for code development in this book. However, some efficiency issues are actually noted and left to you as exercises in the proposed exercises sections of the corresponding chapters.

The examples in any chapter are totally independent from the examples in the others, so you should be able to reproduce the examples in a given chapter without executing any code from previous chapters. However, this is not the case for the examples within the same chapter, as in most cases the results of a given example are used as inputs for the subsequent ones. In this sense, you should be acquainted with the use of the MATLAB® functions save and load, which will allow you to save your work session and restore it later on. A brief presentation of these functions is given in the following section, while a more detailed description is provided in Sect. 5.1.

A final word on how to use this book concerns the companion website www.textmininglab.net, which, more than a simple repository for data files, code files and update notices, is intended to be a space for interacting and sharing your text mining knowledge and experiences. You are therefore strongly encouraged to contribute to this initiative by submitting and posting on the website your answers to the different exercises and short projects in this book. Similarly, you are also encouraged to submit and post new exercises, projects, comments, questions and recommendations. This will provide future generations of text mining with MATLAB® enthusiasts with useful knowledge and valuable resources to draw on.

1.3 A (Very) Brief Introduction to MATLAB®

MATLAB® stands for MATrix LABoratory. It is a highly specialized programming language for numerical computing, which has been specially designed for efficiently operating with matrices. It is an interpreted language, which means that you need the MATLAB® interpreter to execute MATLAB® code. However, it also provides specific tools for creating stand-alone applications. Here, we will restrict ourselves to using the MATLAB® software as an interpreter.

The first thing you need to do is to launch the MATLAB® environment. This opens a window containing several elements. The most important ones are the command window and the workspace. The command window allows you to execute MATLAB® commands one at a time. The workspace contains and displays all the variables that are currently accessible from the command window.

Once you have launched the MATLAB® environment, you can try reproducing the following examples from the command window. First, let us create a matrix:

[Code listing (1.1): image not reproduced]

From this example you can see that creating a matrix by assigning its entry values is indeed very simple. The list of values must be given within brackets. The semicolon is used for vertical concatenation and the comma, or alternatively the white space, is used for horizontal concatenation.
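The concatenation rules just described can be sketched as follows. The entry values are an assumption: they were chosen to agree with the product 1 * 5 + 2 * 7 + 3 * 9 quoted later for (1.8a), and the helper names other, same and dims are hypothetical.

```matlab
% Commas separate the entries of a row; the semicolon starts a new row
matrix = [1, 2, 3; 4, 5, 6]

% White space can be used instead of the comma
other = [1 2 3; 4 5 6];

same = isequal(matrix, other);   % true: both commands build the same matrix
dims = size(matrix);             % [2 3]: two rows and three columns
```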

Indexing operations for accessing specific elements within the matrix are also very simple and intuitive:

[Code listings (1.2a) to (1.2d): images not reproduced]

Notice from the examples in (1.2) how parentheses have been used for retrieving the matrix contents. Notice also that if the output of an operation is not explicitly assigned to a variable, it is assigned to a default variable called ans.

The content of the workspace can be listed at any moment by using the function whos:
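A minimal sketch of typical indexing operations is given here. The specific commands of the original listings are not known, so the matrix values and the selected elements are assumptions; only the use of parentheses and of the default variable ans follows the text.

```matlab
matrix = [1 2 3; 4 5 6];     % assumed example matrix

elem = matrix(2, 3);         % single element: row 2, column 3
row1 = matrix(1, :);         % the colon selects the entire first row
col2 = matrix(:, 2);         % the entire second column
sub  = matrix(1:2, 2:3);     % submatrix: rows 1 to 2, columns 2 to 3

matrix(1, 2)                 % not assigned to a variable: the result goes to ans
fromans = ans;               % keep a copy before ans is overwritten
```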

[Code listing (1.3): image not reproduced]
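As a small sketch of inspecting the workspace, who and whos can also return their information in variables instead of printing it. The matrix values and the names names and info are assumptions made for this illustration.

```matlab
matrix = [1 2 3; 4 5 6];   % assumed example matrix

whos                       % prints name, size, bytes and class of every variable

names = who;               % cell array with the names of the workspace variables
info  = whos('matrix');    % structure describing one variable
```

Here info.size holds the dimensions of matrix and info.class its class, which is double for numeric values entered this way.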

You can save all the variables in the workspace by means of the function save. This will create a binary file called matlab.mat. You can restore your work session by using the function load to read the file matlab.mat and load your saved variables back into the workspace. Let us illustrate this in the following example:

[Code listings (1.4a) to (1.4e): images not reproduced]
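The save and restore cycle described above can be sketched as follows. This is not the book's listing: to avoid overwriting an existing matlab.mat, the sketch saves a single variable to a temporary file whose name (sketch_session.mat) is hypothetical, whereas calling save with no arguments would write the whole workspace to matlab.mat.

```matlab
matrix = [1 2 3; 4 5 6];                           % assumed example matrix

fname = fullfile(tempdir, 'sketch_session.mat');   % hypothetical file name
save(fname, 'matrix');                             % write the variable to a binary MAT-file

clear matrix                                       % remove it from the workspace
gone = ~exist('matrix', 'var');                    % true: matrix is no longer defined

load(fname);                                       % read the file and restore the variable
restored = isequal(matrix, [1 2 3; 4 5 6]);        % true: same contents as before

delete(fname);                                     % remove the temporary file
```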

The MATLAB® programming environment also supports most of the statements commonly used in other programming languages, such as for, while, if, else, etc. For displaying a description of how to use them, you can use the help function. For instance, you can execute: help for, help while, help if, and so on.

Regarding the use of for, it is worth mentioning that in a wide variety of cases, thanks to MATLAB®'s matrix-oriented design, it can be avoided. Suppose, for instance, that you need to create a vector containing the odd integers between 0 and 10. While, conventionally, you would need to do something like this:

[Code listing (1.5): image not reproduced]

within the MATLAB® environment the same can be done as follows:

[Code listing (1.6): image not reproduced]

Two important observations must be made with respect to the example in (1.5). First, notice that a semicolon at the end of a command line suppresses the display of the output of the operation in the command window. This is especially useful when dealing with large matrices and vectors. In general, unless you are really interested in looking at the result of a given operation, the common practice is to end each command with a semicolon.

The second observation is that, unlike C++ and other languages in which array indexes start at zero, MATLAB® array indexes start at 1.

Matrix operations are also very simple and intuitive. In the following example we create a 2 × 3 matrix by multiplying a column vector created from the first two elements of newvector by a row vector containing the last three elements of newvector:
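The contrast between the loop-based and the vectorized construction can be sketched as follows. The name newvector is the one used in the text; the loop variable k, the length n and the name loopvector are assumptions of this sketch.

```matlab
% Conventional approach: fill the vector one element at a time
n = 5;
loopvector = zeros(1, n);           % preallocate; semicolons suppress the output
for k = 1:n                         % indexes start at 1, not at 0
    loopvector(k) = 2*k - 1;
end

% Vectorized approach: the colon operator start:step:stop builds the same vector
newvector = 1:2:10;                 % 1 3 5 7 9
```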

[Code listing (1.7): image not reproduced]

Notice from (1.7) how the apostrophe has been used for transposing the 1 × 2 row vector newvector(1:2) into a 2 × 1 column vector. Notice also how the end keyword has been used to make reference to the last index of the vector.
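A minimal sketch of this construction, assuming that newvector holds the odd integers 1 3 5 7 9 and that the last three elements are selected with end-2:end (the exact command in the original listing is not known):

```matlab
newvector = 1:2:10;                          % assumed contents: 1 3 5 7 9

column = newvector(1:2)';                    % apostrophe: 1x2 row becomes 2x1 column
row    = newvector(end-2:end);               % end refers to the last index: 5 7 9

newmatrix = column * row;                    % (2x1) times (1x3) gives a 2x3 matrix
```

With these values newmatrix is [5 7 9; 15 21 27].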

Two types of mathematical operations are to be distinguished within the MATLAB® programming environment, namely matrix operations and array operations. In the case of addition and subtraction, they are totally equivalent, but in the cases of multiplication, division and exponentiation they produce completely different results. While matrix operations refer to the conventional mathematical definition of matrix operations, array operations refer to operations that are carried out on an element-by-element basis. The following example illustrates such a difference for the specific case of multiplication:

[Code listings (1.8a) and (1.8b): images not reproduced]

As seen from (1.8a), the matrix multiplication implements the mathematical definition of this kind of operation, where the element ij in the resulting matrix is computed by adding up all the products between the elements in row i of the first matrix and the elements in column j of the second matrix. For instance, the 46 in (1.8a) results from the following operation: 1 * 5 + 2 * 7 + 3 * 9. Notice how the multiplication of a 2 × 3 matrix (matrix) by a 3 × 2 matrix (newmatrix') has resulted in a 2 × 2 matrix.

On the other hand, as seen from (1.8b), the array multiplication implements an element-by-element multiplication, in which the element ij in the resulting matrix corresponds to the product of the element ij of the first matrix and the element ij of the second matrix. In this case, the dimensions of the resulting matrix are the same as those of the matrices being multiplied. Notice that the operator for array multiplication is given by ‘.*’. In general, prepending a dot to an arithmetic operator invokes the corresponding array operation.

Recall that implementing either of the two operations in (1.8a) and (1.8b) in a conventional programming language requires at least two nested for–end loops, one for moving along the rows of the resulting matrix and the other for moving along the columns, and the matrix product in (1.8a) additionally needs an innermost loop to accumulate each sum of products!

Although we will not be using it in this book, another very useful feature of the MATLAB® programming environment is that it allows for handling complex numbers in a very natural way too. For instance, let us create a complex-valued matrix with the real parts derived from matrix and the imaginary parts derived from newmatrix:
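The difference between the two kinds of multiplication, and the loops they replace, can be sketched as follows. The operand values are assumptions chosen so that the first entry of the matrix product is the 46 mentioned in the text; the names matprod, arrprod, loopprod and the loop variables are hypothetical.

```matlab
matrix    = [1 2 3; 4 5 6];        % assumed 2x3 operand
newmatrix = [5 7 9; 15 21 27];     % assumed 2x3 operand

matprod = matrix * newmatrix';     % matrix product: (2x3) times (3x2) gives 2x2
arrprod = matrix .* newmatrix;     % array product: element by element, stays 2x3

% The same matrix product written with explicit loops
loopprod = zeros(2, 2);
for r = 1:2                        % rows of the result
    for c = 1:2                    % columns of the result
        for k = 1:3                % accumulate the sum of products
            loopprod(r, c) = loopprod(r, c) + matrix(r, k) * newmatrix(c, k);
        end
    end
end
```

Under these assumptions matprod is [46 138; 109 327] and arrprod is [5 14 27; 60 105 162].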

[Code listing (1.9): image not reproduced]

Operations with complex-valued matrices and vectors can be carried out in exactly the same way as operations with real-valued matrices and vectors.

Finally, before concluding this (very) brief introduction to MATLAB®, let us discuss in more detail the issue of writing and executing programs. In addition to the option of executing commands one by one in the command window, it is also possible to write and save code in specific files called m-files.

An m-file is actually a text file containing MATLAB® code. The two main types of m-files are scripts and functions. Both types of m-files can be created in the MATLAB® text editor or, alternatively, in any other text editor able to generate plain text files. Additionally, both types of m-files can be executed directly from the command window, and invoked from other m-files (either a script or a function).

The main difference between scripts and functions is that scripts are executed over the main workspace, the same one you use when executing commands directly from the command window. This means that all variables created and used in the script are loaded into the workspace, and that all previously existing variables in the workspace are accessible and can be used from the script. You can think of a script as a segment of code that you write down in a text file and execute all at once as a single command (the script file name), rather than having to input and execute the same code one command at a time.
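A minimal sketch of building and using a complex-valued matrix, assuming the same operand values as in the earlier sketches (the original listing may have used a different construction, for instance the built-in function complex):

```matlab
matrix    = [1 2 3; 4 5 6];          % assumed real parts
newmatrix = [5 7 9; 15 21 27];       % assumed imaginary parts

cmatrix = matrix + 1i * newmatrix;   % 1i is the imaginary unit

re = real(cmatrix);                  % recovers matrix
im = imag(cmatrix);                  % recovers newmatrix
magnitude = abs(cmatrix);            % element-wise modulus
twice = 2 * cmatrix;                 % same operators as for real-valued data
```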

On the other hand, a function is executed in a dedicated workspace that is created when the function is called and discarded when the function execution ends. According to this, functions do not have access to the variables in the main workspace. So, they need to receive, as input variables during the function call, those workspace variables they must use. Equivalently, internal variables created during the function execution are not visible from the main workspace. So, variables of interest must be returned by the functions as output variables.

For declaring an m-file to be a function, the first line of the file must obey the following syntax:

[Code listing (1.10): image not reproduced]

where out_n refers to the output variables returned by the function, and in_m refers to the input variables received by the function. The total numbers of output and input variables, N and M, can each be zero or any positive integer. The m-file containing a given function function_name must be named function_name.m.

Although you can create your own functions, one of the main advantages of the MATLAB® technical computing software is that it already provides hundreds of functions for you to use. A simple example illustrating the use of a function is the one for computing the inverse of a matrix:
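The general pattern of the declaration line is function [out_1, ..., out_N] = function_name(in_1, ..., in_M). As a minimal sketch, here is a complete function m-file with N = 2 outputs and M = 1 input; the function name rowcolsums and what it computes are hypothetical and serve only to illustrate the syntax.

```matlab
function [rowsums, colsums] = rowcolsums(M)
% ROWCOLSUMS  Hypothetical example: row and column sums of a matrix.
%   This code must be saved in a file named rowcolsums.m.
rowsums = sum(M, 2);   % column vector with one sum per row of M
colsums = sum(M, 1);   % row vector with one sum per column of M
end
```

Calling [r, c] = rowcolsums([1 2 3; 4 5 6]) from the command window would then return r = [6; 15] and c = [5 7 9], while the internal variables of the function stay invisible to the main workspace.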

[Code listing (1.11): image not reproduced]
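A minimal sketch of calling the built-in function inv follows. The input matrix is an assumption (the 2 × 3 matrices used so far are not square, so they cannot be inverted), and the names square, inverse and check are hypothetical.

```matlab
square = [2 1; 1 1];          % assumed square, non-singular matrix

inverse = inv(square);        % one input variable, one output variable

check = square * inverse;     % should be the 2x2 identity, up to rounding error
```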

We will be using both scripts and functions throughout this book, depending on what happens to be most convenient in each specific situation.

One final piece of advice regarding the writing and use of scripts and functions, which also applies to the creation and use of variables in general, is that you must be careful not to reuse existing script, function or variable names for your newly created scripts, functions or variables. This kind of oversight can result in mysterious bugs that are difficult to track down and fix. For avoiding name collisions in the specific case of variables, the function genvarname can be used to generate valid variable names that are different from those of other existing variables.
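A minimal sketch of checking for name collisions follows. The use of who to build the exclusion list and of exist to test a candidate name is a design choice of this sketch rather than something prescribed by the text, and the variable names are hypothetical.

```matlab
matrix = [1 2 3; 4 5 6];               % an existing variable

taken = who;                           % names currently defined in the workspace
fresh = genvarname('matrix', taken);   % valid name that differs from every name in taken

clash = exist('inv');                  % nonzero: inv already names an existing function
```

A nonzero value returned by exist is a signal that a candidate name is already in use and is better avoided.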

1.4 Further Reading

There are several introductory and advanced books related to text mining theory and applications. Some good examples include (Feldman and Sanger 2006; Srivastava and Sahami 2009; Berry and Kogan 2010). Other good reference books in the related field of natural language processing include (Manning and Schütze 1999; Jurafsky and Martin 2000).

There are also several introductory books to MATLAB®. Some good references include (Palm III 2007; Gilat 2008; Pratap 2009; Etter 2010). However, the most comprehensive and updated guide to MATLAB® can be found in the corresponding online Product Documentation (The MathWorks 2011).

References

Berry MW, Kogan J (2010) Text mining: applications and theory. John Wiley & Sons, Padstow

Etter DM (2010) Introduction to MATLAB, 2nd edn. Prentice Hall, New Jersey

Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge

Gilat A (2008) MATLAB: an introduction with applications, 3rd edn. John Wiley & Sons, New Jersey

Jurafsky D, Martin JH (2000) Speech and language processing: an introduction to natural language processing computational linguistics and speech recognition. Prentice-Hall, New Jersey

Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge

The MathWorks (2011) MATLAB product documentation. http://www.mathworks.com/help/techdoc/learn_matlab/bqr_2pl.html. Accessed 6 Nov 2011

Palm III WJ (2007) A concise introduction to MATLAB. McGraw Hill Higher Education, New York

Pratap R (2009) Getting started with MATLAB: a quick introduction for scientists and engineers. Oxford University Press, New York

Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications. Data mining and knowledge discovery series. Chapman and Hall/CRC, London

Footnotes

¹ MATLAB® is a registered trademark of The MathWorks, Inc.

² This definition is taken from the Curriculum Proposal of the ACM Special Interest Group on Knowledge Discovery and Data Mining, http://www.sigkdd.org/curriculum.php. Accessed 19 November 2011.

Part 1

FUNDAMENTALS

Rafael E. Banchs

Text Mining with MATLAB®

Springer

Rafael E. Banchs

A*Star, Institute for Infocomm Research, Queenstown, Singapore

rembanchs@i2r.a-star.edu.sg

ISBN 978-1-4614-4150-2, e-ISBN 978-1-4614-4151-9

Fundamental brainwork, is what makes the difference in all art.

Dante Gabriel Rossetti

2. Handling Textual Data

Abstract

This chapter introduces the main variable classes that are used in the MATLAB® programming environment for representing and handling text. First, in Sect. 2.1, the basic variable type for representing text, the character array, is described. Then, in Sects. 2.2 and 2.3, cell arrays and structures, which are the most commonly used variable classes for handling and operating with text, are described, respectively. Finally, in Sect. 2.4, a brief overview is provided of the specific MATLAB® built-in functions for operating with text, as well as of other useful functions worth knowing.

2.1 Characters and Character Arrays

The basic variable type for representing text in the MATLAB® programming environment is the character. A character is a variable used to represent symbols in some predefined encoding system. Depending on the encoding scheme being used, a character can be represented with one, two or more bytes. By default, when writing and reading text files from your system, the MATLAB® environment uses the default encoding of the operating system. However, in the case of MATLAB® data files, the Unicode encoding system is used. This guarantees the portability of data files across systems. The specific encoding of a given text can be changed at any moment by using the MATLAB® functions native2unicode and unicode2native. More details on encoding schemes will be given in Chap. 5, where we will focus our attention on the problem of writing and reading text files.

A text string is represented as an array (or matrix) of characters. To define a variable as a character array, its value must be enclosed in apostrophes. Try the following example in the command line:
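The effect of the encoding scheme on the number of bytes per character can be sketched with the two conversion functions named above. The sample character and the chosen encodings (UTF-8 and ISO-8859-1) are assumptions of this sketch.

```matlab
ch = char(233);                              % the character é (code point 233)

utf8  = unicode2native(ch, 'UTF-8');         % two bytes: 195 169
latin = unicode2native(ch, 'ISO-8859-1');    % one byte: 233

back = native2unicode(utf8, 'UTF-8');        % bytes converted back to one character
```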

[Code listing (2.1): image not reproduced]
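A minimal sketch of defining a character array follows. The text content and the variable name greeting are assumptions: the book names its variable string, which recent MATLAB releases also use as the name of a built-in data type, so a different name is used here.

```matlab
greeting = 'Hello World';        % apostrophes create a 1x11 character array

cls   = class(greeting);         % 'char'
dims  = size(greeting);          % [1 11]: one row, eleven characters
third = greeting(3);             % indexing works as for any other array: 'l'
codes = double(greeting(1:2));   % numeric character codes: 72 101
```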

Such a command defines and initializes the variable string as a
