Ebook, 552 pages, 5 hours

Text Mining in Practice with R

Name: Text Mining in Practice with R
Author: Ted Kwartler
ISBN: 9781119282082


About this ebook

A reliable, cost-effective approach to extracting priceless business information from all sources of text

Excavating actionable business insights from data is a complex undertaking, and that complexity is magnified by an order of magnitude when the focus is on documents and other text information. This book takes a practical, hands-on approach to teaching you a reliable, cost-effective way to mine the vast, untold riches buried within all forms of text using R.

Author Ted Kwartler clearly describes all of the tools needed to perform text mining and shows you how to use them to identify practical business applications to get your creative text mining efforts started right away. With the help of numerous real-world examples and case studies from industries ranging from healthcare to entertainment to telecommunications, he demonstrates how to execute an array of text mining processes and functions, including sentiment scoring, topic modelling, predictive modelling, extracting clickbait from headlines, and more. You’ll learn how to:

Identify actionable social media posts to improve customer service
Use text mining in HR to identify candidate perceptions of an organisation, match job descriptions with resumes, and more
Extract priceless information from virtually all digital and print sources, including the news media, social media sites, PDFs, and even JPEG and GIF image files
Make text mining an integral component of marketing in order to identify brand evangelists, impact customer propensity modelling, and much more

Most companies’ data mining efforts focus almost exclusively on numerical and categorical data, while text remains a largely untapped resource. Especially in a global marketplace where being first to identify and respond to customer needs and expectations imparts an unbeatable competitive advantage, text represents a source of immense potential value. Unfortunately, until now there has been no reliable, cost-effective technology for extracting analytical insights from the huge and ever-growing volume of text available online and in other digital sources, as well as in paper documents.


Mathematics

Language: English

Publisher: Wiley

Release date: May 12, 2017




7 min read
Is It Possible To Render On My iPad?
3D World
Article
Is It Possible To Render On My iPad?
Jan 30, 2024
2 min read
Manipulate Data Like A Pro With Pandas
Linux Format
Article
Manipulate Data Like A Pro With Pandas
Jul 27, 2021
7 min read
Create An Advertising Illustration
3D World
Article
Create An Advertising Illustration
Apr 22, 2020
8 min read
How To Develop A RESTful Client In Go
Linux Format
Article
How To Develop A RESTful Client In Go
Nov 16, 2021
Mihalis Tsoukalos is a systems engineer and technical writer. He’s the author of Go Systems Programming and Mastering Go. You can reach him at @mactsouk. The subject of this month’s tutorial is RESTful services. In particular, you’re going to learn h
9 min read
FRACTALS Going beyond the Mandelbrot Set
Linux Format
Article
FRACTALS Going beyond the Mandelbrot Set
Jul 2, 2019
10 min read
How To Develop Multi-threaded Code
Linux Format
Article
How To Develop Multi-threaded Code
Jul 26, 2022
Get the code for this tutorial from the Linux Format archive: www. linuxformat. com/archives ?issue=292. You can learn more about Rust at www. rust-lang.org. This month’s instalment of our ongoing Rust series will cover concurrent programming. The di
10 min read
Machine Learning – With Zero Programming
APC
Article
Machine Learning – With Zero Programming
Aug 12, 2019
6 min read
Database Control With C++ Tools
Linux Format
Article
Database Control With C++ Tools
Dec 17, 2019
10 min read
Website And RSS Feed Python Scraping
Linux Format
Article
Website And RSS Feed Python Scraping
Oct 18, 2022
Matt Holder has worked in IT support for over a decade, and is keen to utilise Linux alongside other installed systems. All the Python scripts that we’ve discussed in this tutorial are all available at https://github.com/mattmole/LXF295. Before we b
8 min read

Text Mining in Practice with R - Ted Kwartler

Chapter 1

What is Text Mining?

In this chapter, you will learn

the basic definition of practical text mining

why text mining is important to the modern enterprise

examples of text mining used in enterprise

the challenges facing text mining

an example workflow for processing natural language in analytical contexts

a simple text mining example

when text mining is appropriate

Learning how to perform text mining should be an interesting and exciting journey throughout this book. A fun artifact of learning text mining is that you can use the methods in this book on your own social media or online exchanges. Beyond these everyday online applications to your personal interactions, this book provides business use cases in an effort to show how text mining can improve products, customer service, marketing or human resources.

1.1 What is it?

There are many technical definitions of text mining, both on the Internet and in textbooks. Because the primary goal of text mining in this book is to extract a useful output, such as a visualization or a structured table that can be used elsewhere, this is my definition:

Text mining is the process of distilling actionable insights from text.

Text mining within the context of this book is a commitment to real world cases which impact business. Therefore, the definition and this book are aimed at meaningful distillation of text with the end goal to aid a decision-maker. While there may be some differences, the terms text mining and text analytics can be used interchangeably. Word choice is important; I use text mining because it more adequately describes the uncovering of insights and the use of specific algorithms beyond basic statistical analysis.

1.1.1 What is Text Mining in Practice?

In this book, text mining is more than an academic exercise. I hope to show that text mining has enterprise value and can contribute to various business units. Specifically, text mining can be used to identify actionable social media posts for a customer service organization. It can be used in human resources for various purposes such as understanding candidate perceptions of the organization or to match job descriptions with resumes. Text mining has marketing implications to measure campaign salience. It can even be used to identify brand evangelists and impact customer propensity modeling. Presently the state of text mining is somewhere between novelty and providing real actionable business intelligence. The book gives you not only the tools to perform text mining but also the case studies to help identify practical business applications to get your creative text mining efforts started.

1.1.2 Where Does Text Mining Fit?

Text mining fits within many disciplines, in both private and academic settings. For academics, text mining may aid in the analytical understanding of qualitatively collected transcripts or the study of language and sociology. For the private enterprise, text mining skills are often contained in a data science team. This is because text mining may yield interesting and important inputs for predictive modeling, and also because the text mining skill set has been highly technical. However, text mining can be applied beyond a data science modeling workflow. Business intelligence could benefit from the skill set by quickly reviewing internal documents such as customer satisfaction surveys. Competitive intelligence analysts and marketers can review external text to provide insightful recommendations to the organization. As businesses save more textual data, they will need to extend text mining skills beyond the data science team. In the end, text mining could be used in any data-driven decision where text naturally fits as an input.

1.2 Why We Care About Text Mining

We should care about textual information for a variety of reasons.

Social media continues to evolve and affect an organization's public efforts.

Online content from an organization, its competitors and outside sources, such as blogs, continues to grow.

The digitization of formerly paper records is occurring in many legacy industries, such as healthcare.

New technologies like automatic audio transcription are helping to capture customer touchpoints.

As textual sources grow in quantity, complexity and number of sources, the concurrent advance in processing power and storage has translated to vast amounts of text being stored throughout an enterprise's data lake.

Yet today's successful technology companies largely rely on numeric and categorical inputs for information gains, machine learning algorithms or operational optimization. It is illogical for an organization to study only structured information yet still devote precious resources to recording unstructured natural language. Text represents an untapped input that can further increase competitive advantage. Lastly, enterprises are transitioning from an industrial age to an information age; one could argue that the most successful companies are transitioning again to a customer-centric age. These companies realize that taking a long-term view of customer wellbeing ensures long-term success and helps the company to remain salient. Large companies can no longer merely create a product and forcibly market it to end-users. In an age of increasing customer expectations, customers want to be heard by corporations. As a result, to be truly customer-centric in a hyper-competitive environment, an organization should be listening to its constituents whenever possible. Yet the amount of textual information from these interactions can be immense, so text mining offers a way to extract insights quickly.

Text mining will make an analyst's or data scientist's efforts to understand vast amounts of text easier and help ensure credibility from internal decision-makers. The alternative to text mining may mean ignoring text sources or merely sampling and manually reviewing text.

1.2.1 What Are the Consequences of Ignoring Text?

There are numerous consequences of ignoring text.

Ignoring text is not an adequate response in an analytical endeavor. Rigorous scientific and analytical exploration requires investigating sources of information that can explain phenomena.

Not performing text mining may lead an analysis to a false outcome.

Some problems are almost entirely text-based, so not using these methods would mean significant reduction in effectiveness or even not being able to perform the analysis.

Explicitly ignoring text may be a conscious analyst decision, but doing so ignores text's insightful possibilities. This is analogous to an ostrich that sticks its head in the ground when confronted. If the aim is robust investigative quantitative analysis, then ignoring text is inappropriate. Of course, there are constraints to data science or business analysis, such as strict budgets or timelines. Therefore, it is not always appropriate to use text for analytics, but if the problem being investigated has a text component, and resource constraints do not forbid it, then ignoring text is not suitable.

Wisdom of Crowds 1.1

As an alternative, some organizations will sample text and manually review it. This may mean having a single assessor or a panel of readers, or even outsourcing analytical efforts to human-based services like Amazon Mechanical Turk (mturk) or CrowdFlower. Communication theory often does not support these methods as a sound way to score text or to extract meaning. Setting aside sampling biases and logistical tabulation difficulties, communication theory states that the meaning of a message relies on the recipient. Therefore a single evaluator introduces biases in meaning or numerical scoring, e.g. sentiment as a numbered scale. Additionally, the idea behind a group of people scoring text relies on Sir Francis Galton's theory of Vox Populi, or the wisdom of crowds.

To exploit the wisdom of crowds, four elements must be considered:

Assessors need to exercise independent judgments.

Assessors need to possess a diverse information understanding.

Assessors need to rely on local knowledge.

There has to be a way to tabulate the assessors' results.

Sir Francis Galton's experiment exploring the wisdom of crowds met these conditions with 800 participants. At an English country fair, people were asked to guess the weight of a single ox. Participants guessed separately from each other without sharing their guesses. Participants were free to look at the ox themselves but did not receive expert consultation. In this case, contestants had diverse backgrounds. For example, there were no prerequisites stating that they needed to be a certain age, demographic or profession. Lastly, guesses were recorded on paper for Sir Francis to tabulate and study. In the end, the experiment showed the merit of the wisdom of crowds. There was not an individual correct guess. However, the median of the group's guesses was remarkably close to the true weight. It was even better than the guesses of the individual farming experts.

If these conditions are not met explicitly, then the results of the panel are suspect. This may seem easy to do, but in practice it is hard to ensure within an organization. For example, a former colleague at a major technology company in California shared a story about the company's effort to create Internet-connected eyeglasses. The eyeglasses were shared with internal employees, and feedback was then solicited. The text feedback was sampled and scored by internal employees. At first blush this seems like a fair assessment of the product's features and expected popularity. However, the conditions for the wisdom of crowds were not met. Most notably, the need for a decentralized understanding of the question was not met. As members of the same technology company, the respondents were already part of a self-selected group that understood the importance of the overall project within the company. Additionally, the panel had a similar assessment bias because its members were from the same division that was working on the project. This assessing group did not satisfy the need for independent opinions when assessing the resulting surveys. Further, if a panel is creating summary text as the output of the reviews, then the effort is merely an information reduction effort similar to numerically taking an average. Thus it may not solve the problem of too much text in a reliable manner. Text mining addresses these problems. It uses all of the presented text and does so in a logical, repeatable and auditable way. There may be analyst or data scientist biases, but they are documented in the effort and are therefore reviewable. In contrast, crowd-based reviewer assessments are usually not reviewable.

In contrast to ignoring text or relying on a non-scientific sampling method, text mining offers real benefits. Text mining technologies are evolving to meet the demands of the organization and to support data-driven decisions. Throughout this book, I will focus on the benefits and practical applications of text mining in business.

1.2.2 What Are the Benefits of Text Mining?

There are many benefits of text mining including:

Trust is engendered among stakeholders because little to no sampling is needed to extract information.

The methodologies can be applied quickly.

Using R allows for auditable and repeatable methods.

Text mining identifies novel insights or reinforces existing perceptions based on all relevant information.

Interestingly, text mining first appeared in the Gartner Hype Cycle in 2012. At that moment, it was listed in the trough of disillusionment. In subsequent years, it has not been listed on the cycle at all, leading me to believe that text analysis is either at a steady state of enterprise use or has been abandoned by enterprises as not useful. Despite not being listed, text mining is used across industries and in various manners. It may not have lived up to the over-hyped potential of 2012's Gartner Hype Cycle, but text is showing merit. Hospitals use text mining of doctors' notes to understand readmission characteristics of patients. Financial and insurance companies use text to identify compliance risks. Retailers use customer service notes to make operational changes when they fail to meet customer expectations. Technology product companies use text mining to seek out feature requests in online reviews. Marketing is a natural fit for text analysis. For example, marketing companies monitor social media to identify brand evangelists. Human resource analytics efforts focus on matching resume text to job description text. As described here, text mining is a skill set sought out across verticals, and mastering it is therefore a worthwhile professional endeavor. Figure 1.1 shows possible business units that can benefit from text mining in some form.

Figure 1.1 Possible enterprise uses of text mining.

1.2.3 Setting Expectations: When Text Mining Should (and Should Not) Be Used

Since text is often a large part of a company's database, it is often believed that text mining will lead to ground-breaking discoveries or significant optimization. As a result, senior leaders in an organization will devote resources to text mining, expecting extensive results. Often specialists are hired, and resources are explicitly devoted to text mining. Regardless of the text mining software used, in this case R, it is best to use text mining only in cases where it naturally fits the business objective and problem definition. For example, at a previous employer, I wondered how prospective employees viewed our organization compared to peer organizations. Since these candidates were outside the organization, capturing numerical or personal information such as age or company-related perspective scoring was difficult. However, there are forums and interview reviews anonymously shared online. These are shared as text, so naturally text mining was an appropriate tool. When using text mining, you should prioritize defining the problem and reviewing applicable data, not using an exotic text mining method. Text mining is not an end in itself and should be regarded as another tool in an analyst's or data scientist's toolkit.

Text mining cannot distill large amounts of text to gain an absolute view of the truth. Text mining is part art and part science. An analyst can mislead stakeholders by removing certain words or using only specific methods. Thus, it is important to be up front about the limitations of text mining. It does not reveal an absolute truth contained within the text. Just as an average reduces a large set of numbers for easier consumption, text mining reduces information. Sometimes it confirms previously held beliefs and sometimes it provides novel insights. Similar to numeric dimension reduction techniques, text mining abridges outliers, low-frequency phrases and important information. It is important to understand that language is more colorful and more diversely understood than numerical or strict categorical data. This poses a significant problem for text miners. Stakeholders need to be wary of any text miner who claims to know a truth based solely on the algorithms in this book. Rather, the methods in this book can help with the narrative of the data and the problem at hand, or the outputs can even be used in supervised learning alongside numeric data to improve predictive outcomes. When doing predictive modeling with text alongside non-text features, a best practice is to model both with and without the text in the attribute set. Text is so diverse that it may even add noise to predictive efforts. Table 1.1 lists actual use cases where text mining may be appropriate.

Table 1.1 Example use cases and recommendations to use or not use text mining.

Another suggestion for effective text mining is to avoid overusing word clouds. Analysts armed with the knowledge of this book should not create a word cloud without a need for it. Word clouds are often used without need, which diminishes their impact. However, word clouds are popular and can be powerful in showing term frequency, among other things. Throwing caution to the wind, Figure 1.2 shows a word cloud of the terms in Chapter 1. It is not very insightful because, as expected, the terms text and mining are the most frequent and largest words in the cloud!

Figure 1.2 A gratuitous word cloud for Chapter 1.

In fact, word clouds are so popular that an entire chapter is devoted to various types of word clouds that can be insightful. However, many people consider word clouds a cliché, so their impact is fading. Also, word clouds represent a relatively easy way to mislead consumers of an analysis. In the end, they should be used in conjunction with other methods to confirm the correctness of a conclusion.
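The term frequencies that a word cloud displays can also be inspected directly as a table. The following is a minimal sketch in base R, assuming two made-up sentences stand in for the chapter text; the object names (docs, clean, words, term_freq) are illustrative and not taken from the book's scripts.

```r
# Hypothetical text; any character vector could be used instead.
docs <- c("Text mining is the process of distilling actionable insights from text.",
          "Text mining is part art and part science.")

# Lowercase the text, strip punctuation and split on whitespace.
clean <- gsub("[[:punct:]]", "", tolower(docs))
words <- unlist(strsplit(clean, "\\s+"))

# Count each term and sort from most to least frequent.
term_freq <- sort(table(words), decreasing = TRUE)
head(term_freq, 5)
```

A sorted table like this is less eye-catching than a cloud, but it makes the underlying counts explicit and easy to check against other methods.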

1.3 A Basic Workflow – How the Process Works

Text represents unstructured data that must be preprocessed into a structured form. Features need to be defined and then extracted from the larger body of organized text known as a corpus. These extracted features are then analyzed. The chevron arrows in Figure 1.3 represent structured predefined steps that are applied to the unorganized text to reach the final output or conclusion. Overall, Figure 1.3 is a high-level workflow of a text mining project.

Figure 1.3 Text mining is the transition from an unstructured state to a structured understandable state.

The steps for text mining include:

1. Define the problem and specific goals. As with other analytical endeavors, it is not prudent to start searching for answers before the question has been defined. Doing so will disappoint decision-makers and could lead to incorrect outputs. As the practitioner, you need to acquire subject matter expertise sufficient to define the problem and the outcome in an appropriate manner.

2. Identify the text that needs to be collected. Text can be from within the organization or outside. Word choice varies between mediums like Twitter and print so care must be taken to explicitly select text that is appropriate to the problem definition. Chapter 9 covers places to get text beyond reading in files. The sources covered include basic web scraping, APIs and R's specific API libraries, like twitteR. Sources are covered later in the book so you can focus on the tools to text mine, without the additional burden of finding text to work on.

3. Organize the text. Once the appropriate text is identified, it is collected and organized into a corpus or collection of documents. Chapter 2 covers two types of text mining conceptually, and then demonstrates some preparation steps used in a bag of words text mining method.

4. Extract features. Creating features means preprocessing text for the specific analytical methodology being applied in the next step. Examples include making all text lowercase, or removing punctuation. The analytical technique in the next step and the problem definition dictate how the features are organized and used. Chapters 3 and 4 work on basic extraction to be used in visualizations or in a sentiment polarity score. These chapters are not performing heavy machine learning or technical analysis, but instead rely on simple information extraction such as word frequency.

5. Analyze. Apply the analytical technique to the prepared text. The goal of applying an analytical methodology is to gain an insight or a recommendation or to confirm existing knowledge about the problem. The analysis can be relatively simple, such as searching for a keyword, or it may be an extremely complex algorithm. Subsequent chapters require more in-depth analysis based on the prepared texts. A chapter is devoted to unsupervised machine learning to analyze possible topics. Another illustrates how to perform a supervised classification while another performs predictive modeling. Lastly you will switch from a bag of words method to syntactic parsing to find named entities such as people's names.

6. Reach an insight or recommendation. The end result of the analysis is to apply the output to the problem definition or expected goal. Sometimes this can be quite novel and unexpected, or it can confirm a previously held idea. If the output does not align to the defined problem or completely satisfy the intended goal, then the process becomes iterative and can be adjusted at various steps. By focusing on real case studies that I have encountered, I hope to instill a sense of practical purpose to text mining. To that end, the case studies, the use of non-academic texts and the exercises of this book are meant to lead you to an insight or narrative about the issue being investigated. As you use the tools of this book on your own, my hope is that you will remember to lead your audience to a conclusion.

The distinct steps are often specific to the particular problem definition or analytical technique being applied. For example, if one is analyzing tweets, then removing retweets may be useful, but it may not be needed in other text mining explorations. Using R for text mining means the processing steps are repeatable and auditable. An analyst can customize the preprocessing steps outlined throughout the book to improve the final output. The end result is an insight, a recommendation or an output to be used in another analysis. The R scripts in this book follow this transition from an unorganized state to an organized state, so it is important to recall this mental map.
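As a rough illustration of problem-specific preprocessing, the sketch below drops retweets and then applies the lowercase and punctuation steps mentioned in the workflow. It uses base R only. The example tweets, the assumption that retweets begin with "RT ", and the helper name prepare_text are all hypothetical and are not the book's own code.

```r
# Hypothetical tweets; the "RT " prefix convention is an assumption about the data.
tweets <- c("RT @shoefan: Love these shoes!",
            "The sole ripped after two weeks...",
            "Great FIT, would buy again!")

# Drop retweets, assumed here to start with "RT ".
originals <- tweets[!grepl("^RT ", tweets)]

# Lowercase, remove punctuation and collapse repeated whitespace.
prepare_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  trimws(gsub("\\s+", " ", x))
}

prepared <- prepare_text(originals)
prepared
```

Because every step is a line in a script, the same preparation can be rerun or audited later, and a step such as the retweet filter can be removed when the text is not from Twitter.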

The rest of the book follows this workflow and adds more context and examples along the way. For example, Chapter 2 examines the two main approaches to text mining and how to organize a collection of documents into a clean corpus. From there you start to extract features of the text that are relevant to the defined problem. Subsequent chapters add visualizations, such as word clouds, so that a data scientist can tell the analytical narrative in a compelling way to stakeholders. As you progress through the book, the types and methods of extracted features or information grow in complexity because the defined problems get more complex. You then turn to sentiment polarity so you can understand Airbnb reviews. Using this information you will build compelling visualizations and learn what qualities are part of a good Airbnb review.

Then in Chapter 5 you learn topic modeling using machine learning. Topic modeling provides a means to understand the smaller topics within a collection of documents without reading the documents themselves. It can be useful for tagging documents relating to a subject. The next subject, document classification, is used often. You may be familiar with document classification because it is used in email inboxes to separate spam from legitimate emails. In this book's example you search for clickbait among online headlines. Later you examine text in patient records to model how a hospital identifies diabetic readmission. Using this method, some hospitals use text to improve patient outcomes. In the same chapter you even examine movie reviews to predict box office success.

In a subsequent chapter you switch from the basic bag of words methodology to syntactic parsing using the OpenNLP library. You will identify named entities, such as people, organizations and locations, within Hillary Clinton's emails. This can be useful in legal proceedings in which the volume of documentation is large and the deadlines are tight. Marketers also use named entity recognition to understand what influencers are discussing. The remaining chapters return your attention to some more basic principles at the top of the workflow, namely where to get text and how to read it into R. This will let you use the scripts in this book with text that matches your own interests.

1.4 What Tools Do I Need to Get Started with This?

To get started in text mining you need a few tools. You should have access to a laptop or workstation with at least 4GB of RAM. All of the examples in this book have been tested on Microsoft Windows. RAM is important because R's processing is done in memory. This means that the objects being analyzed must fit in RAM. Also, a high-speed Internet connection will aid in downloading the scripts, R library packages and example text data, and in gathering text from various webpages. Lastly, the computer needs to have an installation of R and RStudio. The operating system of the computer should not matter because R is available for Windows, Linux and Mac.
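Since the objects being analyzed live in memory, it can help to check how much space a text object occupies before scaling up. The snippet below is a small sketch using base R's object.size(); the reviews vector is invented for illustration.

```r
# A character vector standing in for a corpus (hypothetical content).
reviews <- rep("The shoes fit well but the sole wore out quickly.", 10000)

# object.size() reports an estimate of the memory the object occupies.
size_bytes <- object.size(reviews)
format(size_bytes, units = "MB")
```

The reported figure is an estimate, and intermediate objects created during cleaning take additional memory, so it is best treated as a lower bound on what an analysis will need.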

1.5 A Simple Example

Online customer reviews can be beneficial to understanding customer perspectives about a product or service. Further, reviewers can sometimes leave feedback anonymously, allowing authors to be candid and direct. While this may lead to accurate portrayals of a product, it may also lead to keyboard courage or extremely biased opinions. I consider it a form of selection bias, meaning that the people who leave feedback may have strong convictions not indicative of the overall public perception of the product or service. Text mining allows an enterprise to benchmark its product reviews and develop a more accurate understanding of some public perceptions. Approaches like topic modeling and polarity (positive and negative scoring), which are covered later in this book, may be applied in this context. Scoring methods can be normalized across different mediums such as forums or print, and when done against a competing product, the results can be compelling.

Suppose you are a Nike employee and you want to know how consumers view the Nike Men's Roshe Run Shoes. The text mining steps to follow are:

1. Define the problem and specific goals. Using online reviews, identify overall positive or negative views. For negative reviews, identify a consistent cause of the poor review to be shared with the product manager and manufacturing personnel.

2. Identify the text that needs to be collected. There are running websites providing expert reviews, but since the shoes are mass market, a larger collection of general use reviews would be preferable. New editions come out annually, so old reviews may not be relevant to the current release. Thus, a shopping website like Amazon could provide hundreds of reviews, and since there is a timestamp on each review, the text can be limited to a particular timeframe.

3. Organize the text. Amazon reviewers rate products with a number of stars, and reviews with three or fewer stars may yield opportunities to improve. Web scraping all reviews into a simple CSV file, with one review per row and the corresponding timestamp and number of stars in the next columns, will allow the analysis to subset the corpus by these added dimensions.

4. Extract features. Reviews will need to be cleaned so that text features can be analyzed. For this simple example, this may mean removing common words with little benefit like shoe or nike, running a spellcheck and making all text lowercase.

5. Analyze. A very simple way to analyze clean text, discussed in an early chapter, is to scan for a specific group of keywords. The text-mining analyst may want to scan for words given their subject matter expertise. Since the analysis is about shoe problems one could scan for fit, rip or tear, narrow, wide, sole, or any other possible quality problem from reviews. Then summing each could provide an indication of the most problematic feature. Keep in mind that this is an extremely simple example and the chapters build in complexity and analytical rigor beyond this illustration.

6. Reach an insight or recommendation. Armed with this frequency analysis, a text miner could present findings to the product manager and manufacturing personnel, for example that the top consumer issues relate to the terms narrow and fit. In practical application, it is best to offer more methodologies beyond keyword frequency as support for a finding.
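The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the four reviews, their stars and dates are invented, the column names (text, stars, date) are hypothetical, the data frame stands in for a scraped CSV file, and the spellcheck and common-word removal from step 4 are omitted for brevity.

```r
# Hypothetical reviews; in practice these would be read from a scraped CSV file
# with one review per row plus timestamp and star columns.
reviews <- data.frame(
  text  = c("Too narrow for my feet and the fit is off",
            "Great shoe, love it",
            "The sole started to tear after a month",
            "Narrow toe box, poor fit"),
  stars = c(2, 5, 1, 3),
  date  = as.Date(c("2016-03-01", "2016-03-05", "2015-01-10", "2016-04-02")),
  stringsAsFactors = FALSE
)

# Steps 2-3: keep recent reviews with three or fewer stars.
low <- reviews[reviews$stars <= 3 & reviews$date >= as.Date("2016-01-01"), ]

# Step 4: lowercase and remove punctuation.
clean <- gsub("[[:punct:]]", "", tolower(low$text))

# Step 5: count the reviews that mention each analyst-chosen keyword.
keywords <- c("fit", "rip", "tear", "narrow", "wide", "sole")
counts <- sapply(keywords, function(k) {
  sum(grepl(paste0("\\b", k, "\\b"), clean, perl = TRUE))
})

# Step 6: list the most frequently mentioned problems first.
sort(counts, decreasing = TRUE)
```

With this toy data only two reviews survive the star and date filters, and both mention narrow and fit. A real analysis would use far more reviews and, as noted in step 6, should support the keyword counts with other methods.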

1.6 A Real World Use Case

Marketers regularly learn best practices from each other. Unlike in other professions, many marketing efforts are visible outside of the enterprise, and competitors can see them easily. As a result, competitive intelligence in this space is rampant. It is also one reason why novel ideas are often copied and reused, after which they quickly lose salience with their intended audience. Text mining offers a quick way to understand the basics of a competitor's text-based public efforts.

When I worked at amazon.com, creating the social customer service team, we were obsessed with how others were doing it. We regularly read and reviewed other companies' replies
