Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS

Ebook685 pages5 hours

Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS

Name: Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS
Author: Dr. Goutam Chakraborty
ISBN: 9781612907871

By Dr. Goutam Chakraborty, Murali Pagolu and Satish Garla

Rating: 0 out of 5 stars

()

Read preview

About this ebook

Big data: It's unstructured, it's coming at you fast, and there's lots of it. In fact, the majority of big data is text-oriented, thanks to the proliferation of online sources such as blogs, emails, and social media.

However, having big data means little if you can't leverage it with analytics. Now you can explore the large volumes of unstructured text data that your organization has collected with Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS.

This hands-on guide to text analytics using SAS provides detailed, step-by-step instructions and explanations on how to mine your text data for valuable insight. Through its comprehensive approach, you'll learn not just how to analyze your data, but how to collect, cleanse, organize, categorize, explore, and interpret it as well. Text Mining and Analysis also features an extensive set of case studies, so you can see examples of how the applications work with real-world data from a variety of industries.

Text analytics enables you to gain insights about your customers' behaviors and sentiments. Leverage your organization's text data, and use those insights for making better business decisions with Text Mining and Analysis.

This book is part of the SAS Press program.

Skip carousel

LanguageEnglish

PublisherSAS Institute

Release dateNov 22, 2014

ISBN9781612907871

Author

Dr. Goutam Chakraborty

Dr. Goutam Chakraborty has a B. Tech (Honors) in mechanical engineering from the Indian Institute of Technology, Kharagpur; a PGCGM from the Indian Institute of Management, Calcutta; and an MS in statistics and a PhD in marketing from the University of Iowa. He has held managerial positions with a subsidiary of Union Carbide, USA, and with a subsidiary of British American Tobacco, UK. He is a professor of marketing at Oklahoma State University, where he has taught business analytics, marketing analytics, data mining, advanced data mining, database marketing, new product development, advanced marketing research, web-business strategy, interactive marketing, and product management for more than 20 years. Goutam has presented numerous programs and workshops to executives, educators, and research professionals in the US, Europe, Asia, and the Middle East. He has won many teaching awards, including the SAS Distinguished Professor Award from SAS Institute, and he teaches the popular SAS Business Knowledge Series course, "Text Analytics and Sentiment Mining Using SAS." Goutam's research has been published in many scholarly journals, such as the Journal of Interactive Marketing, Journal of Advertising Research, Journal of Advertising, Journal of Business Research, and Industrial Marketing Management. He coauthored the book Contemporary Database Marketing. In addition, Goutam has served on the editorial review board of the Journal of Business Research and Journal of Academy of Marketing Science. He serves as a member of the SAS Customer Analytics Advisory Board and the JMP Discovery Summit Steering Committee. Goutam has also consulted extensively on issues related to developing digital business strategy, building and managing customer relationships, product development, and management and creation of e-business models with companies such as Aetna, Mercruiser, Thrifty Rent-A-Car, Berendsen Fluid Power, Globe Life Insurance, Vanguard Realtors, Hilti, and Love's Travel Stops. He is the founder of the SAS and OSU Data Mining Certificate program as well as the SAS and OSU Business Analytics Certificate program at Oklahoma State University.

Related authors

Skip carousel

Related to Text Mining and Analysis

Related ebooks

Skip carousel

Data Quality for Analytics Using SAS
Ebook
Data Quality for Analytics Using SAS
byGerhard Svolba
Rating: 4 out of 5 stars
4/5
Categorical Data Analysis Using SAS, Third Edition
Ebook
Categorical Data Analysis Using SAS, Third Edition
byMaura E. Stokes
Rating: 0 out of 5 stars
0 ratings
Mastering Predictive Analytics with R
Ebook
Mastering Predictive Analytics with R
byRui Miguel Forte
Rating: 4 out of 5 stars
4/5
The Data Detective's Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data
Ebook
The Data Detective's Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data
byKim Chantala
Rating: 0 out of 5 stars
0 ratings
Python for SAS Users: A SAS-Oriented Introduction to Python
Ebook
Python for SAS Users: A SAS-Oriented Introduction to Python
byRandy Betancourt
Rating: 0 out of 5 stars
0 ratings
Practical Data Analysis - Second Edition
Ebook
Practical Data Analysis - Second Edition
byHector Cuesta
Rating: 0 out of 5 stars
0 ratings
Mastering Python for Data Science
Ebook
Mastering Python for Data Science
bySamir Madhavan
Rating: 3 out of 5 stars
3/5
Learning Social Media Analytics with R
Ebook
Learning Social Media Analytics with R
byDipanjan Sarkar
Rating: 0 out of 5 stars
0 ratings
SAS Viya: The R Perspective
Ebook
SAS Viya: The R Perspective
byYue Qi
Rating: 0 out of 5 stars
0 ratings
End-to-End Data Science with SAS: A Hands-On Programming Guide
Ebook
End-to-End Data Science with SAS: A Hands-On Programming Guide
byJames Gearheart
Rating: 0 out of 5 stars
0 ratings
Fundamentals of Programming in SAS: A Case Studies Approach
Ebook
Fundamentals of Programming in SAS: A Case Studies Approach
byJames Blum
Rating: 0 out of 5 stars
0 ratings
Practical and Efficient SAS Programming: The Insider's Guide
Ebook
Practical and Efficient SAS Programming: The Insider's Guide
byMartha Messineo
Rating: 0 out of 5 stars
0 ratings
Deep Learning for Numerical Applications with SAS
Ebook
Deep Learning for Numerical Applications with SAS
byHenry Bequet
Rating: 0 out of 5 stars
0 ratings
An Introduction to SAS Visual Analytics: How to Explore Numbers, Design Reports, and Gain Insight into Your Data
Ebook
An Introduction to SAS Visual Analytics: How to Explore Numbers, Design Reports, and Gain Insight into Your Data
byTricia Aanderud
Rating: 5 out of 5 stars
5/5
The SAS Programmer's PROC REPORT Handbook: ODS Companion
Ebook
The SAS Programmer's PROC REPORT Handbook: ODS Companion
byJane Eslinger
Rating: 0 out of 5 stars
0 ratings
Building a Recommendation System with R
Ebook
Building a Recommendation System with R
byGorakala Suresh K.
Rating: 0 out of 5 stars
0 ratings
Elementary Statistics Using SAS
Ebook
Elementary Statistics Using SAS
bySandra D. Schlotzhauer
Rating: 0 out of 5 stars
0 ratings
SAS Visual Analytics for SAS Viya
Ebook
SAS Visual Analytics for SAS Viya
bySAS Institute Inc.
Rating: 0 out of 5 stars
0 ratings
PROC SQL: Beyond the Basics Using SAS, Third Edition
Ebook
PROC SQL: Beyond the Basics Using SAS, Third Edition
byKirk Paul Lafler
Rating: 0 out of 5 stars
0 ratings
Building Big Data Applications
Ebook
Building Big Data Applications
byKrish Krishnan
Rating: 0 out of 5 stars
0 ratings
PROC DOCUMENT by Example Using SAS
Ebook
PROC DOCUMENT by Example Using SAS
byMichael Tuchman
Rating: 0 out of 5 stars
0 ratings
Exercises and Projects for The Little SAS Book, Sixth Edition
Ebook
Exercises and Projects for The Little SAS Book, Sixth Edition
byRebecca A. Ottesen
Rating: 0 out of 5 stars
0 ratings
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS
Ebook
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS
byMatthew Windham
Rating: 0 out of 5 stars
0 ratings
Learning Tableau
Ebook
Learning Tableau
byJoshua N. Milligan
Rating: 0 out of 5 stars
0 ratings
Practical Predictive Analytics
Ebook
Practical Predictive Analytics
byRalph Winters
Rating: 0 out of 5 stars
0 ratings
Introduction to Statistical and Machine Learning Methods for Data Science
Ebook
Introduction to Statistical and Machine Learning Methods for Data Science
byCarlos Andre Reis Pinheiro
Rating: 0 out of 5 stars
0 ratings
SAS Certification Prep Guide: Statistical Business Analysis Using SAS9
Ebook
SAS Certification Prep Guide: Statistical Business Analysis Using SAS9
byJoni N. Shreve, PhD
Rating: 0 out of 5 stars
0 ratings
Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner: A Beginner's Guide
Ebook
Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner: A Beginner's Guide
byOlivia Parr-Rud
Rating: 0 out of 5 stars
0 ratings
SAS Viya: The Python Perspective
Ebook
SAS Viya: The Python Perspective
byKevin D. Smith
Rating: 0 out of 5 stars
0 ratings
Simulating Data with SAS
Ebook
Simulating Data with SAS
byRick Wicklin
Rating: 0 out of 5 stars
0 ratings

Applications & Software For You

Skip carousel

Hacks for TikTok: 150 Tips and Tricks for Editing and Posting Videos, Getting Likes, Keeping Your Fans Happy, and Making Money
Ebook
Hacks for TikTok: 150 Tips and Tricks for Editing and Posting Videos, Getting Likes, Keeping Your Fans Happy, and Making Money
byKyle Brach
Rating: 5 out of 5 stars
5/5
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
Ebook
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
byNigel Tillery
Rating: 0 out of 5 stars
0 ratings
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
Ebook
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
byGwendolyn Faraday
Rating: 5 out of 5 stars
5/5
Logic Pro X For Dummies
Ebook
Logic Pro X For Dummies
byGraham English
Rating: 0 out of 5 stars
0 ratings
Sound Design for Filmmakers: Film School Sound
Ebook
Sound Design for Filmmakers: Film School Sound
byMurray Stiller
Rating: 5 out of 5 stars
5/5
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
Ebook
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
byKevin Clark
Rating: 5 out of 5 stars
5/5
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
Ebook
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
byDavid Mayer
Rating: 0 out of 5 stars
0 ratings
GarageBand For Dummies
Ebook
GarageBand For Dummies
byBob LeVitus
Rating: 5 out of 5 stars
5/5
Synthesizer Cookbook: How to Use Filters: Sound Design for Beginners, #2
Ebook
Synthesizer Cookbook: How to Use Filters: Sound Design for Beginners, #2
byScreech House
Rating: 3 out of 5 stars
3/5
Hilarious Jokes for Minecrafters: Mobs, Creepers, Skeletons, and More
Ebook
Hilarious Jokes for Minecrafters: Mobs, Creepers, Skeletons, and More
byMichele C. Hollow
Rating: 1 out of 5 stars
1/5
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
Ebook
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
byAlex Parkinson
Rating: 4 out of 5 stars
4/5
The Unofficial Guide to Open Broadcaster Software: OBS: The World's Most Popular Free Live-Streaming Application
Ebook
The Unofficial Guide to Open Broadcaster Software: OBS: The World's Most Popular Free Live-Streaming Application
byPaul Richards
Rating: 0 out of 5 stars
0 ratings
Adobe Photoshop: A Complete Course and Compendium of Features
Ebook
Adobe Photoshop: A Complete Course and Compendium of Features
byStephen Laskevitch
Rating: 5 out of 5 stars
5/5
The Little SAS Book: A Primer, Sixth Edition
Ebook
The Little SAS Book: A Primer, Sixth Edition
byLora D. Delwiche
Rating: 5 out of 5 stars
5/5
iPhone Photography For Dummies
Ebook
iPhone Photography For Dummies
byMark Hemmings
Rating: 0 out of 5 stars
0 ratings
Adobe Illustrator: A Complete Course and Compendium of Features
Ebook
Adobe Illustrator: A Complete Course and Compendium of Features
byJason Hoppe
Rating: 0 out of 5 stars
0 ratings
Blender 3D Basics Beginner's Guide Second Edition
Ebook
Blender 3D Basics Beginner's Guide Second Edition
byGordon Fisher
Rating: 5 out of 5 stars
5/5
Mastering QuickBooks 2020: The ultimate guide to bookkeeping and QuickBooks Online
Ebook
Mastering QuickBooks 2020: The ultimate guide to bookkeeping and QuickBooks Online
byCrystalynn Shelton
Rating: 0 out of 5 stars
0 ratings
Start Your Own Podcast Business: Your Step-By-Step Guide to Success
Ebook
Start Your Own Podcast Business: Your Step-By-Step Guide to Success
byThe Staff of Entrepreneur Media
Rating: 5 out of 5 stars
5/5
Experts' Guide to OneNote
Ebook
Experts' Guide to OneNote
byJeremy P. Jones
Rating: 5 out of 5 stars
5/5
GarageBand Basics: The Complete Guide to GarageBand: Music
Ebook
GarageBand Basics: The Complete Guide to GarageBand: Music
byAventuras De Viaje
Rating: 0 out of 5 stars
0 ratings
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
Ebook
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
byEMC Education Services
Rating: 0 out of 5 stars
0 ratings
Vocal Rescue: Rediscover the Beauty, Power and Freedom in Your Singing
Ebook
Vocal Rescue: Rediscover the Beauty, Power and Freedom in Your Singing
byLois Alba
Rating: 4 out of 5 stars
4/5
Affinity Photo How To
Ebook
Affinity Photo How To
byRobin Whalley
Rating: 0 out of 5 stars
0 ratings
How Do I Do That In InDesign?
Ebook
How Do I Do That In InDesign?
byDave Clayton
Rating: 5 out of 5 stars
5/5
Mastering ChatGPT
Ebook
Mastering ChatGPT
byCharles J. Jones
Rating: 0 out of 5 stars
0 ratings
Six Figure Blogging In 3 Months
Ebook
Six Figure Blogging In 3 Months
byShekhar Mishra
Rating: 4 out of 5 stars
4/5
Adobe InDesign CC: A Complete Course and Compendium of Features
Ebook
Adobe InDesign CC: A Complete Course and Compendium of Features
byStephen Laskevitch
Rating: 0 out of 5 stars
0 ratings
iPhone X Hacks, Tips and Tricks: Discover 101 Awesome Tips and Tricks for iPhone XS, XS Max and iPhone X
Ebook
iPhone X Hacks, Tips and Tricks: Discover 101 Awesome Tips and Tricks for iPhone XS, XS Max and iPhone X
byDavid Cromwell
Rating: 3 out of 5 stars
3/5
FL Studio Cookbook
Ebook
FL Studio Cookbook
byShaun Friedman
Rating: 4 out of 5 stars
4/5

Related podcast episodes

Skip carousel

#54 Women in Data Science
Podcast episode
#54 Women in Data Science
byDataFramed
0 ratings
0% found this document useful
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
Podcast episode
[DataFramed Careers Series #2] What Makes a Great Data Science Portfolio
byDataFramed
0 ratings
0% found this document useful
Data Visualization with Manuel Lima: Gabi Ferrara and Jon Foust are back today and joined by fellow Googler Manuel Lima.
Podcast episode
Data Visualization with Manuel Lima: Gabi Ferrara and Jon Foust are back today and joined by fellow Googler Manuel Lima.
byGoogle Cloud Platform Podcast
0 ratings
0% found this document useful
#08 - Tech stack: Metabase, Superset, Redash, Grafana
Podcast episode
#08 - Tech stack: Metabase, Superset, Redash, Grafana
byTOPP - The Open Podcast Podcast
0 ratings
0% found this document useful
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Podcast episode
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
byData Engineering Podcast
0 ratings
0% found this document useful
Composable Data Analytics
Podcast episode
Composable Data Analytics
byThe Cloudcast
0 ratings
0% found this document useful
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
Podcast episode
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
byData Engineering Podcast
0 ratings
0% found this document useful
Unpacking The Seven Principles Of Modern Data Pipelines: Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.
Podcast episode
Unpacking The Seven Principles Of Modern Data Pipelines: Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.
byData Engineering Podcast
0 ratings
0% found this document useful
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
Podcast episode
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
byData Engineering Podcast
0 ratings
0% found this document useful
Use Your Data Warehouse To Power Your Product Analytics With NetSpring: With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
Podcast episode
Use Your Data Warehouse To Power Your Product Analytics With NetSpring: With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
byData Engineering Podcast
0 ratings
0% found this document useful
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer: Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Podcast episode
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer: Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
byData Engineering Podcast
0 ratings
0% found this document useful
Analytics for a Better World - Parvathy Krishnan
Podcast episode
Analytics for a Better World - Parvathy Krishnan
byDataTalks.Club
0 ratings
0% found this document useful
Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service: A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Podcast episode
Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service: A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a parallel set of practices to that of DevOps teams as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
byData Engineering Podcast
0 ratings
0% found this document useful
Let The Whole Team Participate In Data With The Quilt Versioned Data Hub: Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
Podcast episode
Let The Whole Team Participate In Data With The Quilt Versioned Data Hub: Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
byData Engineering Podcast
0 ratings
0% found this document useful
Surveying The Market Of Database Products: Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.
Podcast episode
Surveying The Market Of Database Products: Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.
byData Engineering Podcast
0 ratings
0% found this document useful
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI: The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
Podcast episode
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI: The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
byData Engineering Podcast
0 ratings
0% found this document useful
Data Sharing Across Business And Platform Boundaries: Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Podcast episode
Data Sharing Across Business And Platform Boundaries: Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
byData Engineering Podcast
0 ratings
0% found this document useful
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Podcast episode
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
byData Engineering Podcast
0 ratings
0% found this document useful
BAM 068: Tagging, Data Models, and Data Normalization: Lately, there have been a ton of questions about data tagging and data normalization. But what does that actually mean? This is a super important topic especially when you consider that having common data formats is required to implement analytics,...
Podcast episode
BAM 068: Tagging, Data Models, and Data Normalization: Lately, there have been a ton of questions about data tagging and data normalization. But what does that actually mean? This is a super important topic especially when you consider that having common data formats is required to implement analytics,...
byThe Smart Buildings Academy Podcast | Teaching You Building Automation, Systems Integration, and Information Technology
0 ratings
0% found this document useful
Build A Data Lake For Your Security Logs With Scanner: Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Podcast episode
Build A Data Lake For Your Security Logs With Scanner: Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
byData Engineering Podcast
0 ratings
0% found this document useful
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Podcast episode
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
byData Engineering Podcast
0 ratings
0% found this document useful
What Happened to Data Marts?
Podcast episode
What Happened to Data Marts?
byInsights Tomorrow
0 ratings
0% found this document useful
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
Podcast episode
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
byData Engineering Podcast
0 ratings
0% found this document useful
Making Email Better With AI At Shortwave: Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.
Podcast episode
Making Email Better With AI At Shortwave: Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.
byData Engineering Podcast
0 ratings
0% found this document useful
Pushing The Limits Of Scalability And User Experience For Data Processing WIth Jignesh Patel: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Podcast episode
Pushing The Limits Of Scalability And User Experience For Data Processing WIth Jignesh Patel: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
byData Engineering Podcast
0 ratings
0% found this document useful
Building Applications With Data As Code On The DataOS: The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
Podcast episode
Building Applications With Data As Code On The DataOS: The modern data stack has made it more economical to use enterprise grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
byData Engineering Podcast
0 ratings
0% found this document useful
Designing Data Platforms For Fintech Companies: Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Podcast episode
Designing Data Platforms For Fintech Companies: Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
byData Engineering Podcast
0 ratings
0% found this document useful
How to pick projects for a professional data science team: This week's episodes is for data scientists, sure…
Podcast episode
How to pick projects for a professional data science team: This week's episodes is for data scientists, sure…
byLinear Digressions
0 ratings
0% found this document useful
Using Data To Illuminate The Intentionally Opaque Insurance Industry: The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
Podcast episode
Using Data To Illuminate The Intentionally Opaque Insurance Industry: The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
byData Engineering Podcast
0 ratings
0% found this document useful
The Modern SaaS Industry with Austin Petersmith
Podcast episode
The Modern SaaS Industry with Austin Petersmith
byFYI - For Your Innovation
0 ratings
0% found this document useful

Skip carousel

Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Chicago Tribune
Article
Want A Job In Data Science? You Might Have To Take A Standardized Test When Applying
Jul 10, 2018
3 min read
Understanding ELT & ETL
Techfastly
Article
Understanding ELT & ETL
Apr 1, 2021
8 min read
01 Giving Data Collectors—and Donors—a Real-Time Rush
Fast Company
Article
01 Giving Data Collectors—and Donors—a Real-Time Rush
Mar 20, 2017
7 min read
Data Analytics: From Bias to Better Decisions
Rotman Management
Article
Data Analytics: From Bias to Better Decisions
Sep 1, 2018
7 min read
Saxo Bank And Thoughtworks: Enabling Data Democratization At A Global Investment Bank
Business Today
Article
Saxo Bank And Thoughtworks: Enabling Data Democratization At A Global Investment Bank
Jan 20, 2023
2 min read
Salesforce Adding Einstein Analytics Al To Tableau Platform
Techfastly
Article
Salesforce Adding Einstein Analytics Al To Tableau Platform
Feb 4, 2021
3 min read
Q&A
Rotman Management
Article
Q&A
May 1, 2023
Describe the capability that companies like Netflix, UPS, Amazon and Caesars Entertainment have in common. These are all leading firms in their industries with respect to leveraging analytics as a source of competitive advantage. We now have so much
7 min read
Data-driven Decision Making That Uses Data, Mind And Heart
The European Business Review
Article
Data-driven Decision Making That Uses Data, Mind And Heart
Jan 31, 2020
14 min read
What European Banks Need to Know about Competing with Ecosystems
The European Business Review
Article
What European Banks Need to Know about Competing with Ecosystems
Dec 3, 2019
6 min read
Enterprise Soaring Success
Linux Format
Article
Enterprise Soaring Success
Aug 27, 2019
7 min read
Inform And Enhance Your Business With Open Data
PC Pro Magazine
Article
Inform And Enhance Your Business With Open Data
Jun 10, 2021
7 min read
AI As A Service
PC Pro Magazine
Article
AI As A Service
Jul 9, 2020
2 min read
Beta Yourself Rss
Stuff UK
Article
Beta Yourself Rss
Mar 17, 2022
2 min read
Inside APC
APC
Article
Inside APC
Oct 31, 2022
2 min read
Beta Yourself Rss
Stuff Magazine South Africa
Article
Beta Yourself Rss
Apr 4, 2022
2 min read
The Future Of Cannabis Data
High Times
Article
The Future Of Cannabis Data
Jan 10, 2024
3 min read
Thriving As An Ecosystem Partner
The European Business Review
Article
Thriving As An Ecosystem Partner
Sep 30, 2022
Researching ecosystems that span industries from e-commerce and publishing to semiconductors and healthcare over the past decade, we found companies that have been successful for years by contributing to an ecosystem. Sometimes, by contributing as pa
10 min read
Getting The edge
The European Business Review
Article
Getting The edge
Feb 25, 2021
7 min read
Your Data Is Yours
Linux Format
Article
Your Data Is Yours
May 4, 2021
“It’s nice to let someone else take over occasionally. For example, you might want to hand over database management to a service provider to free up capacity and focus on key business initiatives. Managed databases are increasingly popular – in our l
1 min read
Google Answer Box Strategy
Techfastly
Article
Google Answer Box Strategy
Sep 21, 2020
Leveraging the Google PAA (People Also Ask) element on a Search Results Page for Targeted Content Creation with a Python Scraper All businesses that are online today are creating content at a furious pace. According to Technavio, a research firm, con
7 min read
Inside APC
APC
Article
Inside APC
Sep 5, 2022
APC is Australia’s oldest consumer technology magazine – having been consistently in print for forty years, since our first issue way back in May 1980 – and we take that heritage and responsibility very seriously. While our focus is obviously on the
2 min read
Inside APC
APC
Article
Inside APC
Aug 8, 2022
APC is Australia’s oldest consumer technology magazine – having been consistently in print for forty years, since our first issue way back in May 1980 – and we take that heritage and responsibility very seriously. While our focus is obviously on the
2 min read
Inside APC
APC
Article
Inside APC
Oct 3, 2022
2 min read
Business NAS appliances 2022
PC Pro Magazine
Article
Business NAS appliances 2022
Apr 10, 2022
4 min read
Contributing For Non - Coders
Linux Format
Article
Contributing For Non - Coders
Jan 10, 2023
9 min read
HoudahSpot 5
MacLife
Article
HoudahSpot 5
Jun 25, 2019
2 min read
Your Next Steps
Linux Format
Article
Your Next Steps
Dec 15, 2020
There are many places you could take this going forwards. For reasons of space and readability, we’ve left out processing of other useful fields from the source XML file. As well as RatingValue , each business gets a score for ConfidenceInManagement
1 min read
Web App Security
Linux Format
Article
Web App Security
Jun 29, 2021
8 min read
A Continuously Improving Workplace
Artichoke
Article
A Continuously Improving Workplace
Aug 27, 2017
3 min read
Code A Cataloguing Application In Python
Linux Format
Article
Code A Cataloguing Application In Python
Nov 15, 2022
Credit: www.djangoproject.com Matt Holder has been a fan of the open source methodology for over two decades and uses Linux and other tools where possible. More featurepacked source code for this project can be downloaded from https://github.com/mat
8 min read

Related categories

Skip carousel

Reviews for Text Mining and Analysis

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Text Mining and Analysis - Dr. Goutam Chakraborty

Text Mining and Analysis

Practical Methods, Examples, and Case Studies Using SAS®

Goutam Chakraborty, Murali Pagolu, Satish Garla

support.sas.com/bookstore

The correct bibliographic citation for this manual is as follows: Chakraborty, Goutam, Murali Pagolu, and Satish Garla. 2013. Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS®. Cary, NC: SAS Institute Inc.

Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS®

ISBN 978-1-61290-787-1

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.

U.S. Government Restricted Rights: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.

November 2013

SAS provides a complete selection of books and electronic products to help customers use SAS® software to its fullest potential. For more information about our offerings, visit support.sas.com/bookstore or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

About This Book

About The Authors

Acknowledgments

Chapter 1 Introduction to Text Analytics

Overview of Text Analytics

Text Mining Using SAS Text Miner

Information Retrieval

Document Classification

Ontology Management

Information Extraction

Clustering

Trend Analysis

Enhancing Predictive Models Using Exploratory Text Mining

Sentiment Analysis

Emerging Directions

Handling Big (Text) Data

Voice Mining

Real-Time Text Analytics

Summary

References

Chapter 2 Information Extraction Using SAS Crawler

Introduction to Information Extraction and Organization

SAS Crawler

SAS Search and Indexing

SAS Information Retrieval Studio Interface

Web Crawler

Breadth First

Depth First

Web Crawling: Real-World Applications and Examples

Understanding Core Component Servers

Proxy Server

Pipeline Server

Component Servers of SAS Search and Indexing

Indexing Server

Query Server

Query Web Server

Query Statistics Server

SAS Markup Matcher Server

Summary

References

Chapter 3 Importing Textual Data into SAS Text Miner

An Introduction to SAS Enterprise Miner and SAS Text Miner

Data Types, Roles, and Levels in SAS Text Miner

Creating a Data Source in SAS Enterprise Miner

Importing Textual Data into SAS

Importing Data into SAS Text Miner Using the Text Import Node

%TMFILTER Macro

Importing XLS and XML Files into SAS Text Miner

Managing Text Using SAS Character Functions

Summary

References

Chapter 4 Parsing and Extracting Features

Introduction

Tokens and Words

Lemmatization

POS Tags

Parsing Tree

Text Parsing Node in SAS Text Miner

Stemming and Synonyms

Identifying Parts of Speech

Using Start and Stop Lists

Spell Checking

Entities

Building Custom Entities Using SAS Contextual Extraction Studio

Summary

References

Chapter 5 Data Transformation

Introduction

Zipf’s Law

Term-By-Document Matrix

Text Filter Node

Frequency Weightings

Term Weightings

Filtering Documents

Concept Links

Summary

References

Chapter 6 Clustering and Topic Extraction

Introduction

What Is Clustering?

Singular Value Decomposition and Latent Semantic Indexing

Topic Extraction

Scoring

Summary

References

Chapter 7 Content Management

Introduction

Content Categorization

Types of Taxonomy

Statistical Categorizer

Rule-Based Categorizer

Comparison of Statistical versus Rule-Based Categorizers

Determining Category Membership

Concept Extraction

Contextual Extraction

CLASSIFIER Definition

SEQUENCE and PREDICATE_RULE Definitions

Automatic Generation of Categorization Rules Using SAS Text Miner

Differences between Text Clustering and Content Categorization

Summary

Appendix

References

Chapter 8 Sentiment Analysis

Introduction

Basics of Sentiment Analysis

Challenges in Conducting Sentiment Analysis

Unsupervised versus Supervised Sentiment Classification

SAS Sentiment Analysis Studio Overview

Statistical Models in SAS Sentiment Analysis Studio

Rule-Based Models in SAS Sentiment Analysis Studio

SAS Text Miner and SAS Sentiment Analysis Studio

Summary

References

Case Studies

Case Study 1 Text Mining SUGI/SAS Global Forum Paper Abstracts to Reveal Trends

Introduction

Data

Results

Trends

Summary

Instructions for Accessing the Case Study Project

Case Study 2 Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions

Introduction

Objective

Step-by-Step Instructions

Summary

Case Study 3 Features-based Sentiment Analysis of Customer Reviews

Introduction

Data

Text Mining for Negative App Reviews

Text Mining for Positive App Reviews

NLP Based Sentiment Analysis

Summary

Case Study 4 Exploring Injury Data for Root Causal and Association Analysis

Introduction

Objective

Data Description

Step-by-Step Instructions

Part 1: SAS Text Miner

Part 2: SAS Enterprise Content Categorization

Summary

Case Study 5 Enhancing Predictive Models Using Textual Data

Data Description

Step-by-Step Instructions

Summary

Case Study 6 Opinion Mining of Professional Drivers’ Feedback

Introduction

Data

Analysis Using SAS® Text Miner

Analysis Using the Text Rule-builder Node

Summary

Case Study 7 Information Organization and Access of Enron Emails to Help Investigation

Introduction

Objective

Step-by-Step Software Instruction with Settings/Properties

Summary

Case Study 8 Unleashing the Power of Unified Text Analytics to Categorize Call Center Data

Introduction

Data Description

Examining Topics

Merging or Splitting Topics

Categorizing Content

Concept Map Visualization

Using PROC DS2 for Deployment DEPLOYMENT

Integrating with SAS® Visual Analytics

Summary

Case Study 9 Evaluating Health Provider Service Performance Using Textual Responses

Introduction

Summary

Index

About This Book

Purpose

Analytics is the key driver of how organizations make business decisions to gain competitive advantage. While the popular press has been abuzz with Big Data, we believe in it is the analysis, stupid. Having Big Data means little if that data is not leveraged via analytics to create better value for all stakeholders. One of the primary drivers of Big Data is the advent of social media that has exponentially increased the rate at which textual data is generated on the Internet and the World Wide Web. In addition to data generated via the Internet and the web, organizations have large repositories of textual data collected via forms, reports, customer surveys, voice-of-customers, call-center records and so on. There are numerous organizations that simply collect and store large volumes of unstructured text data, which are yet to be explored to uncover hidden nuggets of useful information that can benefit their business. However, there are not a lot of resources available that can efficiently handle text data for the business analyst community. This book is designed to help industries leverage their textual data and SAS tools to perform comprehensive text analytics.

Is This Book for You?

Typical readers are business analysts, data analysts, customer intelligence analysts, customer insights analysts, web analysts, social media analysts, students in professional courses related to analytics and SAS programmers. Anyone who wants to retrieve, organize, categorize, analyze, and interpret textual data for generating insights about customer and prospects’ behaviors, their sentiments and want to use such insights for making better decisions will find this book useful.

Prerequisites

While some familiarity with SAS products will be beneficial, this book is intended for anyone who is willing to learn how to apply text analytics using primarily the point-and-click interfaces of SAS Enterprise Miner, SAS Text Miner, SAS Content Categorization Studio, SAS Information Retrieval Studio and SAS Sentiment Analysis studio.

Software Used to Develop the Book’s Content

Below is a list of software used in this book. Be sure to check out the SAS website for updates and changes to the software. The SAS support website contains the latest Online Help documents that have enhancements and changes in new releases of the software.

• SAS® Enterprise Miner (Release 7.1 and Release 12.1)

• SAS® Text Miner (Release 4.1 and 5.1)

• SAS® Crawler, SAS® Search and Indexing (Release 12.1)

• SAS® Enterprise Content Categorization Studio (Release 12.1)

• SAS® Sentiment Analysis Studio (Release 12.1)

Note: SAS® Information Retrieval Studio is a graphical user interface (GUI) based framework using SAS® Crawler, SAS® Search and Indexing components that can be configured and maintained.

Example Code and Data

You can access the example code and data for this book at http://support.sas.com/publishing/authors. From this website select Goutam Chakraborty or Murali Pagolu or Satish Garla. Then look for the cover image of this book, and select Example Code and Data to download the SAS programs and data that are included in this book. The data and programs are organized by chapter and case study.

The case studies in this book contain step-by-step instructions for performing a specific type of analysis with the given data. A lot of text mining tasks are subjective and iterative. It is difficult to list each and every task performed in the analysis. The results that you see in your analysis when you follow the exact steps as listed in the case study might differ slightly from the screenshots in the case study. Hence, we also provide you with the SAS® Enterprise Miner projects that the case study authors created in their analysis. These projects can be accessed from the authors’ website.

For an alphabetical listing of all books for which example code and data is available, see http://support.sas.com/bookcode. Select a title to display the book’s example code.

If you are unable to access the code through the Web site, send e-mail to saspress@sas.com.

Additional Resources

SAS offers you a rich variety of resources to help build your SAS skills and explore and apply the full power of SAS software. Whether you are in a professional or academic setting, we have learning products that can help you maximize your investment in SAS.

Keep in Touch

We look forward to hearing from you. We invite questions, comments, and concerns. If you want to contact us about a specific book, please include the book title in your correspondence.

To Contact the Authors through SAS Press

By e-mail: saspress@sas.com

Via the Web: http://support.sas.com/author_feedback

SAS Books

For a complete list of books available through SAS, visit http://support.sas.com/bookstore.

Phone: 1-800-727-3228

Fax: 1-919-677-8166

E-mail: sasbook@sas.com

SAS Book Report

Receive up-to-date information about all new SAS publications via e-mail by subscribing to the SAS Book Report monthly eNewsletter. Visit http://support.sas.com/sbr.

About The Authors

Murali Pagolu is a Business Analytics Consultant at SAS and has four years of experience using SAS software in both academic research and business applications. His focus areas include database marketing, marketing research, data mining and customer relationship management (CRM) applications, customer segmentation, and text analytics. Murali is responsible for implementing analytical solutions and developing proofs of concept for SAS customers. He has presented innovative applications of text analytics, such as mining text comments from YouTube videos and patent portfolio analysis, at past SAS Analytics conferences. He currently holds six SAS certification credentials.

Satish Garla is an Analytical Consultant in Risk Practice at SAS. He has extensive experience in risk modeling for healthcare, predictive modeling, text analytics, and SAS programming. He has a distinguished academic background in analytics, databases, and business administration. Satish holds a master’s degree in Management Information Systems at Oklahoma State University and has completed the SAS and OSU Data Mining Certificate program. He is a SAS Certified Advanced Programmer for SAS 9 and a Certified Predictive Modeler using SAS Enterprise Miner 6.1.

Learn more about these authors by visiting their author pages, where you can download free chapters, access example code and data, read the latest reviews, get updates, and more:

http://support.sas.com/chakraborty

http://support.sas.com/pagolu

http://support.sas.com/garla

Acknowledgments

The authors would like to extend their gratitude to Radhika Kulkarni and Saratendu Sethi who have been consistently extending their support, encouragement and guidance throughout the development of this book. A special mention of the technical experts James Cox and Terry Woodfield for their invaluable input and suggestions that have greatly helped us in shaping the book. We also would like to thank Lise Cragen for her valuable input and suggestions.

We would like to express our appreciation and thanks to all of the technical reviewers of the book: Barry deVille, Fiona McNeill, Meilan Ji, Penny (Ping) Ye, Praveen Lakkaraju, Vivek Ajmani, Youqin Pan (Salem State University), and Zhongyi Liu for spending their precious time in reviewing the content for the book and providing constructive feedback.

We are also thankful to Arila Barnes, Dan Zaratsian, Gary Gaeth, Jared Peterson, Jiawen Liu, Maheshwar Nareddy, Mantosh Sarkar, Mary Osborne, Saratendu Sethi, and Zubair Shaikh for their valuable contributions to the case studies in this book.

We would also like to express our deepest gratitude to the SAS Publications Production team: Aimee Rodriguez, Amy Wolfe, Brenna Leath, Denise T. Jones, John West and Shelley Sessoms. Without their patience, help, advice and support through the thick and thin, this book would have never seen the light of day.

Chapter 1 Introduction to Text Analytics

Overview of Text Analytics

Text Mining Using SAS Text Miner

Information Retrieval

Document Classification

Ontology Management

Information Extraction

Clustering

Trend Analysis

Enhancing Predictive Models Using Exploratory Text Mining

Sentiment Analysis

Emerging Directions

Handling Big (Text) Data

Voice Mining

Real-Time Text Analytics

Summary

References

Overview of Text Analytics

Text analytics helps analysts extract meanings, patterns, and structure hidden in unstructured textual data. The information age has led to the development of a wide variety of tools and infrastructure to capture and store massive amounts of textual data. In a 2009 report, the International Data Corporation (IDC) estimated that approximately 80% percent of the data in an organization is text based. It is not practical for any individual (or group of individuals) to process huge textual data and extract meanings, sentiments, or patterns out of the data. A paper written by Hans Peter Luhn, titled The Automatic Creation of Literature Abstracts, is perhaps one of the earliest research projects conducted on text analytics. Luhn writes about applying machine methods to automatically generate an abstract for a document. In a traditional sense, the term text mining is used for automated machine learning and statistical methods that encompass a bag-of-words approach. This approach is typically used to examine content collections versus assessing individual documents. Over time, the term text analytics has evolved to encompass a loosely integrated framework by borrowing techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management.

Text analytics applications are popular in the business environment. These applications produce some of the most innovative and deeply insightful results. Text analytics is being implemented in many industries. There are new types of applications every day. In recent years, text analytics has been heavily used for discovering trends

in textual data. Using social media data, text analytics has been used for crime prevention and fraud detection. Hospitals are using text analytics to improve patient outcomes and provide better care. Scientists in the pharmaceutical industry are to mine biomedical literature to discover new drugs.

Text analytics incorporates tools and techniques that are used to derive insights from unstructured data. These techniques can be broadly classified as the following:

• information retrieval

• exploratory analysis

• concept extraction

• summarization

• categorization

• sentiment analysis

• content management

• ontology management

In these techniques, exploratory analysis, summarization, and categorization are in the domain of text mining. Exploratory analysis includes techniques such as topic extraction, cluster analysis, etc. The term text analytics is somewhat synonymous with text mining (or text data mining). Text mining can be best conceptualized as a subset of text analytics that is focused on applying data mining techniques in the domain of textual information using NLP and machine learning. Text mining considers only syntax (the study of structural relationships between words). It does not deal with phonetics, pragmatics, and discourse.

Sentiment analysis can be treated as classification analysis. Therefore, it is considered predictive text mining. At a high level, the application areas of these techniques divide the text analytics market into two areas: search and descriptive and predictive analytics. (See Display 1.1.) Search includes numerous information retrieval techniques, whereas descriptive and predictive analytics include text mining and sentiment analysis.

Display 1.1: High-Level Classification of Text Analytics Market and Corresponding SAS Tools

Display 1.1: High-Level Classification of Text Analytics Market and Corresponding SAS Tools

SAS has multiple tools to address a variety of text analytics techniques for a range of business applications. Display 1.1 shows the SAS tools that address different areas of text analytics. In a typical situation, you might need to use more than one tool for solving a text analytics problem. However, there is some overlap in the underlying features that some of these tools have to offer. Display 1.2 provides an integrated view of SAS Text Analytics tools. It shows, at a high level, how they are organized in terms of functionality and scope. SAS Crawler can extract content from the web, file systems, or feeds, and then send it as input to SAS Text Miner, SAS Sentiment Analysis Studio, or SAS Content Categorization. These tools are capable of sending content to the indexing server where information is indexed. The query server enables you to enter search queries and retrieve relevant information from the indexed content.

SAS Text Miner, SAS Sentiment Analysis Studio, and SAS Content Categorization form the core of the SAS Text Analytics tools arsenal for analyzing text data. NLP features such as tokenization, parts-of-speech recognition, stemming, noun group detection, and entity extraction are common among these tools. However, each of these tools has unique capabilities that differentiate them individually from the others. In the following section, the functionality and usefulness of these tools are explained in detail.

Display 1.2: SAS Text Analytics Tools: An Integrated Overview

Display 1.2: SAS Text Analytics Tools: An Integrated Overview

The following paragraphs briefly describe each tool from the SAS Text Analytics suite as presented in Display 1.2:

• SAS Crawler, SAS Search and Indexing – Useful for extracting textual content from the web or from documents stored locally in an organized way. For example, you can download news articles from websites and use SAS Text Miner to conduct an exploratory analysis, such as extracting key topics or themes from the news articles. You can build indexes and submit queries on indexed documents through a dedicated query interface.

• SAS Ontology Management – Useful for integrating existing document repositories in enterprises and identifying relationships between them. This tool can help subject matter experts in a knowledge domain create ontologies and establish hierarchical relationships of semantic terms to enhance the process of search and retrieval on the document repositories.

Note: SAS Ontology Management is not discussed in this book because we primarily focus on areas where the majority of current business applications are relevant for textual data.

• SAS Content Categorization – Useful for classifying a document collection into a structured hierarchy of categories and subcategories called taxonomy. In addition to categorizing documents, it can be used to extract facts from them. For example, news articles can be classified into a predefined set of categories such as politics, sports, business, financial, etc. Factual information such as events, places, names of people, dates, monetary values, etc., can be easily retrieved using this tool.

• SAS Text Miner – Useful for extracting the underlying key topics or themes in textual documents. This tool offers the capability to group similar documents—called clusters—based on terms and their frequency of occurrence in the corpus of documents and within each document. It provides a feature called concept linking to explore the relationships between terms and their strength of association.

For example, textual transcripts from a customer call center can be fed into this tool to automatically cluster the transcripts. Each cluster has a higher likelihood of having similar problems reported by customers. The specifics of the problems can be understood by reviewing the descriptive terms explaining each of the clusters. A pictorial representation of these problems and the associated terms, events, or people can be viewed through concept linking, which shows how strongly an event can be related to a problem.

SAS Text Miner enables the user to define custom topics or themes. Documents can be scored based on the presence of the custom topics. In the presence of a target variable, supervised classification or prediction models can be built using SAS Text Miner. The predictions of a prediction model with numerical inputs can be improved using topics, clusters, or rules that can be extracted from textual comments using SAS Text Miner.

• SAS Sentiment Analysis – Useful for identifying the sentiment toward an entity in a document or the overall sentiment toward the entire document. An entity can be anything, such as a product, an attribute of a product, brand, person, group, or even an organization. The sentiment evaluated is classified as positive or negative or neutral or unclassified. If there are no terms associated with an entity or the entire document that reflect the sentiment, it is tagged unclassified.

Sentiment analysis is generally applied to a class of textual information such as customers’ reviews on products, brands, organizations, etc., or to responses to public events such as presidential elections.

This type of information is largely available on social media sites such as Facebook, Twitter, YouTube, etc.

Text Mining Using SAS Text Miner

A typical predictive data mining problem deals with data in numerical form. However, textual data is typically available only in a readable document form. Forms could be e-mails, user comments, corporate reports, news articles, web pages, etc. Text mining attempts to first derive a quantitative representation of documents. Once the text is transformed into a set of numbers that adequately capture the patterns in the textual data, any traditional statistical or forecasting model or data mining algorithm can be used on the numbers for generating insights or for predictive modeling.

A typical text mining project involves the following tasks:

1. Data Collection: The first step in any text mining research project is to collect the textual data required for analysis.

2. Text Parsing and Transformation: The next step is to extract, clean, and create a dictionary of words from the documents using NLP. This includes identifying sentences, determining parts of speech, and stemming words. This step involves parsing the extracted words to identify entities, removing stop words, and spell-checking. In addition to extracting words from documents, variables associated with the text such as date, author, gender, category, etc., are retrieved.

The most important task after parsing is text transformation. This step deals with the numerical representation of the text using linear algebra-based methods, such as latent semantic analysis (LSA), latent semantic indexing (LSI), and vector space model. This exercise results in the creation of a term- by-document matrix (a spreadsheet or flat-like numeric representation of textual data as shown in Table 1.1). The dimensions of the matrix are determined by the number of documents and the number of terms in the collection. This step might involve dimension reduction of the term-by-document matrix using singular value decomposition (SVD).

Consider a collection of three reviews (documents) of a book as provided below: Document 1: I am an avid fan of this sport book. I love this book.

Document 2: This book is a must for athletes and sportsmen. Document 3: This book tells how to command the sport.

Parsing this document collection generates the following term-by-document matrix in Table 1.1:

Table 1.1: Term-By-Document Matrix

3. Text Filtering: In a corpus of several thousands of documents, you will likely have many terms that are irrelevant to either differentiating documents from each other or to summarizing the documents. You will have to manually browse through the terms to eliminate irrelevant terms. This is often one of the most time-consuming and subjective tasks in all of the text mining steps. It requires a fair amount of subject matter knowledge (or domain expertise). In addition to term filtering, documents irrelevant to the analysis are searched using keywords. Documents are filtered if they do not contain some of the terms or filtered based on one of the other document variables such as date, category, etc. Term filtering or document filtering alters the term-by-document matrix. As shown in Table 1.1, the term- by-document matrix contains the frequency of the occurrence of the term in the document as the presence of the term in a document as the value for each cell. From this frequency matrix, a matrix is generated using various term-weighting techniques.

4. Text Mining: This step involves applying traditional data mining algorithms such as clustering, classification, association analysis, and link analysis. As shown in Display 1.3, text mining is an iterative process, which involves repeating the analysis using different settings and including or excluding terms for better results. The outcome of this step can be clusters of documents, lists of single-term or multi-term topics, or rules that answer a classification problem. Each of these steps is discussed in detail in Chapter 3 to Chapter 7.

Display 1.3: Text Mining Process Flow

Display 1.3: Text Mining Process Flow

Information Retrieval

Information retrieval, commonly known as IR, is the study of searching and retrieving a subset of documents from a universe of document collections in response to a search query. The documents are often unstructured in nature and contain vast amounts of textual data. The documents retrieved should be relevant to the information needs of the user who performed the search query. Several applications of the IR process have evolved in the past decade. One of the most ubiquitously known is searching for information on the World Wide Web. There are many search engines such as Google, Bing, and Yahoo facilitating this process using a variety of advanced methods.

Most of the online digital libraries enable its users to search through their catalogs based on IR techniques. Many organizations enhance their websites with search capabilities to find documents, articles, and files of interest using keywords in the search queries. For example, the United States Patent and Trademark Office provides several ways of searching its database of patents and trademarks that it has made available to the public. In general, an IR system’s efficiency lies in its ability to match a user’s query with the most relevant documents in a corpus. To make the IR process more efficient, documents are required to be organized, metadata based on the original content of the documents. SAS Crawler is capable of pulling information from a wide variety of data sources. Documents are then processed by parsers to create various fields such as title, ID, URL, etc., which form the metadata of the documents. (See Display 1.4.) SAS Search and Indexing enables you to build indexes from these documents. Users can submit search queries on the indexes to retrieve information most relevant to the query terms. The metadata fields generated by the parsers can be used in the indexes to enable various types of functionality for querying.

Display 1.4: Overview of the IR Process with SAS Search and Indexing

Display 1.4: Overview of the IR Process with SAS Search and Indexing

Document Classification

Document classification is the process of finding commonalities in the documents in a corpus and grouping them into predetermined labels (supervised learning) based on the topical themes exhibited by the documents. Similar to the IR process, document classification (or text categorization) is an important aspect of text analytics and has numerous applications.

Some of the common applications of document classification are e-mail forwarding and spam detection, call center routing, and news articles categorization. It is not necessary that documents be assigned to mutually exclusive categories. Any restrictive approach to do so might prove to be an inefficient way of representing the information. In reality, a document can exhibit multiple themes, and it might not be possible to restrict them to only one category. SAS Text Miner contains the text topic feature, which is capable of handling these situations. It assigns a document to more than one category if needed. (See Display 1.5.) Restricting documents to only one category might be difficult for large documents, which have a greater chance of containing multiple topics or features. Topics or categories can be either automatically generated by SAS Text Miner or predefined manually based on the knowledge of the document content.

In cases where a document should be restricted to only one category, text clustering is usually a better approach instead of extracting text topics. For example, an analyst could gain an understanding of a collection of classified ads when the clustering algorithm reveals the collection actually consists of categories such as Car Sales, Real Estate, and Employment Opportunities.

Display 1.5: Text Categorization Involving Multiple Categories per Document

Display 1.5: Text Categorization Involving Multiple Categories per Document

SAS Content Categorization helps automatically categorize multilingual content available in huge volumes that is acquired or generated or that exists in It has the capability to parse, analyze, and extract content such as entities, facts, and events in a classification hierarchy. Document classification can be achieved using either SAS Content Categorization or SAS Text Miner. However, there are some fundamental differences between these two tools. The text topic extraction feature in SAS Text Miner completely relies on the quantification of terms (frequency of occurrences) and the derived weights of the terms for each document

Enjoying the preview?

Page 1 of 1

Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS

About this ebook

Dr. Goutam Chakraborty

Related authors

Related to Text Mining and Analysis

Related ebooks

Applications & Software For You

Related podcast episodes

Related articles

Related categories

Reviews for Text Mining and Analysis

What did you think?

Book preview

Text Mining and Analysis - Dr. Goutam Chakraborty

Text Mining and Analysis

Practical Methods, Examples, and Case Studies Using SAS®

Goutam Chakraborty, Murali Pagolu, Satish Garla

Contents

About This Book

About The Authors

Acknowledgments

Chapter 1 Introduction to Text Analytics

Overview of Text Analytics

Text Mining Using SAS Text Miner

Information Retrieval

Document Classification