Mastering the SAS DS2 Procedure: Advanced Data-Wrangling Techniques, Second Edition

Ebook, 408 pages, 2 hours

By Mark Jordan

About this ebook

Enhance your SAS data-wrangling skills with high-precision and parallel data manipulation using the DS2 programming language.

Now in its second edition, this book addresses the DS2 programming language from SAS, which combines the precise procedural power and control of the Base SAS DATA step language with the simplicity and flexibility of SQL. DS2 provides simple, safe syntax for performing complex data transformations in parallel and enables manipulation of native database data types at full precision. It also covers PROC FEDSQL, a modernized SQL language that blends perfectly with DS2. You will learn to harness the power of parallel processing to speed up CPU-intensive computing processes in Base SAS and how to achieve even more speed by processing DS2 programs on massively parallel database systems. Techniques for leveraging internet APIs to acquire data, avoiding large data movements when working with data from disparate sources, and leveraging DS2's new data types for full-precision numeric calculations are presented, with examples of why these techniques are essential for the modern data wrangler.

Here's what's new in this edition:

how to significantly improve performance by using the new SAS Viya architecture with its SAS Cloud Analytic Services (CAS)
how to declare private variables and methods in a package
the new PROC DSTODS2
the PCRXFIND and PCRXREPLACE packages

While working through the code samples provided with this book, you will build a library of custom, reusable, and easily shareable DS2 program modules, execute parallelized DATA step programs to speed up a CPU-intensive process, and conduct advanced data transformations using hash objects and matrix math operations.

This book is part of the SAS Press Series.

Language: English

Publisher: SAS Institute

Release date: Mar 23, 2018

ISBN: 9781635266061

Author

Mark Jordan

Mark Jordan is Head of Library Systems at Simon Fraser University, Canada, and has published widely.

byData Engineering Podcast
0 ratings
0% found this document useful
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
Podcast episode
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
byData Engineering Podcast
0 ratings
0% found this document useful
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
Podcast episode
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
byData Engineering Podcast
0 ratings
0% found this document useful
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Podcast episode
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
byData Engineering Podcast
0 ratings
0% found this document useful
341: U-NAS-ification: FreeBSD on Power, DragonflyBSD 5.8 is here, Unifying FreeNAS/TrueNAS, OpenBSD vs. Prometheus and Go, gcc 4.2.1 removed from FreeBSD base, and more.
Podcast episode
341: U-NAS-ification: FreeBSD on Power, DragonflyBSD 5.8 is here, Unifying FreeNAS/TrueNAS, OpenBSD vs. Prometheus and Go, gcc 4.2.1 removed from FreeBSD base, and more.
byBSD Now
0 ratings
0% found this document useful
302: Contention Reduction: DragonFlyBSD's kernel optimizations pay off, differences between OpenBSD and Linux, NetBSD 2019 Google Summer of Code project list, Reducing that contention, fnaify 1.3 released, vmctl(8): CLI syntax changes, and things that Linux distributions should not do when packaging.
Podcast episode
302: Contention Reduction: DragonFlyBSD's kernel optimizations pay off, differences between OpenBSD and Linux, NetBSD 2019 Google Summer of Code project list, Reducing that contention, fnaify 1.3 released, vmctl(8): CLI syntax changes, and things that Linux distributions should not do when packaging.
byBSD Now
0 ratings
0% found this document useful
Unlocking Your dbt Projects With Practical Advice For Practitioners: The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
Podcast episode
Unlocking Your dbt Projects With Practical Advice For Practitioners: The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
byData Engineering Podcast
0 ratings
0% found this document useful
#08 - Tech stack: Metabase, Superset, Redash, Grafana
Podcast episode
#08 - Tech stack: Metabase, Superset, Redash, Grafana
byTOPP - The Open Podcast Podcast
0 ratings
0% found this document useful
Mark Windholtz on Domain-Driven Design (DDD): Today we invite Mark Windholtz from Agile DNA to talk about how domain-driven design and extreme programming can help bridge the gap between development and business.
Podcast episode
Mark Windholtz on Domain-Driven Design (DDD): Today we invite Mark Windholtz from Agile DNA to talk about how domain-driven design and extreme programming can help bridge the gap between development and business.
byElixir Wizards
0 ratings
0% found this document useful
Designing Data Transfer Systems That Scale: The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
Podcast episode
Designing Data Transfer Systems That Scale: The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
byData Engineering Podcast
0 ratings
0% found this document useful
Build Better Tests For Your dbt Projects With Datafold And data-diff: Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
Podcast episode
Build Better Tests For Your dbt Projects With Datafold And data-diff: Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
byData Engineering Podcast
0 ratings
0% found this document useful



Mastering the SAS DS2 Procedure - Mark Jordan

Chapter 1: Getting Started

1.1 Introduction

1.1.1 What is DS2?

1.1.2 Traditional SAS DATA Step versus DS2

1.1.3 What to Expect from This Book

1.1.4 Prerequisite Knowledge

1.2 Accessing SAS and Setting Up for Practice

1.1 Introduction

Today’s data scientists deal with ever larger data sets from a widening variety of data sources, and the computations required to process that data are continually becoming more complex. As SAS has been modernized with each new release, most SAS procedures (PROCs) have been rewritten to be thread-enabled, allowing them to use multiple CPUs on a single computer or even to push processing into massively parallel processing (MPP) computing environments such as Teradata, Hadoop, or the SAS High-Performance Analytics grid. But the DATA step, with its sequential, observation-by-observation approach to data manipulation, has remained stubbornly single-threaded.

In the summer of 2013, SAS released SAS 9.4, which included a revolutionary new programming language named DS2. Each subsequent maintenance release of SAS 9.4 has added new features and functionality to the DS2 language. The second edition of this book expands coverage of the DS2 language to include several of these new features, and provides more in-depth coverage of features that have proven to be extraordinarily useful over the last couple of years.

1.1.1 What is DS2?

DS2 is basically DATA step programming redesigned from the ground up with several important goals. I like to think of DS2 as a language that combines the power and control of the Base SAS DATA step programming language with the simplicity of SQL and throws in just enough object-oriented features to make simple, reusable code modules a reality. With DS2 you can perform extremely complex data manipulation and transformation by writing intuitive, succinct, and compact programs. It always amazes me how much you can accomplish with just a little code!

Here is what DS2 can do:

● natively process American National Standards Institute (ANSI) SQL data types for better integration with external data stores

● provide modern structured programming constructs, making it simple to extend the functionality of the DS2 language with reusable code modules

● tightly integrate with SQL

● provide simple, safe syntax for multi-threaded processing to accelerate CPU-intensive tasks
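As a minimal sketch of two of these capabilities (a reusable user-defined method and SQL embedded directly in a data program), consider the program below. The method name in_to_cm and the variable height_cm are hypothetical names chosen for illustration, and the example assumes that the SASHELP.CLASS sample table is available in your SAS session.

```sas
proc ds2;
data _null_;
   /* Global variable, declared outside any method */
   dcl double height_cm;

   /* A user-defined method: a small, reusable code module.        */
   /* It is defined before RUN so that RUN can call it.            */
   method in_to_cm(double inches) returns double;
      return inches * 2.54;
   end;

   /* RUN executes once for every row returned by the SET statement */
   method run();
      /* An SQL query embedded in the SET statement, inside braces  */
      set {select name, height from sashelp.class where age >= 15};
      height_cm = in_to_cm(height);
      put name= height_cm=;
   end;
enddata;
run;
quit;
```

The query in braces filters and subsets the data before any row reaches the data program, and the conversion logic lives in one method that can be called from anywhere in the program.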

SAS 9.4 maintenance release 5 (SAS 9.4M5) arrived in 2017. By then the DS2 language had matured further, incorporating new features and functionality. Here are a few of the features that made me decide to write a second edition of this book:

● PROC DSTODS2—a new procedure to help you convert existing DATA step programs to DS2 data programs.

● Addition of some useful global statements to PROC DS2.

● Support for the OF keyword when using variable array references as variable lists.

● Better CONNECT string documentation, and simpler syntax to limit which libraries DS2 connects to upon invocation.

● The new MERGE statement (which is very different from a DATA step MERGE…).

● The new PCRX packages, which allow using regular expressions to process text with better performance than when using PRX functions.

● A new LIBNAME engine for JSON files. It makes reading JSON easier in traditional DATA step programs than using the JSON package in DS2, although it is not as flexible.

1.1.2 Traditional SAS DATA Step versus DS2

If you have a SAS/ACCESS license for a supported database management system (DBMS), the traditional SAS DATA step can process DBMS data, but native data types are first translated to SAS 8-byte floating-point numeric or fixed-width character data types by the LIBNAME engine. This causes a loss of precision when dealing with higher-precision ANSI numeric data types, such as BIGINT or DECIMAL. DS2 is capable of directly manipulating ANSI data types—including multi-byte character types—even when processing on the SAS compute platform. Figure 1.1 compares and contrasts traditional Base SAS DATA step processing with DS2 data program processing, illustrated by a basic example.

Figure 1.1: Traditional DATA Step Processing versus DS2 Data Program Processing

As you can see in Figure 1.1, when the traditional SAS DATA step accesses DBMS data via the SAS/ACCESS engine using a LIBNAME statement, DBMS data types are automatically converted to fixed-width character or double-precision, floating-point numeric. In contrast, the DS2 data program accesses the DBMS data via a special driver that is associated with SAS/ACCESS software and can therefore process the data in its native data type.
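The precision loss described above can be illustrated with a short sketch that needs no DBMS. It stores the integer 2**53 + 1, which an 8-byte floating-point number cannot represent exactly, in a BIGINT variable and then copies it through a DOUBLE. The variable names are hypothetical, and the sketch assumes that DS2 treats the unsuffixed integer literal as a BIGINT constant.

```sas
proc ds2;
data _null_;
   dcl bigint exact;        /* ANSI 64-bit integer                   */
   dcl double as_double;    /* same type as a SAS numeric variable   */
   dcl bigint round_trip;   /* the value after passing through DOUBLE */

   method init();
      exact = 9007199254740993;   /* 2**53 + 1 (assumed BIGINT literal) */
      as_double = exact;          /* what a LIBNAME engine would hand   */
                                  /* to the traditional DATA step       */
      round_trip = as_double;
      put exact= round_trip=;
      if round_trip ne exact then
         put 'Value changed when stored as DOUBLE';
   end;
enddata;
run;
quit;
```

If the assumption about the literal holds, the two values written to the log should differ in the last digit, because the nearest representable DOUBLE is 2**53.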

The SAS DATA step is essentially a data-driven loop: reading, manipulating, and writing out one observation at a time. If the process is computationally complex, it can easily become CPU bound; that is, data can be read into memory faster than it can be processed. If the DATA step elapsed (clock) time in the SAS log is about the same as the CPU time, your process is most likely CPU bound. DS2 can accelerate CPU-bound processing by processing data rows in parallel using DS2 threads. Figure 1.2 contrasts traditional SAS DATA step, single-threaded processing with multi-threaded processing using DS2 thread and data programs.

Figure 1.2: Serial Processing in the DATA Step versus DS2 Parallel Processing

As you can see in Figure 1.2, the traditional SAS DATA step must process each row of data sequentially. By contrast, DS2 can use thread and data programs to process multiple rows of data simultaneously.

Notice that both processes use a single read thread. When processing in Base SAS, DS2 uses one read thread to feed multiple compute threads, which ensures that each row of data is distributed to only one compute thread for processing. Consequently, if the bottleneck is getting data from disk or the DBMS into memory on the SAS compute platform, a situation referred to as an I/O-bound operation, threaded processing on the SAS compute platform is unlikely to improve overall performance.
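The division of labor between a thread program and a data program might look like the following sketch. The table names (work.big_input, work.scored), the thread name score_th, the calculation, and the thread count of 4 are all hypothetical values chosen for illustration. A traditional DATA step creates the sample input first so that the example is self-contained.

```sas
/* Hypothetical sample data, created with a traditional DATA step */
data work.big_input;
   do id = 1 to 100000;
      x = rand('uniform') * 100;
      output;
   end;
run;

proc ds2;
/* Thread program: the logic that each compute thread executes */
thread score_th / overwrite=yes;
   dcl double score;
   method run();
      set work.big_input;
      /* Stand-in for a CPU-intensive calculation */
      score = sqrt(x) * log(x + 1);
   end;
endthread;
run;

/* Data program: launches the threads and collects their results */
data work.scored / overwrite=yes;
   dcl thread score_th t;
   method run();
      set from t threads=4;
   end;
enddata;
run;
quit;
```

Because rows are handed to whichever thread is free, the row order in work.scored is not guaranteed to match the input order. A calculation this light is unlikely to be CPU bound, so the sketch shows the structure rather than a realistic speedup.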

However, because today’s DBMS data is enormous, data movement should be completely avoided whenever possible. With DS2, in a properly provisioned and configured SAS installation that includes the SAS In-Database Code Accelerator, your DS2 programs can actually execute on the database hardware in the SAS Embedded Process without having to move data to the SAS compute platform at all. Figure 1.3 compares DS2 data program threaded processing on the SAS compute platform to in-database processing with DS2 and the SAS In-Database Code Accelerator.

Figure 1.3: Parallel Processing with Threads: SAS Compute Platform versus In-Database

As you can see in Figure 1.3, using DS2 thread and data programs with the SAS In-Database Code Accelerator enables the DS2 code to compile and execute on the massively parallel DBMS hardware. If the process reads from a DBMS table and also writes to a DBMS table, then only the code goes into the DBMS, and only the SAS log comes out. All processing takes place in the DBMS. This concept of “taking the code to the data” instead of the traditional “bringing the data to the code” greatly reduces the amount of data movement that is required for processing. It also extends the computational capabilities of the DBMS to include SAS functions and processing logic, and takes full advantage of the massively parallel processing (MPP) capabilities of the DBMS. If you are a SAS programmer or a data scientist in an environment that includes SAS, you will find that DS2 quickly becomes a must-have tool for data manipulation.

This edition wouldn’t be complete without a discussion of the new SAS® Viya® architecture with its SAS Cloud Analytic Services (CAS). CAS distributed processing works a lot like in-database processing, with two significant advantages:

1. Enormous data sets can be persisted in memory. This means that, once loaded, subsequent passes on the data can be made without having to reload the data from off-line storage. If the data is too large to load into memory all at once, you don’t need to modify your program—CAS handles the complexity behind the scenes to maximize throughput.

2. With in-database processing using the SAS In-Database Code Accelerator, only data that resides in the database is eligible for distributed processing on the database hardware. CAS can access data from a wide variety of sources through native direct access or external source data connectors.

DS2 running in CAS enables in-database-style parallel processing, but with the ability to source data from a variety of data stores, while the in-memory capabilities minimize physical I/O operations, as shown in Figure 1.4.

Figure 1.4: Parallel Processing with Threads: In-Database versus CAS

1.1.3 What to Expect from This Book

Data wrangling, as I use the term, is more than just cleaning up data to prepare it for analysis. A data wrangler acquires data from diverse sources, then structures, organizes, and combines it in unique ways to facilitate analysis and obtain new insights. This book teaches you to wrangle data using DS2, highlighting the similarities and differences between DS2 data programs and traditional DATA step processing, as well as leveraging DS2’s parallel-processing power to boost your data-wrangling speed.

Here is what you will be able to do after you finish reading this book:

● identify the types of processes for which the language was designed and understand the conditions indicating that DS2 is a good choice when attempting to improve the performance of existing DATA step processes

● identify which programming statements and functions are shared between DATA step and DS2 data programs

● identify the DATA step functionality not available in DS2 and understand why it was not included in the DS2 language

● know what new DS2 program functionality is not available in the traditional DATA step

● directly manipulate ANSI data types in a DS2 program

● understand the implications of handling data that contains both SAS missing and ANSI null values in the same process

● convert a Base SAS data manipulation process from DATA step to a DS2 data program

● understand the DS2 system methods and how they relate to traditional DATA step programming constructs

● create custom DS2 methods, extending the functionality of the DS2 language

● store custom DS2 methods in packages and reuse them in subsequent DS2 programs

● use DS2 packages to create object-oriented programs

● use predefined DS2 packages to add extra functionality to your DS2 data programs

● create DS2 thread programs and execute them from a DS2 data program for parallel processing of data records

● use BY-group and FIRST.variable and LAST.variable processing in a DS2 data program or thread to perform custom data summarizations, without requiring a presort of the data

● determine whether your system has the capability to execute DS2 programs in-database and, if so, execute your DS2 thread programs in parallel, fully distributed on an MPP DBMS platform
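As a small preview of one item from this list, the following sketch summarizes data by group with FIRST.variable and LAST.variable processing, without running PROC SORT first. The table work.sales, its columns, and its values are hypothetical sample data created only for this illustration.

```sas
/* Hypothetical sample data, deliberately left unsorted */
data work.sales;
   input region $ amount;
   datalines;
West 100
East 250
West 50
East 25
;
run;

proc ds2;
data work.region_totals / overwrite=yes;
   dcl double total;
   retain total;            /* keep the running total across rows    */
   keep region total;
   method run();
      set work.sales;
      by region;            /* no PROC SORT beforehand               */
      if first.region then total = 0;
      total = total + amount;
      if last.region then output;   /* one row per region            */
   end;
enddata;
run;
quit;
```

The explicit assignment total = total + amount is used in place of the DATA step sum statement, and the result is expected to contain one row per region.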

1.1.4 Prerequisite Knowledge

This book was written with the seasoned Base SAS programmer in mind. You can acquire the prerequisite knowledge from other SAS Press books, such as An Introduction to SAS University Edition by Ron Cody or The Little SAS Book: A Primer by Lora Delwiche and Susan Slaughter.

Before diving in, you’ll want to be familiar with the following key concepts:

● DATA step programming, in general

● SAS libraries

● accessing data with a LIBNAME statement

● reading and writing SAS data sets

● the role of the program data vector (PDV) in DATA step processing

● conditional processing techniques

● arrays

● iterative processing (DO loops)

● macro processing, in general

● assigning values to macro variables

● resolving macro variables in SAS code

● timing of macro process execution versus execution of other SAS code

● SQL joins

1.2 Accessing SAS and Setting Up for Practice

If you do not currently have access to SAS software, you can use the robust learning community online known as SAS Analytics U. From the SAS Analytics U website, you can download a free, up-to-date, and fully functional copy of SAS University Edition, which is provided as a virtual machine (VM). The SAS University Edition VM includes a completely installed, configured, well-provisioned SAS server. The examples in this book were all created and executed using SAS University Edition, with the exception of the sections requiring DBMS access. You can get your own free copy of SAS University Edition at http://go.sas.com/free_sas.

Getting Ready to Practice

1. Download the ZIP file containing the data for this book from http://support.sas.com/jordan.

2. Unzip the files to a location available to SAS. If you are using SAS University Edition, the shared folder you designated when setting up your SAS environment is a good location for these files.

3. In SAS, open the program _setup.sas, follow the directions in the program comments to modify the code for your SAS environment, and then submit the program. You will need to run this program only once.

You are now ready to run the sample programs that are included with this book.

If you exit SAS between study sessions, it is easy to return. When you start SAS again, just run the program named libnames.sas in order to re-establish your connection to the appropriate SAS libraries before working with the other programs from this book. As an aside, if you have difficulty re-establishing your SAS library connections with the libnames.sas program, there is no harm in rerunning _setup.sas. It just takes a little longer to
