Scala for Data Science
About This Book
- A complete guide for scalable data science solutions, from data ingestion to data visualization
- Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations
- Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided
Who This Book Is For
If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.
What You Will Learn
- Transform and filter tabular data to extract features for machine learning
- Implement your own algorithms or take advantage of MLlib’s extensive suite of models to build distributed machine learning pipelines
- Read, transform, and write data to both SQL and NoSQL databases in a functional manner
- Write robust routines to query web APIs
- Read data from web APIs such as the GitHub or Twitter API
- Use Scala to interact with MongoDB, which offers high performance and flexible storage for large datasets with evolving query requirements
- Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
- Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive
In Detail
Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. While languages such as R, Python, and Java are widely used for data science, Scala is particularly good at analyzing large sets of data without any significant impact on performance, and is therefore being adopted by many developers and data scientists. Data scientists know that building truly scalable applications is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines.
This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.
Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. Building on Scala’s emphasis on functional structures and immutability, you will learn how to choose the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.
This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.
Style and approach
A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straight away.
Table of Contents
Scala for Data Science
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Installing the JDK
Installing and using SBT
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
eBooks, discount offers, and more
Questions
1. Scala and Data Science
Data science
Programming in data science
Why Scala?
Static typing and type inference
Scala encourages immutability
Scala and functional programs
Null pointer uncertainty
Easier parallelism
Interoperability with Java
When not to use Scala
Summary
References
2. Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
Vectors
Dense and sparse vectors and the vector trait
Matrices
Building vectors and matrices
Advanced indexing and slicing
Mutating vectors and matrices
Matrix multiplication, transposition, and the orientation of vectors
Data preprocessing and feature engineering
Breeze – function optimization
Numerical derivatives
Regularization
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References
3. Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary
4. Parallel Collections and Futures
Parallel collections
Limitations of parallel collections
Error handling
Setting the parallelism level
An example – cross-validation with parallel collections
Futures
Future composition – using a future's result
Blocking until completion
Controlling parallel execution with execution contexts
Futures example – stock price fetcher
Summary
References
5. Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
Connecting to a database server
Creating tables
Inserting data
Reading data
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the pimp my library pattern
Wrapping result sets in a stream
Looser coupling with type classes
Type classes
Coding against type classes
When to use type classes
Benefits of type classes
Creating a data access layer
Summary
References
6. Slick – A Functional Interface for SQL
FEC data
Importing Slick
Defining the schema
Connecting to the database
Creating tables
Inserting data
Querying data
Invokers
Operations on columns
Aggregations with Group by
Accessing database metadata
Slick versus JDBC
Summary
References
7. Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
JSON4S types
Extracting fields using XPath
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
HTTP – a whirlwind overview
Adding headers to HTTP requests in Scala
Summary
References
8. Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Connecting with authentication
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References
9. Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References
10. Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
RDDs are immutable
RDDs are lazy
RDDs know their lineage
RDDs are resilient
RDDs are distributed
Transformations and actions on RDDs
Persisting RDDs
Key-value RDDs
Double RDDs
Building and running standalone programs
Running Spark applications locally
Reducing logging output and Spark configuration
Running Spark applications on EC2
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference
11. Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Structs
Arrays
Maps
Interacting with data sources
JSON files
Parquet files
Standalone programs
Summary
References
12. Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Transformers
Estimators
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References
13. Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Composing the response
Understanding and parsing the request
Interacting with JSON
Querying external APIs and consuming JSON
Calling external web services
Parsing JSON
Asynchronous actions
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References
14. Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Designing the model
The event bus
AJAX calls through JQuery
Response views
Drawing plots with NVD3
Summary
References
A. Pattern Matching and Extractors
Pattern matching in for comprehensions
Pattern matching internals
Extracting sequences
Summary
Reference
Index
Scala for Data Science
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2016
Production reference: 1220116
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-137-2
www.packtpub.com
Credits
Author
Pascal Bugnion
Reviewers
Umanga Bista
Radek Ostrowski
Yuanhang Wang
Commissioning Editor
Veena Pagare
Acquisition Editor
Sonali Vernekar
Content Development Editor
Shali Deeraj
Technical Editor
Suwarna Patil
Copy Editor
Tasneem Fatehi
Project Coordinator
Sanchita Mandal
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Graphics
Disha Haria
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Pascal Bugnion is a data engineer at the ASI, a consultancy offering bespoke data science services. Previously, he was the head of data engineering at SCL Elections. He holds a PhD in computational physics from Cambridge University.
Besides Scala, Pascal is a keen Python developer. He has contributed to NumPy, matplotlib and IPython. He also maintains scikit-monaco, an open source library for Monte Carlo integration. He currently lives in London, UK.
I owe a huge debt of gratitude to my parents and my partner for supporting me in this, as well as my employer for encouraging me to pursue this project. I also thank the reviewers, Umanga Bista, Yuanhang Wang, and Radek Ostrowski for their tireless efforts, as well as the entire team at Packt for their support, advice, and hard work carrying this book to completion.
About the Reviewers
Umanga Bista is a machine learning and real-time analytics enthusiast from Kathmandu. He completed his bachelor's degree in computer engineering in September 2013. Since then, he has been working at LogPoint, a SIEM product company. He primarily works on building statistical plugins and real-time, scalable, and fault-tolerant architectures to process multi-terabyte-scale log data streams for security analytics, intelligence, and compliance.
Radek Ostrowski is a freelance big data engineer with an educational background in high-performance computing. He specializes in building scalable real-time data collection and predictive analytics platforms. He worked for many years on data-related projects at EPCC, University of Edinburgh. Additionally, he has contributed to the success of a games startup, deltaDNA, co-built a super-scalable backend for PlayStation 4 at Sony, helped to improve data processes at Expedia, and started a Docker revolution at Tesco Bank. He is currently working with Spark and Scala for Max2 Inc., an NYC-based startup that is building a community-powered venue discovery platform offering personalized recommendations and curated, real-time information.
Yuanhang Wang is a data scientist whose primary focus is DSL design. He has dabbled in several functional programming languages, and is particularly interested in machine learning and programming language theory. He is currently a data scientist at China Mobile Research Center, working on a typed data processing engine and optimizer built on top of several big data platforms.
Yuanhang Wang describes himself as an enthusiast of purely functional programming and neural networks. He obtained master's degrees from both Harbin Institute of Technology, China, and the University of Pavia, Italy.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
To my parents.
To Jessica and to my friends.
Preface
Data science is fashionable. Data science startups are sprouting across the globe and established companies are scrambling to assemble data science teams. The ability to analyze large datasets is also becoming increasingly important in the academic and research world.
Why this explosion in demand for data scientists? Our view is that the emergence of data science can be viewed as the serendipitous collusion of several interlinked factors. The first is data availability. Over the last fifteen years, the amount of data collected by companies has exploded. In the world of research, cheap gene sequencing techniques have drastically increased the amount of genomic data available. Social and professional networking sites have built huge graphs interlinking a significant fraction of the people living on the planet. At the same time, the development of the World Wide Web makes accessing this wealth of data possible from almost anywhere in the world.
The increased availability of data has resulted in an increase in data awareness. It is no longer acceptable for decision makers to trust their experience and gut feeling alone. Increasingly, one expects business decisions to be driven by data.
Finally, the tools for efficiently making sense of and extracting insights from huge data sets are starting to mature: one doesn't need to be an expert in distributed computing to analyze a large data set any more. Apache Spark, for instance, greatly eases writing distributed data analysis applications. The explosion of cloud infrastructure facilitates scaling computing needs to cope with variable data amounts.
Scala is a popular language for data science. By emphasizing immutability and functional constructs, Scala lends itself well to the construction of robust libraries for concurrency and big data analysis. A rich ecosystem of tools for data science has therefore developed around Scala, including libraries for accessing SQL and NoSQL databases, frameworks such as Apache Spark for building distributed applications, and libraries for linear algebra and numerical algorithms. We will explore this rich and growing ecosystem in the fourteen chapters of this book.
What this book covers
We aim to give you a flavor for what is possible with Scala, and to get you started using libraries that are useful for building data science applications. We do not aim to provide an entirely comprehensive overview of any of these topics. This is best left to online documentation or to reference books. What we will teach you is how to combine these tools to build efficient, scalable programs, and have fun along the way.
Chapter 1, Scala and Data Science, is a brief description of data science, and of Scala's place in the data scientist's tool-belt. We describe why Scala is becoming increasingly popular in data science, and how it compares to alternative languages such as Python.
Chapter 2, Manipulating Data with Breeze, introduces Breeze, a library providing support for numerical algorithms in Scala. We learn how to perform linear algebra and optimization, and solve a simple machine learning problem using logistic regression.
Chapter 3, Plotting with breeze-viz, introduces the breeze-viz library for plotting two-dimensional graphs and histograms.
Chapter 4, Parallel Collections and Futures, describes basic concurrency constructs. We will learn to parallelize simple problems by distributing them over several threads using parallel collections, and apply what we have learned to build a parallel cross-validation pipeline. We then describe how to wrap computation in a future to execute it asynchronously. We apply this pattern to query a web API, sending several requests in parallel.
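As a minimal taste of the constructs covered in that chapter (my own sketch, not the book's code), converting a collection to its parallel counterpart distributes transformations over several threads with a one-word change:

```scala
// Sketch: parallel collections in Scala 2.11. Calling .par on a
// collection returns a parallel version whose map/filter/reduce
// operations run on a thread pool.
object ParallelTaste extends App {
  val numbers = (1 to 100).toList

  val sequential = numbers.map(x => x * x).sum     // runs on one thread
  val parallel   = numbers.par.map(x => x * x).sum // distributed over threads

  // Both give the same result; only the execution strategy differs.
  assert(sequential == parallel)
  println(parallel) // 338350
}
```

The chapter explains when this naive approach is safe, and when you need futures or execution contexts instead.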
Chapter 5, Scala and SQL through JDBC, looks at interacting with SQL databases in a functional manner. We learn how to use common Scala patterns to wrap the Java interface exposed by JDBC. Besides learning about JDBC, this chapter introduces type classes, the loan pattern, implicit conversions, and other patterns that are frequently leveraged in libraries and existing Scala code.
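For instance, the loan pattern mentioned here can be sketched as follows (a hypothetical minimal version of my own; the chapter develops a fuller one for JDBC connections):

```scala
// Sketch of the loan pattern: a helper "loans" a resource to a
// function and guarantees the resource is closed afterwards, even
// if the function throws.
object LoanSketch extends App {
  def using[A <: AutoCloseable, B](resource: A)(f: A => B): B =
    try f(resource) finally resource.close()

  // Example with a Reader standing in for a JDBC connection:
  import java.io.StringReader
  val firstChar = using(new StringReader("hello")) { reader =>
    reader.read().toChar
  }
  println(firstChar) // h
}
```

The same shape, `using(connection) { conn => ... }`, keeps resource management out of the query logic.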
Chapter 6, Slick - A Functional Interface for SQL, describes the Slick library for mapping data in SQL tables to Scala objects.
Chapter 7, Web APIs, describes how to query web APIs in a concurrent, fault-tolerant manner using futures. We learn to parse JSON responses and formulate complex HTTP requests with authentication. We walk through querying the GitHub API to obtain information about GitHub users programmatically.
Chapter 8, Scala and MongoDB, walks the reader through interacting with MongoDB, a leading NoSQL database. We build a pipeline that fetches user data from the GitHub API and stores it in a MongoDB database.
Chapter 9, Concurrency with Akka, introduces the Akka framework for building concurrent applications with actors. We use Akka to build a scalable crawler that explores the GitHub follower graph.
Chapter 10, Distributed Batch Processing with Spark, explores the Apache Spark framework for building distributed applications. We learn how to construct and manipulate distributed datasets in memory. We touch briefly on the internals of Spark, learning how the architecture allows for distributed, fault-tolerant computation.
Chapter 11, Spark SQL and DataFrames, describes DataFrames, one of the more powerful features of Spark for the manipulation of structured data. We learn how to load JSON and Parquet files into DataFrames.
Chapter 12, Distributed Machine Learning with MLlib, explores how to build distributed machine learning pipelines with MLlib, a library built on top of Apache Spark. We use the library to train a spam filter.
Chapter 13, Web APIs with Play, describes how to use the Play framework to build web APIs. We describe the architecture of modern web applications, and how these fit into the data science pipeline. We build a simple web API that returns JSON.
Chapter 14, Visualization with D3 and the Play Framework, builds on the previous chapter to program a fully fledged web application with Play and D3. We describe how to integrate JavaScript into a Play framework application.
Appendix, Pattern Matching and Extractors, describes how pattern matching provides the programmer with a powerful construct for control flow.
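To give a flavor of this construct (a toy example of my own, not the appendix's code), pattern matching deconstructs a value and directs control flow in a single step:

```scala
// Sketch: pattern matching on a list's structure.
object MatchSketch extends App {
  def describe(xs: List[Int]): String = xs match {
    case Nil          => "empty list"
    case head :: Nil  => s"one element: $head"
    case head :: tail => s"starts with $head, then ${tail.length} more"
  }

  println(describe(Nil))           // empty list
  println(describe(List(7)))       // one element: 7
  println(describe(List(1, 2, 3))) // starts with 1, then 2 more
}
```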
What you need for this book
The examples provided in this book require that you have a working Scala installation and SBT, the Simple Build Tool, a command line utility for compiling and running Scala code. We will walk you through how to install these in the next sections.
We do not require a specific IDE. The code examples can be written in your favorite text editor or IDE.
Installing the JDK
Scala code is compiled to Java byte code. To run the byte code, you must have the Java Virtual Machine (JVM) installed, which comes as part of a Java Development Kit (JDK). There are several JDK implementations and, for the purpose of this book, it does not matter which one you choose. You may already have a JDK installed on your computer. To check this, enter the following in a terminal:
$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
If you do not have a JDK installed, you will get an error stating that the java command does not exist.
If you do have a JDK installed, you should still verify that you are running a sufficiently recent version. The number that matters is the minor version number: the 8 in 1.8.0_66. Versions 1.8.xx of Java are commonly referred to as Java 8. For the first twelve chapters of this book, Java 7 will be sufficient (your version number should be something like 1.7.xx or newer). However, you will need Java 8 for the last two chapters, since the Play framework requires it. We therefore recommend that you install Java 8.
On Mac, the easiest way to install a JDK is using Homebrew:
$ brew install java
This will install Java 8, specifically the Java Standard Edition Development Kit, from Oracle.
Homebrew is a package manager for Mac OS X. If you are not familiar with Homebrew, I highly recommend using it to install development tools. You can find installation instructions for Homebrew on: http://brew.sh.
To install a JDK on Windows, go to http://www.oracle.com/technetwork/java/javase/downloads/index.html (or, if this URL does not exist, to the Oracle website, then click on Downloads and download Java Platform, Standard Edition). Select Windows x86 for 32-bit Windows, or Windows x64 for 64 bit. This will download an installer, which you can run to install the JDK.
To install a JDK on Ubuntu, install OpenJDK with the package manager for your distribution:
$ sudo apt-get install openjdk-8-jdk
If you are running a sufficiently old version of Ubuntu (14.04 or earlier), this package will not be available. In this case, either fall back to openjdk-7-jdk, which will let you run examples in the first twelve chapters, or install the Java Standard Edition Development Kit from Oracle through a PPA (a non-standard package archive):
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
You then need to tell Ubuntu to prefer Java 8 with:
$ sudo update-java-alternatives -s java-8-oracle
Installing and using SBT
The Simple Build Tool (SBT) is a command line tool for managing dependencies and building and running Scala code. It is the de facto build tool for Scala. To install SBT, follow the instructions on the SBT website (http://www.scala-sbt.org/0.13/tutorial/Setup.html).
When you start a new SBT project, SBT downloads a specific version of Scala for you. You, therefore, do not need to install Scala directly on your computer. Managing the entire dependency suite from SBT, including Scala itself, is powerful: you do not have to worry about developers working on the same project having different versions of Scala or of the libraries used.
Since we will use SBT extensively in this book, let's create a simple test project. If you have used SBT previously, feel free to skip this section.
Create a new directory called sbt-example and navigate to it. Inside this directory, create a file called build.sbt. This file encodes all the dependencies for the project. Write the following in build.sbt:
// build.sbt
scalaVersion := "2.11.7"
This specifies which version of Scala we want to use for the project. Open a terminal in the sbt-example directory and type:
$ sbt
This starts an interactive shell. Let's open a Scala console:
> console
This gives you access to a Scala console in the context of your project:
scala> println("Scala is running!")
Scala is running!
Besides running code in the console, we will also write Scala programs. Open an editor in the sbt-example directory and enter a basic "hello, world" program. Name the file HelloWorld.scala:
// HelloWorld.scala
object HelloWorld extends App {
  println("Hello, world!")
}
Return to SBT and type:
> run
This will compile the source files and run the executable, printing "Hello, world!".
Besides compiling and running your Scala code, SBT also manages Scala dependencies. Let's specify a dependency on Breeze, a library for numerical algorithms. Modify the build.sbt file as follows:
// build.sbt
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)
SBT requires that statements be separated by empty lines, so make sure that you leave an empty line between scalaVersion and libraryDependencies. In this example, we have specified a dependency on Breeze version 0.11.2. How did we know to use these coordinates for Breeze? Most Scala packages will quote the exact SBT string to get the latest version in their documentation.
If this is not the case, or you are specifying a dependency on a Java library, head to the Maven Central website (http://mvnrepository.com) and search for the package of interest, for example, "breeze". The website provides a list of packages, including several named breeze_2.xx. The number after the underscore indicates the version of Scala the package was compiled for. Click on breeze_2.11 to get a list of the different Breeze versions available. Choose 0.11.2. You will be presented with a list of package managers to choose from (Maven, Ivy, Leiningen, and so on). Choose SBT. This will print a line like:
libraryDependencies += "org.scalanlp" % "breeze_2.11" % "0.11.2"
These are the coordinates that you will want to copy to the build.sbt file. Note that we just specified "breeze", rather than "breeze_2.11". By preceding the package name with two percent signs, %%, SBT automatically resolves to the correct Scala version. Thus, specifying %% "breeze" is identical to % "breeze_2.11".
Now return to your SBT console and run:
> reload
This will fetch the Breeze jars from Maven Central. You can now import Breeze in either the console or your scripts (within the context of this Scala project). Let's test this in the console:
> console
scala> import breeze.linalg._
import breeze.linalg._

scala> import breeze.numerics._
import breeze.numerics._

scala> val vec = linspace(-2.0, 2.0, 100)
vec: breeze.linalg.DenseVector[Double] = DenseVector(-2.0, -1.9595959595959596, ...

scala> sigmoid(vec)
breeze.linalg.DenseVector[Double] = DenseVector(0.11920292202211755, 0.12351078065 ...
You should now be able to compile, run, and specify dependencies for your Scala scripts.
Who this book is for
This book introduces the data science ecosystem for people who already know some Scala. If you are a data scientist or a data engineer, or if you want to enter the field of data science, this book will give you all the tools you need to implement data science solutions in Scala.
For the avoidance of doubt, let me also clarify what this book is not:
This is not an introduction to Scala. We assume that you already have a working knowledge of the language. If you do not, we recommend Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners.
This is not a book about machine learning in Scala. We will use machine learning to illustrate the examples, but the aim is not to teach you how to write your own gradient-boosted tree class. Machine learning is just one (important) part of data science, and this book aims to cover the full pipeline, from data acquisition to data visualization. If you are specifically interested in implementing machine learning solutions in Scala, I recommend Scala for Machine Learning, by Patrick R. Nicolas.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, and user input are shown as follows: We can import modules with the import statement.
A block of code is set as follows:
def occurrencesOf[A](elem: A, collection: List[A]): List[Int] = {
  for {
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
def occurrencesOf[A](elem: A, collection: List[A]): List[Int] = {
  for {
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}
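To see what the occurrencesOf function used in these style examples actually does, here it is again in a small self-contained program with a sample call (the wrapping object is ours, added so the snippet compiles on its own):

```scala
object OccurrencesDemo {
  // Returns the indices at which elem occurs in collection
  def occurrencesOf[A](elem: A, collection: List[A]): List[Int] = {
    for {
      (currentElem, index) <- collection.zipWithIndex
      if (currentElem == elem)
    } yield index
  }

  def main(args: Array[String]): Unit = {
    // 'a' occurs at positions 0, 2, and 4
    println(occurrencesOf('a', List('a', 'b', 'a', 'c', 'a'))) // List(0, 2, 4)
  }
}
```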
Any command-line input or output is written as follows:
scala> val nTosses = 100
nTosses: Int = 100

scala> def trial = (0 until nTosses).count { i =>
  util.Random.nextBoolean() // count the number of heads
}
trial: Int
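The trial snippet above can also be wrapped into a small standalone program, for instance to average the number of heads over many trials (the object name and the nTrials parameter are our own, for illustration):

```scala
import scala.util.Random

object CoinTosses {
  val nTosses = 100

  // One trial: count the number of heads in nTosses fair coin flips
  def trial: Int = (0 until nTosses).count { _ => Random.nextBoolean() }

  def main(args: Array[String]): Unit = {
    val nTrials = 1000
    val meanHeads = (0 until nTrials).map(_ => trial).sum.toDouble / nTrials
    // For a fair coin, this should be close to 50
    println(f"Average heads per $nTosses tosses over $nTrials trials: $meanHeads%.1f")
  }
}
```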
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Clicking the Next button moves you to the next screen.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The code examples are also available on GitHub at www.github.com/pbugnion/s4ds.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Questions
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.
Chapter 1. Scala and Data Science
The second half of the 20th century was the age of silicon. In fifty years, computing power went from extremely scarce to entirely mundane. The first half of the 21st century is the age of the Internet. The last 20 years have seen the rise of giants such as Google, Twitter, and Facebook—giants that have forever changed the way we view knowledge.
The Internet is a vast nexus of information. Ninety percent of the data generated by humanity has been generated