Ebook, 280 pages, 1 hour

Programming MapReduce with Scalding

Name: Programming MapReduce with Scalding
Author: Antonios Chalkiopoulos
ISBN: 9781783287024

About this ebook

This book is an easy-to-understand, practical guide to designing, testing, and implementing complex MapReduce applications in Scala using the Scalding framework. It is packed with examples featuring log processing, ad targeting, and machine learning.
This book is for developers who want to learn how to develop MapReduce applications effectively. Prior knowledge of Hadoop or Scala is not required; however, investing some time in those topics would certainly be beneficial.

Language: English
Publisher: Packt Publishing
Release date: Jun 25, 2014


byData Engineering Podcast
0 ratings
0% found this document useful
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Podcast episode
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
byData Engineering Podcast
0 ratings
0% found this document useful
69: Testing Front End Code: Summary Oren Rubin (@Shexman) goes through why it’s important to not only test the back-end code of our applications but also to test our Front End code, the integration points, and the full user experience. Oren also goes through...
Podcast episode
69: Testing Front End Code: Summary Oren Rubin (@Shexman) goes through why it’s important to not only test the back-end code of our applications but also to test our Front End code, the integration points, and the full user experience. Oren also goes through...
byThe Web Platform Podcast
0 ratings
0% found this document useful
Use Your Data Warehouse To Power Your Product Analytics With NetSpring: With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
Podcast episode
Use Your Data Warehouse To Power Your Product Analytics With NetSpring: With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
byData Engineering Podcast
0 ratings
0% found this document useful
Version Your Data Lakehouse Like Your Software With Nessie: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
Podcast episode
Version Your Data Lakehouse Like Your Software With Nessie: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
byData Engineering Podcast
0 ratings
0% found this document useful
Composable Data Analytics
Podcast episode
Composable Data Analytics
byThe Cloudcast
0 ratings
0% found this document useful
Unlocking Your dbt Projects With Practical Advice For Practitioners: The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
Podcast episode
Unlocking Your dbt Projects With Practical Advice For Practitioners: The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
byData Engineering Podcast
0 ratings
0% found this document useful
67: Keeping Fluent with Web Technology: Summary How do you keep up with the vast amounts of web technology released daily? It can be a losing battle for some and a opportunity for others. One person in our community that comes to mind is Peter Cooper (@peterc) from Cooper Press. Join us...
Podcast episode
67: Keeping Fluent with Web Technology: Summary How do you keep up with the vast amounts of web technology released daily? It can be a losing battle for some and a opportunity for others. One person in our community that comes to mind is Peter Cooper (@peterc) from Cooper Press. Join us...
byThe Web Platform Podcast
0 ratings
0% found this document useful



Book preview

Programming MapReduce with Scalding - Antonios Chalkiopoulos

Programming MapReduce with Scalding

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Introduction to MapReduce

The Hadoop platform

MapReduce

A MapReduce example

MapReduce abstractions

Introducing Cascading

What happens inside a pipe

Pipe assemblies

Cascading extensions

Summary

2. Get Ready for Scalding

Why Scala?

Scala basics

Scala build tools

Hello World in Scala

Development editors

Installing Hadoop in five minutes

Running our first Scalding job

Submitting a Scalding job in Hadoop

Summary

3. Scalding by Example

Reading and writing files

Best practices to read and write files

TextLine parsing

Executing in the local and Hadoop modes

Understanding the core capabilities of Scalding

Map-like operations

Join operations

Pipe operations

Grouping/reducing functions

Operations on groups

Composite operations

A simple example

Typed API

Summary

4. Intermediate Examples

Logfile analysis

Completing the implementation

Exploring ad targeting

Calculating daily points

Calculating historic points

Generating targeted ads

Summary

5. Scalding Design Patterns

The external operations pattern

The dependency injection pattern

The late bound dependency pattern

Summary

6. Testing and TDD

Introduction to testing

MapReduce testing challenges

Development lifecycle with testing strategy

TDD for Scalding developers

Implementing the TDD methodology

Decomposing the algorithm

Defining acceptance tests

Implementing integration tests

Implementing unit tests

Implementing the MapReduce logic

Defining and performing system tests

Black box testing

Summary

7. Running Scalding in Production

Executing Scalding in a Hadoop cluster

Scheduling execution

Coordinating job execution

Configuring using a property file

Configuring using Hadoop parameters

Monitoring Scalding jobs

Using slim JAR files

Scalding execution throttling

Summary

8. Using External Data Stores

Interacting with external systems

SQL databases

NoSQL databases

Understanding HBase

Reading from HBase

Writing in HBase

Using advanced HBase features

Search platforms

Elastic search

Summary

9. Matrix Calculations and Machine Learning

Text similarity using TF-IDF

Setting a similarity using the Jaccard index

K-Means using Mahout

Other libraries

Summary

Index

Programming MapReduce with Scalding

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2014

Production reference: 1190614

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78328-701-7

www.packtpub.com

Credits

Author

Antonios Chalkiopoulos

Reviewers

Ahmad Alkilani

Włodzimierz Bzyl

Tanin Na Nakorn

Sen Xu

Commissioning Editor

Owen Roberts

Acquisition Editor

Llewellyn Rozario

Content Development Editor

Sriram Neelakantan

Technical Editor

Kunal Anil Gaikwad

Copy Editors

Sayanee Mukherjee

Alfida Paiva

Project Coordinator

Aboli Ambardekar

Proofreaders

Mario Cecere

Maria Gould

Indexers

Mehreen Deshmukh

Rekha Nair

Tejal Soni

Graphics

Sheetal Aute

Ronak Dhruv

Valentina Dsilva

Disha Haria

Production Coordinator

Conidon Miranda

Cover Work

Conidon Miranda

Cover Image

Sheetal Aute

About the Author

Antonios Chalkiopoulos is a developer living in London and a professional working with Hadoop and Big Data technologies. He has delivered a number of complex MapReduce applications, written in Scalding, to a production HDFS cluster of more than 40 nodes. He is a contributor to Scalding and other open source projects, and he is interested in cloud technologies, NoSQL databases, distributed real-time computation systems, and machine learning.

He was involved in a number of Big Data projects before discovering Scala and Scalding. Most of the content of this book comes from his experience and knowledge accumulated while working with a great team of engineers.

I would like to thank Rajah Chandan for introducing Scalding to the team and for authoring SpyGlass, and Stefano Galarraga for co-authoring Chapters 5 and 6 and for authoring ScaldingUnit. Both of these libraries are presented in this book.

Saad, Gracia, Deepak, and Tamas, I've learned a lot working next to you all, and this book wouldn't have been possible without all your discoveries. Finally, I would like to thank Christina for bearing with my writing sessions and supporting all my endeavors.

About the Reviewers

Ahmad Alkilani is a data architect specializing in the implementation of high-performance distributed systems, data warehouses, and BI systems. His career has been split between building enterprise applications and products using a variety of web and database technologies, including .NET, SQL Server, Hadoop, Hive, Scala, and Scalding. His recent interests include building real-time web and predictive analytics, as well as streaming and sketching algorithms.

Currently, Ahmad works at Move.com (http://www.realtor.com) and enjoys speaking at various user groups and national conferences. He is also an author on Pluralsight, with courses on Hadoop and Big Data, SQL Server 2014, and more, targeting the Big Data and streaming spaces.

You can find more information on Ahmad on his LinkedIn profile (http://www.linkedin.com/in/ahmadalkilani) or his Pluralsight author page (http://pluralsight.com/training/Authors/Details/ahmad-alkilani).

I would like to thank my family, especially my wonderful wife, Farah, and my beautiful son Maher for putting up with my long working hours and always being there for me.

Włodzimierz Bzyl works at the University of Gdańsk. His current interests include web-related technologies and NoSQL databases.

He has a passion for new technologies and introducing his students to them.

He enjoys contributing to open source software and spending time trekking in the Tatra mountains.

Tanin Na Nakorn is a software engineer who is enthusiastic about building consumer products and open source projects that make people's lives easier. He cofounded Thaiware, a software portal in Thailand, and GiveAsia, a donation platform in Singapore; he currently builds products at Twitter. You may find him on Twitter as @tanin and helping on various open source projects at http://www.github.com/tanin47.

Sen Xu is a software engineer at Twitter; he was previously a data scientist at Inome Inc.

He worked on designing and building data pipelines on top of traditional RDBMS (MySQL, PostgreSQL, and so on) and key-value store solutions (Hadoop). His interests include Big Data analytics, text mining, record linkage, machine learning, and spatial data handling.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print and bookmark content

On demand and accessible via web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Scalding is a relatively new Scala DSL that builds on top of the Cascading pipeline framework, offering a powerful and expressive architecture for MapReduce applications. Scalding provides a highly abstracted layer for design and implementation in a componentized fashion, allowing code reuse and development with the Test Driven Methodology.

Similar to other popular MapReduce technologies such as Pig and Hive, Cascading uses a tuple-based data model, and it is a mature and proven framework that many dynamic languages have built technologies upon. Instead of forcing developers to write raw map and reduce functions while mentally keeping track of key-value pairs throughout the data transformation pipeline, Scalding provides a more natural way to express code.

In simpler terms, programming raw MapReduce is like developing in a low-level programming language such as assembly. On the other hand, Scalding provides an easier way to build complex MapReduce applications and integrates with other distributed applications of the Hadoop ecosystem.
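The key-value bookkeeping described above can be made concrete with a small word count. What follows is only a sketch in plain Scala collections, not Scalding, Cascading, or Hadoop code; the object name `WordCountSketch` and the helpers `mapPhase`, `shufflePhase`, and `reducePhase` are hypothetical names chosen for illustration. It shows the three steps a developer tracks by hand when writing raw map and reduce functions.

```scala
object WordCountSketch {

  // "Map" step: emit one (word, 1) pair for every token in every line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => line.toLowerCase.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .map(word => (word, 1))

  // "Shuffle" step: bring together all values that share the same key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => word -> kvs.map(_._2) }

  // "Reduce" step: collapse the values of each key into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => word -> ones.sum }

  // The whole job is the composition of the three steps.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or", "not to be"))
    // Print one "word<TAB>count" line per word, sorted by word.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

Every intermediate value here is a key-value pair whose meaning the programmer has to remember from one step to the next. A pipeline DSL such as Scalding aims to replace that hand wiring with a short chain of named operations over tuples.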

This book presents MapReduce, Hadoop, and Scalding; it suggests design patterns and idioms and provides ample examples of real implementations for common use cases.

What this book covers

Chapter 1, Introduction to MapReduce, serves as an introduction to the Hadoop platform, to MapReduce, and to the concept of the pipeline abstraction that many Big Data technologies use. It also outlines Cascading, a sophisticated framework that empowers developers to write efficient MapReduce applications.

Chapter 2, Get Ready for Scalding, lays the foundation for working with Scala, using build tools and an IDE, and setting up a local-development Hadoop system. It
