Pentaho Data Integration Cookbook - Second Edition
Pentaho Data Integration Cookbook - Second Edition - María Carina Roldán
Table of Contents
Pentaho Data Integration Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Working with Databases
Introduction
Sample databases
Pentaho BI platform databases
Connecting to a database
Getting ready
How to do it...
How it works...
There's more...
Avoiding creating the same database connection over and over again
Avoiding modifying jobs and transformations every time a connection changes
Specifying advanced connection properties
Connecting to a database not supported by Kettle
Checking the database connection at runtime
Getting data from a database
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from a database by providing parameters
Getting ready
How to do it...
How it works...
There's more...
Parameters coming in more than one row
Executing the SELECT statement several times, each for a different set of parameters
See also
Getting data from a database by running a query built at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Inserting or updating rows in a table
Getting ready
How to do it...
How it works...
There's more...
Alternative solution if you just want to insert records
Alternative solution if you just want to update rows
Alternative way for inserting and updating
See also
Inserting new rows where a simple primary key has to be generated
Getting ready
How to do it...
How it works...
There's more...
Using the Combination lookup/update for looking up
See also
Inserting new rows where the primary key has to be generated based on stored values
Getting ready
How to do it...
How it works...
There's more...
See also
Deleting data from a table
Getting ready
How to do it...
How it works...
See also
Creating or altering a database table from PDI (design time)
Getting ready
How to do it...
How it works...
There's more...
See also
Creating or altering a database table from PDI (runtime)
How to do it...
How it works...
There's more...
See also
Inserting, deleting, or updating a table depending on a field
Getting ready
How to do it...
How it works...
There's more...
Insert, update, and delete all-in-one
Synchronizing after merge
See also
Changing the database connection at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Loading a parent-child table
Getting ready
How to do it...
How it works...
See also
Building SQL queries via database metadata
Getting ready
How to do it...
How it works...
See also
Performing repetitive database design tasks from PDI
Getting ready
How to do it...
How it works...
See also
2. Reading and Writing Files
Introduction
Reading a simple file
Getting ready
How to do it...
How it works...
There's more...
Alternative notation for a separator
About file format and encoding
About data types and formats
Altering the names, order, or metadata of the fields coming from the file
Reading files with fixed width fields
Reading several files at the same time
Getting ready
How to do it...
How it works...
There's more...
Reading semi-structured files
Getting ready
How to do it...
How it works...
There's more...
Master/detail files
Logfiles
See also
Reading files having one field per row
Getting ready
How to do it...
How it works...
There's more...
See also
Reading files with some fields occupying two or more rows
Getting ready
How to do it...
How it works...
See also
Writing a simple file
Getting ready
How to do it...
How it works...
There's more...
Changing headers
Giving the output fields a format
Writing a semi-structured file
Getting ready
How to do it...
How it works...
There's more...
Providing the name of a file (for reading or writing) dynamically
Getting ready
How to do it...
How it works...
There's more...
Get System Info
Generating several files simultaneously with the same structure, but different names
Using the name of a file (or part of it) as a field
Getting ready
How to do it...
How it works...
Reading an Excel file
Getting ready
How to do it...
How it works...
See also
Getting the value of specific cells in an Excel file
Getting ready
How to do it...
How it works...
There's more...
Looking for a given cell
Writing an Excel file with several sheets
Getting ready
How to do it...
How it works...
There's more...
See also
Writing an Excel file with a dynamic number of sheets
Getting ready
How to do it...
How it works...
See also
Reading data from an AWS S3 Instance
Getting ready
How to do it...
How it works...
See also
3. Working with Big Data and Cloud Sources
Introduction
Loading data into Salesforce.com
Getting ready
How to do it...
How it works...
See also
Getting data from Salesforce.com
Getting ready
How to do it...
How it works...
See also
Loading data into Hadoop
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from Hadoop
Getting ready
How to do it...
How it works...
See also
Loading data into HBase
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from HBase
Getting ready
How to do it...
How it works...
See also
Loading data into MongoDB
Getting ready
How to do it...
How it works...
See also
Getting data from MongoDB
Getting ready
How to do it...
How it works...
See also
4. Manipulating XML Structures
Introduction
Reading simple XML files
Getting ready
How to do it...
How it works...
There's more...
XML data in a field
XML file name in a field
See also
Specifying fields by using the XPath notation
Getting ready
How to do it...
How it works...
There's more...
Getting data from a different path
Getting data selectively
Getting more than one node when the nodes share their XPath notation
Saving time when specifying XPath
Validating well-formed XML files
Getting ready
How to do it...
How it works...
See also
Validating an XML file against DTD definitions
Getting ready
How to do it...
How it works...
There's more...
See also
Validating an XML file against an XSD schema
Getting ready
How to do it...
How it works...
There's more...
See also
Generating a simple XML document
Getting ready
How to do it...
How it works...
There's more...
Generating fields with XML structures
See also
Generating complex XML structures
Getting ready
How to do it...
How it works...
See also
Generating an HTML page using XML and XSL transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Reading an RSS Feed
Getting ready
How to do it...
How it works...
See also
Generating an RSS Feed
Getting ready
How to do it...
How it works...
There's more...
See also
5. File Management
Introduction
Copying or moving one or more files
Getting ready
How to do it...
How it works...
There's more...
Moving files
Detecting the existence of the files before copying them
Creating folders
See also
Deleting one or more files
Getting ready
How to do it...
How it works...
There's more...
Figuring out which files have been deleted
See also
Getting files from a remote server
How to do it...
How it works...
There's more...
Specifying files to transfer
Some considerations about connecting to an FTP server
Access via SFTP
Access via FTPS
Getting information about the files being transferred
See also
Putting files on a remote server
Getting ready
How to do it...
How it works...
There's more...
See also
Copying or moving a custom list of files
Getting ready
How to do it...
How it works...
See also
Deleting a custom list of files
Getting ready
How to do it...
How it works...
See also
Comparing files and folders
Getting ready
How to do it...
How it works...
There's more...
Comparing folders
Working with ZIP files
Getting ready
How to do it...
How it works...
There's more...
Avoiding zipping files
Avoiding unzipping files
See also
Encrypting and decrypting files
Getting ready
How to do it...
How it works...
There's more...
See also
6. Looking for Data
Introduction
Looking for values in a database table
Getting ready
How to do it...
How it works...
There's more...
Taking some action when the lookup fails
Taking some action when there are too many results
Looking for non-existent data
See also
Looking for values in a database with complex conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a database with dynamic queries
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a variety of sources
Getting ready
How to do it...
How it works...
There's more...
Looking for alternatives when the Stream Lookup step doesn't meet your needs
Speeding up your transformation
Using the Value Mapper step for looking up from a short list of values
See also
Looking for values by proximity
Getting ready
How to do it...
How it works...
There's more...
Looking for values by using a web service
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values over intranet or the Internet
Getting ready
How to do it...
How it works...
There's more...
See also
Validating data at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
7. Understanding and Optimizing Data Flows
Introduction
Splitting a stream into two or more streams based on a condition
Getting ready
How to do it...
How it works...
There's more...
Avoiding the use of Dummy steps
Comparing against the value of a Kettle variable
Avoiding the use of nested Filter rows steps
Overcoming the difficulties of complex conditions
Merging rows of two streams with the same or different structures
Getting ready
How to do it...
How it works...
There's more...
Making sure that the metadata of the streams is the same
Telling Kettle how to merge the rows of your streams
See also
Adding checksums to verify datasets
Getting ready
How to do it...
How it works...
Comparing two streams and generating differences
Getting ready
How to do it...
How it works...
There's more...
Using the differences to keep a table up-to-date
See also
Generating all possible pairs formed from two datasets
How to do it...
How it works...
There's more...
Getting variables in the middle of the stream
Limiting the number of output rows
See also
Joining two or more streams based on given conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Interspersing new rows between existent rows
Getting ready
How to do it...
How it works...
See also
Executing steps even when your stream is empty
Getting ready
How to do it...
How it works...
There's more...
Processing rows differently based on the row number
Getting ready
How to do it...
How it works...
There's more...
Identifying specific rows
Identifying the last row in the stream
Avoiding using an Add sequence step to enumerate the rows
See also
Processing data into shared transformations via filter criteria and subtransformations
Getting ready
How to do it...
How it works...
See also
Altering a data stream with Select values
How to do it...
How it works...
Processing multiple jobs or transformations in parallel
How to do it...
How it works...
See also
8. Executing and Re-using Jobs and Transformations
Introduction
Sample transformations
Sample transformation – hello
Sample transformation – random list
Sample transformation – sequence
Sample transformation – file list
Launching jobs and transformations
How to do it...
How it works...
Executing a job or a transformation by setting static arguments and parameters
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation whose name is determined at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Executing part of a job once for every row in a dataset
Getting ready
How to do it...
How it works...
There's more...
Accessing the copied rows from jobs, transformations, and other entries
Executing a transformation once for every row in a dataset
Executing a transformation or part of a job once for every file in a list of files
See also
Executing part of a job several times until a condition is true
Getting ready
How to do it...
How it works...
There's more...
Implementing loops in a job
Using the JavaScript step to control the execution of the entries in your job
See also
Creating a process flow
Getting ready
How to do it...
How it works...
There's more...
Serializing/De-serializing data
Other means for transferring or sharing data between transformations
Moving part of a transformation to a subtransformation
Getting ready
How to do it...
How it works...
There's more...
Using Metadata Injection to re-use transformations
Getting ready
How to do it...
How it works...
There's more...
9. Integrating Kettle and the Pentaho Suite
Introduction
A sample transformation
Creating a Pentaho report with data coming from PDI
Getting ready
How to do it...
How it works...
There's more...
Creating a Pentaho report directly from PDI
Getting ready
How to do it...
How it works...
There's more...
See also
Configuring the Pentaho BI Server for running PDI jobs and transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a PDI transformation as part of a Pentaho process
Getting ready
How to do it...
How it works...
There's more...
Specifying the location of the transformation
Supplying values for named parameters, variables and arguments
Keeping things simple when it's time to deliver a plain file
See also
Executing a PDI job from the Pentaho User Console
Getting ready
How to do it...
How it works...
There's more...
See also
Generating files from the PUC with PDI and the CDA plugin
Getting ready
How to do it...
How it works...
There's more...
Populating a CDF dashboard with data coming from a PDI transformation
Getting ready
How to do it...
How it works...
There's more...
10. Getting the Most Out of Kettle
Introduction
Sending e-mails with attached files
Getting ready
How to do it...
How it works...
There's more...
Sending logs through an e-mail
Sending e-mails in a transformation
Generating a custom logfile
Getting ready
How to do it...
How it works...
There's more...
Filtering the logfile
Creating a clean logfile
Isolating logfiles for different jobs or transformations
See also
Running commands on another server
Getting ready
How to do it...
How it works...
See also
Programming custom functionality
Getting ready
How to do it...
How it works...
There's more...
Data type's equivalence
Generalizing your UDJC code
Looking up information with additional steps
Customizing logs
Scripting alternatives to the UDJC step
Generating sample data for testing purposes
How to do it...
How it works...
There's more...
Using a Data grid step to generate specific data
Working with subsets of your data
See also
Working with JSON files
Getting ready
How to do it...
How it works...
There's more...
Reading JSON files dynamically
Writing JSON files
Getting information about transformations and jobs (file-based)
Getting ready
How to do it...
How it works...
There's more...
Job XML nodes
Steps and entries information
See also
Getting information about transformations and jobs (repository-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation tables
Job tables
Database connections tables
Using Spoon's built-in optimization tools
Getting ready
How to do it...
How it works...
There's more...
11. Utilizing Visualization Tools in Kettle
Introduction
Managing plugins with the Marketplace
Getting ready
How to do it...
How it works...
There's more...
See also
Data profiling with DataCleaner
Getting ready
How to do it...
How it works...
There's more...
See also
Visualizing data with AgileBI
Getting ready
How to do it...
How it works...
There's more...
See also
Using Instaview to analyze and visualize data
Getting ready
How to do it...
How it works...
There's more...
See also
12. Data Analytics
Introduction
Reading data from a SAS datafile
Why read a SAS file?
Getting ready
How to do it...
How it works...
See also
Studying data via stream statistics
Getting ready
How to do it...
How it works...
See also
Building a random data sample for Weka
Getting ready
How to do it...
How it works...
There's more...
See also
A. Data Structures
Books data structure
Books
Authors
museums data structure
museums
cities
outdoor data structure
products
categories
Steel Wheels data structure
Lahman Baseball Database
B. References
Books
Online
Index
Pentaho Data Integration Cookbook Second Edition
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2011
Second Edition: November 2013
Production Reference: 1151113
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-067-4
www.packtpub.com
Cover Image by Aniket Sawant (<aniket_sawant_photography@hotmail.com>)
Credits
Author
Alex Meadows
Adrián Sergio Pulvirenti
María Carina Roldán
Reviewers
Wesley Seidel Carvalho
Daniel Lemire
Coty Sutherland
Acquisition Editor
Meeta Rajani
Lead Technical Editor
Arvind Koul
Technical Editors
Dennis John
Adrian Raposo
Gaurav Thingalaya
Project Coordinator
Wendell Palmer
Proofreader
Kevin McGowan
Indexer
Monica Ajmera Mehta
Graphics
Ronak Dhruv
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
About the Author
Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years, in industries ranging from plastics manufacturing to social and e-mail marketing, and most recently in software at Red Hat, Inc. He has been very active in Pentaho and other open source communities to learn, share, and help newcomers with best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.
First and foremost, thank you Christina for being there for me before, during, and after taking on the challenge of writing and revising a book. I know it's not been easy, but thank you for allowing me the opportunity. To my grandmother, thank you for teaching me at a young age to always go for goals that may just be out of reach. Finally, this book would be nowhere without the Pentaho community and the friends I've made over the years as a part of it.
Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.
He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.
I'd like to thank my lovely kids, Camila and Nicolas, who understood that I couldn't share with them the usual video game sessions during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.
María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's degree in Computer Science at UNLP in La Plata; after that she did a postgraduate course in Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has been living since 1994.
She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using Pentaho Suite. Currently, she works for Webdetails, one of the main Pentaho contributors. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide published by Packt Publishing in April 2010.
You can follow her on Twitter at @mariacroldan.
I'd like to thank those who have encouraged me to write this book: on one hand, the Pentaho community, who gave me rewarding feedback after the Beginner's book; on the other, my husband, who, without hesitation, agreed to write the book with me. Without them, I'm not sure I would have embarked on a new book project.
I'd also like to thank the technical reviewers for the time and dedication that they have put in reviewing the book. In particular, thanks to my colleagues at Webdetails; it's a pleasure and a privilege to work with them every day.
About the Reviewers
Wesley Seidel Carvalho got his Master's degree in Computer Science from the Institute of Mathematics and Statistics, University of São Paulo (IME-USP), Brazil, where his dissertation research focused on Natural Language Processing (NLP) for the Portuguese language. He is a Database Specialist from the Federal University of Pará (UFPa) and has a degree in Mathematics from the State University of Pará (Uepa).
Since 2010, he has been working with Pentaho and researching open government data. He is an active member of the Free Software, Open Data, and Pentaho communities and mailing lists in Brazil, contributing to the CoGrOO grammar checker for OpenOffice and to the CoGrOO Community.
He has worked with technology, databases, and systems development since 1997, with Business Intelligence since 2003, and with Pentaho and NLP since 2009. He currently serves his customers through his startups:
http://intelidados.com.br
http://ltasks.com.br
Daniel Lemire has a B.Sc. and an M.Sc. in Mathematics from the University of Toronto, and a Ph.D. in Engineering Mathematics from the École Polytechnique and the Université de Montréal. He is a Computer Science professor at TELUQ (Université du Québec), where he teaches primarily online. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval, and high performance programming. He blogs regularly on computer science at http://lemire.me/blog/.
Coty Sutherland was first introduced to computing around the age of 10. At that time, he was immersed in various aspects of computers and it became apparent that he had a propensity for software manipulation. From then until now, he has stayed involved in learning new things in the software space and adapting to the ever-changing environment of software development. He graduated from Appalachian State University in 2009 with a Bachelor's degree in Computer Science. After graduation, he focused mainly on software application development and support, but recently transitioned to the Business Intelligence field to pursue new and exciting things with data. He is currently employed by the open source company Red Hat as a Business Intelligence Engineer.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
Pentaho Data Integration (also known as Kettle) is one of the leading open source data integration solutions. With Kettle, you can take data from a multitude of sources, transform and conform the data to given requirements, and load the data into just as many target systems. Not only is PDI capable of transforming and cleaning data, it also provides an ever-growing number of plugins to augment what is already a very robust list of features.
Pentaho Data Integration Cookbook Second Edition picks up where the first edition left off, updating the recipes to the latest edition of PDI and diving into new topics such as working with Big Data and cloud sources, data analytics, and more.
Pentaho Data Integration Cookbook Second Edition shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to help you find quick solutions to your needs. The book starts by showing you how to work with data sources such as files and relational databases. It then moves on to working with data streams: merging data from different sources, taking advantage of the different tools to clean up and transform data, and building nested jobs and transformations. More advanced topics are also covered, such as data analytics, data visualization, plugins, and the integration of Kettle with other tools in the Pentaho suite.
Pentaho Data Integration Cookbook Second Edition provides recipes with easy step-by-step instructions to accomplish specific tasks. The code for the recipes can be adapted and built upon to meet individual needs.
What this book covers
Chapter 1, Working with Databases, shows you how to work with relational databases in Kettle. The recipes show you how to create and share database connections, perform typical database operations (select, insert, update, and delete), as well as more advanced tricks such as building and executing queries at runtime.
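The "queries built at runtime" idea from Chapter 1 can be sketched outside Kettle as well. The following Python snippet is illustrative only: the table, columns, and data are made up for the example, and in Kettle you would accomplish this with a Table input step using parameters or variable substitution rather than hand-written code. It assembles a SELECT statement at run time while keeping the value as a bound parameter:

```python
import sqlite3

# Minimal sketch: a query whose WHERE column is chosen at runtime.
# The sample table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Dracula", "horror"), ("Carrie", "horror"), ("Emma", "romance")],
)

def select_by(column, value):
    # The column name is checked against a whitelist before being
    # interpolated; the value always travels as a bound parameter.
    if column not in {"title", "genre"}:
        raise ValueError(column)
    query = f"SELECT title FROM books WHERE {column} = ? ORDER BY title"
    return [row[0] for row in conn.execute(query, (value,))]

print(select_by("genre", "horror"))  # ['Carrie', 'Dracula']
```

The same separation applies in Kettle: variable substitution changes the shape of the SQL, while `?` parameters carry the data values safely.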
Chapter 2, Reading and Writing Files, not only shows you how to read and write files, but also how to work with semi-structured files, and read data from Amazon Web Services.
Chapter 3, Working with Big Data and Cloud Sources, covers how to load and read data from some of the many different NoSQL data sources as well as from Salesforce.com.
Chapter 4, Manipulating XML Structures, shows you how to read, write, and validate XML. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds.
Chapter 5, File Management, demonstrates how to copy, move, transfer, and encrypt files and directories.
Chapter 6, Looking for Data, shows you how to search for information through various methods via databases, web services, files, and more. This chapter also shows you how to validate data with Kettle's built-in validation steps.
Chapter 7, Understanding and Optimizing Data Flows, details how Kettle moves data through jobs and transformations and how to optimize data flows.
Chapter 8, Executing and Re-using Jobs and Transformations, shows you how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.
Chapter 9, Integrating Kettle and the Pentaho Suite, works with some of the other tools in the Pentaho suite to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more.
Chapter 10, Getting the Most Out of Kettle, works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle's database repository.
Chapter 11, Utilizing Visualization Tools in Kettle, explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.
Chapter 12, Data Analytics, shows you how to work with the various analytical tools built into Kettle, focusing on statistics gathering steps and building datasets for Weka.
Appendix A, Data Structures, shows the different data structures used throughout the book.
Appendix B, References, provides a list of books and other resources that will help you connect with the rest of the Pentaho community and learn more about Kettle and the other tools that are part of the Pentaho suite.
What you need for this book
PDI is written in Java. Any operating system that can run a JVM, version 1.5 or higher, should be able to run PDI. Some of the recipes require additional software, as listed below:
Hortonworks Sandbox: This is Hadoop in a box, a great environment to learn how to work with NoSQL solutions without having to install everything.
Web Server with ASP support: This is needed for two recipes to show how to work with web services.
DataCleaner: This is one of the top open source data profiling tools and integrates with Kettle.
MySQL: All the relational database recipes have scripts for MySQL provided. Feel free to use another relational database for those recipes.
In addition, it's recommended to have access to Excel or Calc and a decent text editor (like Notepad++ or gedit).
Having access to an Internet connection will be useful for some of the recipes that use cloud services, as well as making it possible to access the additional links that provide more information about given topics throughout the book.
Who this book is for
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you! This book does not cover the basics of PDI, SQL, database theory, data profiling, and data analytics.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Copy the .jar file containing the driver to the lib directory inside the Kettle installation directory.
A block of code is set as follows:
lastname,firstname,country,birthyear
Larsson,Stieg,Swedish,1954
King,Stephen,American,1947
Hiaasen,Carl,American,1953
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: clicking the Next button moves you to the next screen.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Working with Databases
In this chapter, we will cover:
Connecting to a database
Getting data from a database
Getting data from a database by providing parameters
Getting data from a database by running a query built at runtime
Inserting or updating rows in a table
Inserting new rows when a simple primary key has to be generated
Inserting new rows when the primary key has to be generated based on stored values
Deleting data from a table
Creating or altering a table from PDI (design time)
Creating or altering a table from PDI (runtime)
Inserting, deleting, or updating a table depending on a field
Changing the database connection at runtime
Loading a parent-child table
Building SQL queries via database metadata
Performing repetitive database design tasks from PDI
Introduction
Databases are broadly used by organizations to store and administer transactional data such as customer service history, bank transactions, purchases, sales, and so on. They are also used to store data warehouse data used for Business Intelligence solutions.
In this chapter, you will learn to deal with databases in Kettle. The first recipe tells you how to connect to a database, which is a prerequisite for all the other recipes. The rest of the chapter teaches you how to perform different operations and can be read in any order according to your needs.
Note
The focus of this chapter is on relational databases (RDBMS). Thus, the term database is used as a synonym for relational database throughout the recipes.
Sample databases
Throughout the chapter you will use a couple of sample databases. Those databases can be created and loaded by running the scripts available at the book's website. The scripts are ready to run under MySQL.
Note
If you work with a different DBMS, you may have to modify the scripts slightly.
For more information about the structure of the sample databases and the meaning of the tables and fields, please refer to Appendix A, Data Structures. Feel free to adapt the recipes to different databases. You could try some well-known databases; for example, Foodmart (available as part of the Mondrian distribution at http://sourceforge.net/projects/mondrian/) or the MySQL sample databases (available at http://dev.mysql.com/doc/index-other.html).
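If you do want to adapt the scripts to a different database, the pattern is always the same: create the tables, then load the sample rows. As an illustration only, here is a minimal sketch using SQLite (so it runs with no server installed); the `authors` table and its columns are taken from the sample data shown in the Preface, not from the book's actual MySQL scripts.

```python
import sqlite3

# Open an in-memory SQLite database; for the book's recipes you would
# point your DBMS client at the scripts from the book's website instead.
conn = sqlite3.connect(":memory:")

# Create a table matching the sample data (lastname, firstname,
# country, birthyear) shown earlier in this Preface.
conn.execute("""
    CREATE TABLE authors (
        lastname  TEXT,
        firstname TEXT,
        country   TEXT,
        birthyear INTEGER
    )
""")

# Load the sample rows.
rows = [
    ("Larsson", "Stieg", "Swedish", 1954),
    ("King", "Stephen", "American", 1947),
    ("Hiaasen", "Carl", "American", 1953),
]
conn.executemany("INSERT INTO authors VALUES (?, ?, ?, ?)", rows)

# Verify the load.
count = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
print(count)  # 3
```

The same two steps (DDL, then inserts) are what the provided MySQL scripts perform, which is why porting them to another DBMS usually only requires adjusting data types and quoting.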
Pentaho BI platform databases
As part of the sample databases used in this chapter you will use the Pentaho BI platform Demo databases. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore the capabilities of the Pentaho platform. It relies on the following databases:
By default, all those databases are stored in Hypersonic (HSQLDB). The script for creating the databases in HSQLDB can be found at http://sourceforge.net/projects/pentaho/files. Under Business Intelligence Server | 1.7.1-stable, look for pentaho_sample_data-1.7.1.zip. While there are newer versions of the actual Business Intelligence Server, they all use the same sample dataset.
These databases can be stored in other DBMSs as well. Scripts for creating and loading these databases in other popular DBMSs, for example MySQL or Oracle, can be found in