Pentaho 3.2 Data Integration Beginner's Guide
About this ebook
Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.
This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.
Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned throughout the book.
A practical, easy-to-read guide that gives you full understanding of the Pentaho Data Integration tool and shows you how to use it to your advantage to manipulate data
Approach
As part of Packt's Beginner's Guide series, this book focuses on teaching by example. The book walks you through every aspect of PDI, giving step-by-step instructions in a friendly style, allowing you to learn in front of your computer, playing with the tool. The extensive use of drawings and screenshots makes the process of learning PDI easy. Throughout the book, numerous tips and helpful hints are provided that you will not find anywhere else.
The book provides short, practical examples and also builds from scratch a small datamart intended to reinforce the learned concepts and to teach you the basics of data warehousing.
Who this book is for
This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.
You will find this book a good starting point if you are a database administrator, data warehouse designer, architect, or any person who is responsible for data warehouse projects and needs to load data into them.
You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.
Maria Carina Roldan
Maria Carina Roldan was born in Esquel, Argentina, and earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP) and then moved to Buenos Aires, where she has lived since 1994. She has worked as a BI consultant for almost fifteen years. She started working with Pentaho technology back in 2006. Over the last three and a half years, she has been devoted to working full time for Webdetails—a company acquired by Pentaho in 2013—as an ETL specialist. Carina is the author of Pentaho 3.2 Data Integration Beginner's Guide, Packt Publishing, April 2010, and the co-author of Pentaho Data Integration 4 Cookbook, Packt Publishing, June 2011.
Book preview
Pentaho 3.2 Data Integration Beginner's Guide - Maria Carina Roldan
Table of Contents
Pentaho 3.2 Data Integration
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real world scenarios
Loading datawarehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI using Pentaho BI
Pop quiz—PDI data sources
Installing PDI
Time for action—installing PDI
What just happened?
Pop quiz—PDI prerequisites
Launching the PDI graphical designer: Spoon
Time for action—starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action—creating a hello world transformation
What just happened?
Directing the Kettle engine with transformations
Exploring the Spoon interface
Viewing the transformation structure
Running and previewing the transformation
Time for action—running and previewing the hello_world transformation
What just happened?
Previewing the results in the Execution Results window
Pop quiz—PDI basics
Installing MySQL
Time for action—installing MySQL on Windows
What just happened?
Time for action—installing MySQL on Ubuntu
What just happened?
Summary
2. Getting Started with Transformations
Reading data from files
Time for action—reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action—reading all your files at a time using a single Text file input step
What just happened?
Time for action—reading all your files at a time using a single Text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Grids
Have a go hero—explore your own files
Sending data to files
Time for action—sending the results of matches to a plain file
What just happened?
Output files
Output steps
Some data definitions
Rowset
Streams
The Select values step
Have a go hero—extending your transformations by writing output files
Getting system information
Time for action—updating a file with news about examinations
What just happened?
Getting information by using Get System Info step
Data types
Date fields
Numeric fields
Running transformations from a terminal window
Time for action—running the examination transformation from a terminal window
What just happened?
Have a go hero—using different date formats
Have a go hero—formatting 99.55
Pop quiz—formatting data
XML files
Time for action—getting data from an XML file with information about countries
What just happened?
What is XML
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero—exploring XML files
Have a go hero—enhancing the output countries file
Have a go hero—documenting your work
Summary
3. Basic Data Manipulation
Basic calculations
Time for action—reviewing examinations by using the Calculator step
What just happened?
Adding or modifying fields by using different PDI steps
The Calculator step
The Formula step
Time for action—reviewing examinations by using the Formula step
What just happened?
Have a go hero—listing students and their examinations results
Pop quiz—concatenating strings
Calculations on groups of rows
Time for action—calculating World Cup statistics by grouping data
What just happened?
Group by step
Have a go hero—calculating statistics for the examinations
Have a go hero—listing the languages spoken by country
Filtering
Time for action—counting frequent words by filtering
What just happened?
Filtering rows using the Filter rows step
Have a go hero—playing with filters
Have a go hero—counting words and discarding those that are commonly used
Looking up data
Time for action—finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero—counting words more precisely
Summary
4. Controlling the Flow of Data
Splitting streams
Time for action—browsing new PDI features by copying a dataset
What just happened?
Copying rows
Have a go hero—recalculating statistics
Distributing rows
Time for action—assigning tasks by distributing
What just happened?
Pop quiz—data movement (copying and distributing)
Splitting the stream based on conditions
Time for action—assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action—assigning tasks by filtering priorities with the Switch/Case step
What just happened?
Have a go hero—listing languages and countries
Pop quiz—splitting a stream
Merging streams
Time for action—gathering progress and merging all together
What just happened?
PDI options for merging streams
Time for action—giving priority to Bouchard by using Append Stream
What just happened?
Have a go hero—sorting and merging all tasks
Have a go hero—trying to find missing countries
Summary
5. Transforming Your Data with JavaScript Code and the JavaScript Step
Doing simple tasks with the JavaScript step
Time for action—calculating scores with JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified Java Script Value step
Adding fields
Modifying fields
Turning on the compatibility switch
Have a go hero—adding and modifying fields to the contest data
Testing your code
Time for action—testing the calculation of averages
What just happened?
Testing the script using the Test script button
Have a go hero—testing the new calculation of the average
Enriching the code
Time for action—calculating flexible scores by using variables
What just happened?
Using named parameters
Using the special Start, Main, and End scripts
Using transformation predefined constants
Pop quiz—finding the 7 errors
Have a go hero—keeping the top 10 performances
Have a go hero—calculating scores with Java code
Reading and parsing unstructured files
Time for action—changing a list of house descriptions with JavaScript
What just happened?
Looking at previous rows
Have a go hero—enhancing the houses file
Have a go hero—fill gaps in the contest file
Avoiding coding by using purpose-built steps
Have a go hero—creating alternative solutions
Summary
6. Transforming the Row Set
Converting rows to columns
Time for action—enhancing a films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row denormalizer step
Have a go hero—houses revisited
Aggregating data with a Row denormalizer step
Time for action—calculating total scores by performances by country
What just happened?
Using Row denormalizer for aggregating data
Have a go hero—calculating scores by skill by continent
Normalizing data
Time for action—enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normalizer step
Summarizing the PDI steps that operate on sets of rows
Have a go hero—verifying the benefits of normalization
Have a go hero—normalizing the Films file
Have a go hero—calculating scores by judge
Generating a custom time dimension dataset by using Kettle variables
Time for action—creating the time dimension dataset
What just happened?
Getting variables
Time for action—getting variables for setting the default starting date
What just happened?
Using the Get Variables step
Have a go hero—enhancing the time dimension
Pop quiz—using Kettle variables inside transformations
Summary
7. Validating Data and Handling Errors
Capturing errors
Time for action—capturing errors while calculating the age of a film
What just happened?
Using PDI error handling functionality
Aborting a transformation
Time for action—aborting when there are too many errors
What just happened?
Aborting a transformation using the Abort step
Fixing captured errors
Time for action—treating errors that may appear
What just happened?
Treating rows coming to the error stream
Pop quiz—PDI error handling
Have a go hero—capturing errors while seeing who wins
Avoiding unexpected errors by validating data
Time for action—validating genres with a Regex Evaluation step
What just happened?
Validating data
Time for action—checking films file with the Data Validator
What just happened?
Defining simple validation rules using the Data Validator
Have a go hero—validating the football matches file
Cleansing data
Have a go hero—cleansing films data
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action—creating a connection with the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz—defining database connections
Have a go hero—connecting to your own databases
Exploring the Steel Wheels database
Time for action—exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the PDI Database explorer
Have a go hero—exploring the sample data in depth
Have a go hero—exploring your own databases
Querying a database
Time for action—getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries by using parameters
Time for action—getting orders in a range of dates by using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action—getting orders in a range of dates by using variables
What just happened?
Using Kettle variables in your queries
Pop quiz—database datatypes versus PDI datatypes
Have a go hero—querying the sample data
Sending data to a database
Time for action—loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action—inserting new products or updating existent ones
What just happened?
Time for action—testing the update of existing products
What just happened?
Inserting or updating data with the Insert/Update step
Have a go hero—populating a films database
Have a go hero—creating the time dimension
Have a go hero—populating the products table
Pop quiz—Insert/Update step versus Table Output/Update steps
Pop quiz—filtering the first 10 rows
Eliminating data from a database
Time for action—deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero—deleting old orders
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action—populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action—using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero—preparing the delivery of the products
Have a go hero—refining the transformation
Doing complex lookups
Time for action—using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero—rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action—loading a region dimension with a Combination lookup/update step
What just happened?
Time for action—testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero—adding regions to the Region Dimension
Have a go hero—loading the manufacturers dimension
Have a go hero—loading a mini-dimension
Keeping a history of changes
Time for action—keeping a history of product changes with the Dimension lookup/update step
What just happened?
Time for action—testing the transformation that keeps a history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero—keeping a history just for the theme of a product
Have a go hero—loading a Type II SCD dimension
Pop quiz—loading slowly changing dimensions
Pop quiz—loading type III slowly changing dimensions
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action—creating a simple hello world job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Using the transformation job entry
Pop quiz—defining PDI jobs
Have a go hero—loading the dimension tables
Receiving arguments and parameters in a job
Time for action—customizing the hello world file with arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero—backing up your work
Running jobs from a terminal window
Time for action—executing the hello world job from a terminal window
What just happened?
Have a go hero—experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action—calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero—saying hello again and again
Have a go hero—loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero—analysing the use of arguments and named parameters
Running job entries under conditions
Time for action—sending a sales report and warning the administrator if something is wrong
What just happened?
Changing the flow of execution on the basis of conditions
Have a go hero—refining the sales report
Creating and using a file results list
Have a go hero—sharing your work
Summary
11. Creating Advanced Transformations and Jobs
Enhancing your processes with the use of variables
Time for action—updating a file with news about examinations by setting a variable with the name of the file
What just happened?
Setting variables inside a transformation
Have a go hero—enhancing the examination tutorial even more
Have a go hero—enhancing the jigsaw database update process
Have a go hero—executing the proper jigsaw database update process
Enhancing the design of your processes
Time for action—generating files with top scores
What just happened?
Pop quiz—using the Add Sequence step
Reusing part of your transformations
Time for action—calculating the top scores with a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero—refining the subtransformation
Have a go hero—counting words more precisely (second version)
Creating a job as a process flow
Time for action—splitting the generation of top scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero—modifying the flow
Nesting jobs
Time for action—generating the files with top scores by nesting jobs
What just happened?
Running a job inside another job with a job entry
Understanding the scope of variables
Pop quiz—deciding the scope of variables
Iterating jobs and transformations
Time for action—generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero—processing several files at once
Have a go hero—building lists of products to buy
Have a go hero—e-mail students to let them know how they did
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action—loading dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero—loading the dimensions for the puzzles star model
Loading a fact table with aggregated data
Time for action—loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for a Type I SCD
Obtaining the surrogate key for a Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz—modifying a star model and loading the star with PDI
Have a go hero—loading a puzzles fact table
Getting facts and dimensions together
Time for action—loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action—loading the sales star
What just happened?
Have a go hero—enhancing the loading process of the sales fact table
Have a go hero—loading the puzzles sales star
Have a go hero—loading the facts once a month
Getting rid of administrative tasks
Time for action—automating the loading of the sales datamart
What just happened?
Have a go hero—creating a backup of your work automatically
Have a go hero—enhancing the automated process by sending an e-mail if an error occurs
Summary
13. Taking it Further
PDI best practices
Getting the most out of PDI
Extending Kettle with plugins
Have a go hero—listing the top 10 students by using the Head plugin step
Overcoming real world risks with some remote execution
Scaling out to overcome bigger risks
Pop quiz—remote execution and clustering
Integrating PDI and the Pentaho BI suite
PDI as a process action
PDI as a datasource
More about the Pentaho suite
PDI Enterprise Edition and Kettle Developer Support
Summary
A. Working with Repositories
Creating a repository
Time for action—creating a PDI repository
What just happened?
Creating repositories to store your transformations and jobs
Working with the repository storage system
Time for action—logging into a repository
What just happened?
Logging into a repository by using credentials
Defining repository user accounts
Creating transformations and jobs in repository folders
Creating database connections, partitions, servers, and clusters
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository explorer
Migrating from a file-based system to a repository-based system and vice-versa
Summary
B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command line options
Checking the exit code
Providing options when running Pan and Kitchen
Log details
Named parameters
Arguments
Variables
C. Quick Reference: Steps and Job Entries
Transformation steps
Job entries
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
E. Introducing PDI 4 Features
Agile BI
Visual improvements for designing transformations and jobs
Experiencing the mouse-over assistance
Time for action—creating a hop with the mouse-over assistance
What just happened?
Using the mouse-over assistance toolbar
Experiencing the sniff-testing feature
Experiencing the job drill-down feature
Experiencing even more visual changes
Enterprise features
Summary
F. Pop quiz—Answers
Chapter 1
PDI data sources
PDI prerequisites
PDI basics
Chapter 2
formatting data
Chapter 3
concatenating strings
Chapter 4
data movement (copying and distributing)
splitting a stream
Chapter 5
finding the seven errors
Chapter 6
using Kettle variables inside transformations
Chapter 7
PDI error handling
Chapter 8
defining database connections
database datatypes versus PDI datatypes
Insert/Update step versus Table Output/Update steps
filtering the first 10 rows
Chapter 9
loading slowly changing dimensions
loading type III slowly changing dimensions
Chapter 10
defining PDI jobs
Chapter 11
using the Add sequence step
deciding the scope of variables
Chapter 12
modifying a star model and loading the star with PDI
Chapter 13
remote execution and clustering
Index
Pentaho 3.2 Data Integration Beginner's Guide
Maria Carina Roldan
Pentaho 3.2 Data Integration
Beginner's Guide
Copyright © 2010 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, Packt Publishing, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2010
Production Reference: 1050410
Published by Packt Publishing Ltd., 32 Lincoln Road, Olton, Birmingham, B27 6PA, UK.
ISBN 978-1-847199-54-6
www.packtpub.com
Cover Image by Parag Kadam (<paragvkadam@gmail.com>)
Credits
Author
María Carina Roldán
Reviewers
Jens Bleuel
Roland Bouman
Matt Casters
James Dixon
Will Gorman
Gretchen Moran
Acquisition Editor
Usha Iyer
Development Editor
Reshma Sundaresan
Technical Editors
Gaurav Datar
Rukhsana Khambatta
Copy Editor
Sanchari Mukherjee
Editorial Team Leader
Gagandeep Singh
Project Team Leader
Lata Basantani
Project Coordinator
Poorvi Nair
Proofreader
Sandra Hopper
Indexer
Rekha Nair
Graphics
Geetanjali Sawant
Production Coordinator
Shantanu Zagade
Cover Work
Shantanu Zagade
Foreword
If we look back at what has happened in the data integration market over the last 10 years, we can see a lot of change. In the first half of that decade there was an explosion in the number of data integration tools, and in the second half there was a big wave of consolidations. This consolidation wave put an ever-growing amount of data integration power in the hands of only a few large billion-dollar companies. For any person, company, or project in need of data integration, this meant either paying large amounts of money or hand-coding their solution.
During that exact same period, we saw web servers, programming languages, operating systems, and even relational databases turn into commodities in the ICT marketplace. This was driven, among other things, by the availability of open source software such as Apache, GNU, Linux, MySQL, and many others. For the ICT market, this meant that more services could be deployed at a lower cost. If you look closely at what has been going on in those last 10 years, you will notice that most companies increasingly deployed more ICT services to end users. These services get more and more connected over an ever-growing network. Pretty much anything, ranging from tiny mobile devices to huge cloud-based infrastructure, is being deployed, and all of those can contain data that is valuable to an organization.
The job of any person who needs to integrate all this data is not easy. The complexity of information services technology usually increases exponentially with the number of systems involved. Because of this, integrating all these systems can be a daunting and scary task that is never complete. Any piece of code lives in what can be described as a software ecosystem that is always in a state of flux. Like in nature, certain ecosystems evolve extremely fast while others change very slowly over time. However, like in nature, all ICT systems change. What is needed is another wave of commodification in the area of data integration and business intelligence in general. This is where Pentaho comes in.
Pentaho tries to provide answers to these problems by making the integration software available as open source: accessible, easy to use, and easy to maintain for users and developers alike. With every release of our software, we try to make things easier, better, and faster. However, even if things can be done with nice user interfaces, there is still a huge number of possibilities and options to choose from.
As the founder of the project I've always liked the fact that Kettle users had a lot of choice. Choice translates into creativity, and creativity often delivers good solutions that are comfortable to the person implementing them. However, this choice can be daunting to any beginning Kettle developer. With thousands of options to choose from, it can be very hard to get started.
This, above all else, is the reason why I'm very happy to see this book come to life. It will be a great and indispensable help for everyone who is taking their first steps into the wonderful world of data integration with Kettle. As such, I hope you see this book as an open invitation to get started with Kettle.
Matt Casters
Chief Data Integration at Pentaho
Kettle founder
The Kettle Project
Whether there is a migration to do, an ETL process to run, or a need for massively loading data into a database, you have several software tools, ranging from expensive and sophisticated to free open source and friendly ones, which help you accomplish the task.
Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a datawarehouse architect and administrator. As such, he was one of quite a number of people who, no matter if the company they worked for was big or small, had to deal with the difficulties involved in bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.
At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time left and right in the evenings and on weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years, and that he would perhaps have something working in three.
Work on the software started in early 2003. Matt's main goals for writing it included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances on a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.
Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.
The last but not least goal was to pick the right technology that would support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released plans to create drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience with it: he had written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.
After a year of development, the tool was capable of reading text files, reading from databases, and writing to databases, and it was very flexible. The experience with Java was not 100% positive though. The code had grown unstructured, crashes occurred all too often, and it was hard to get something going with the Java graphics library used at that moment, the Abstract Window Toolkit (AWT); it looked bad and it was slow.
As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. The help came in the hands of Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be), and a friend of Matt's. At various intervals over the next few years, Wim involved himself in the project, giving advice to Matt about good practices in Java programming. Listening to that advice meant performing massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing thousands of errors because of that. But, bit by bit, things kept going in the right direction.
At that same time, Matt also showed the results to his peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That was how he got in touch with the Flemish Traffic Centre (www.verkeerscentrum.be/verkeersinfo/kaart), where billions of rows of data had to be integrated from thousands of data sources all over Belgium. All of a sudden, he was being paid to deploy and improve Kettle to handle that job. The diversity of test cases at the traffic centre helped to improve Kettle dramatically. That was somewhere in 2004, and Kettle was at version 1.2 by then.
While working at the Flemish Traffic Centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Despite some of them being remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany, who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.
For various reasons, including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news and Jens was really interested. It was a very memorable moment for Matt and Jens as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting the sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.
Those were days of improvements, requests, and people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005, and by late summer he had made his decision. Jens and Proratio didn't mind, and the decision was final.
When they finally open sourced Kettle in December 2005, the response was massive. The downloadable package put up on Javaforge got downloaded around 35,000 times during the first week alone. The news spread all over the world pretty quickly.
What followed was a flood of messages, both private and on the forum. At its peak in March 2006, Matt got over 300 messages a day concerning Kettle.
In no time, he was answering questions like crazy, allowing people to join the development team, and working as a consultant at the same time. With the birth of his daughter Hannelore in February 2006 added to this, it was too much to deal with.
Fortunately, good times came. While Matt was trying to handle all that, a discussion was taking place on the Pentaho forum (http://forums.pentaho.org/) concerning the ETL tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL tool, but they didn't have a strong reliance on any specific tool.
While Jens was evaluating all sorts of open source BI packages, he came across that thread. Matt replied immediately, persuading the people at Pentaho to consider including Kettle. And he must have been convincing, because the answer came quickly and was positive. James Dixon, Pentaho founder and CTO, offered Kettle the possibility of becoming the premier and only ETL tool supported by Pentaho. Later on, Matt got in touch with one of the other Pentaho founders, Richard Daley, who offered him a job. That allowed Matt to focus full-time on Kettle. Four years later, he's still happily working for Pentaho as chief architect for data integration, doing his best to deliver Kettle 4.0. Jens Bleuel, who has collaborated with Matt since the early versions, is now also part of the Pentaho team.
About the Author
María Carina was born in a small town in the Patagonia region of Argentina. She earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires, where she has lived since 1994, working in IT.
She has been working as a BI consultant for the last 10 years. At the beginning, she worked with the Cognos suite. However, over the last three years, she has been dedicated full time to developing Pentaho BI solutions, both for local companies and for several Latin American companies, as well as for a French automotive company in the last few months.
She is also an active contributor to the Pentaho community.
At present, she lives in Buenos Aires, Argentina, with her husband Adrián and children Camila and Nicolás.
Writing my first book in a foreign language and working a full-time job at the same time, not to mention the upbringing of two small kids, was definitely a big challenge. Now I can say that it's not impossible.
I dedicate this book to my husband and kids; I'd like to thank them for all their support and tolerance over the last year. I'd also like to thank my colleagues and friends who gave me encouraging words throughout the writing process.
Special thanks to the people at Packt; working with them has been really pleasant.
I'd also like to thank the Pentaho community and developers for making Kettle the incredible tool it is. Thanks to the technical reviewers who, with their very critical eye, contributed to making this a book suited to its audience.
Finally, I'd like to thank Matt Casters who, despite