Pentaho Data Integration Cookbook - Second Edition
Pentaho Data Integration Cookbook - Second Edition - María Carina Roldán
Table of Contents
Pentaho Data Integration Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Working with Databases
Introduction
Sample databases
Pentaho BI platform databases
Connecting to a database
Getting ready
How to do it...
How it works...
There's more...
Avoiding creating the same database connection over and over again
Avoiding modifying jobs and transformations every time a connection changes
Specifying advanced connection properties
Connecting to a database not supported by Kettle
Checking the database connection at runtime
Getting data from a database
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from a database by providing parameters
Getting ready
How to do it...
How it works...
There's more...
Parameters coming in more than one row
Executing the SELECT statement several times, each for a different set of parameters
See also
Getting data from a database by running a query built at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Inserting or updating rows in a table
Getting ready
How to do it...
How it works...
There's more...
Alternative solution if you just want to insert records
Alternative solution if you just want to update rows
Alternative way for inserting and updating
See also
Inserting new rows where a simple primary key has to be generated
Getting ready
How to do it...
How it works...
There's more...
Using the Combination lookup/update for looking up
See also
Inserting new rows where the primary key has to be generated based on stored values
Getting ready
How to do it...
How it works...
There's more...
See also
Deleting data from a table
Getting ready
How to do it...
How it works...
See also
Creating or altering a database table from PDI (design time)
Getting ready
How to do it...
How it works...
There's more...
See also
Creating or altering a database table from PDI (runtime)
How to do it...
How it works...
There's more...
See also
Inserting, deleting, or updating a table depending on a field
Getting ready
How to do it...
How it works...
There's more...
Insert, update, and delete all-in-one
Synchronizing after merge
See also
Changing the database connection at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Loading a parent-child table
Getting ready
How to do it...
How it works...
See also
Building SQL queries via database metadata
Getting ready
How to do it...
How it works...
See also
Performing repetitive database design tasks from PDI
Getting ready
How to do it...
How it works...
See also
2. Reading and Writing Files
Introduction
Reading a simple file
Getting ready
How to do it...
How it works...
There's more...
Alternative notation for a separator
About file format and encoding
About data types and formats
Altering the names, order, or metadata of the fields coming from the file
Reading files with fixed width fields
Reading several files at the same time
Getting ready
How to do it...
How it works...
There's more...
Reading semi-structured files
Getting ready
How to do it...
How it works...
There's more...
Master/detail files
Logfiles
See also
Reading files having one field per row
Getting ready
How to do it...
How it works...
There's more...
See also
Reading files with some fields occupying two or more rows
Getting ready
How to do it...
How it works...
See also
Writing a simple file
Getting ready
How to do it...
How it works...
There's more...
Changing headers
Giving the output fields a format
Writing a semi-structured file
Getting ready
How to do it...
How it works...
There's more...
Providing the name of a file (for reading or writing) dynamically
Getting ready
How to do it...
How it works...
There's more...
Get System Info
Generating several files simultaneously with the same structure, but different names
Using the name of a file (or part of it) as a field
Getting ready
How to do it...
How it works...
Reading an Excel file
Getting ready
How to do it...
How it works...
See also
Getting the value of specific cells in an Excel file
Getting ready
How to do it...
How it works...
There's more...
Looking for a given cell
Writing an Excel file with several sheets
Getting ready
How to do it...
How it works...
There's more...
See also
Writing an Excel file with a dynamic number of sheets
Getting ready
How to do it...
How it works...
See also
Reading data from an AWS S3 Instance
Getting ready
How to do it...
How it works...
See also
3. Working with Big Data and Cloud Sources
Introduction
Loading data into Salesforce.com
Getting ready
How to do it...
How it works...
See also
Getting data from Salesforce.com
Getting ready
How to do it...
How it works...
See also
Loading data into Hadoop
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from Hadoop
Getting ready
How to do it...
How it works...
See also
Loading data into HBase
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from HBase
Getting ready
How to do it...
How it works...
See also
Loading data into MongoDB
Getting ready
How to do it...
How it works...
See also
Getting data from MongoDB
Getting ready
How to do it...
How it works...
See also
4. Manipulating XML Structures
Introduction
Reading simple XML files
Getting ready
How to do it...
How it works...
There's more...
XML data in a field
XML file name in a field
See also
Specifying fields by using the XPath notation
Getting ready
How to do it...
How it works...
There's more...
Getting data from a different path
Getting data selectively
Getting more than one node when the nodes share their XPath notation
Saving time when specifying XPath
Validating well-formed XML files
Getting ready
How to do it...
How it works...
See also
Validating an XML file against DTD definitions
Getting ready
How to do it...
How it works...
There's more...
See also
Validating an XML file against an XSD schema
Getting ready
How to do it...
How it works...
There's more...
See also
Generating a simple XML document
Getting ready
How to do it...
How it works...
There's more...
Generating fields with XML structures
See also
Generating complex XML structures
Getting ready
How to do it...
How it works...
See also
Generating an HTML page using XML and XSL transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Reading an RSS Feed
Getting ready
How to do it...
How it works...
See also
Generating an RSS Feed
Getting ready
How to do it...
How it works...
There's more...
See also
5. File Management
Introduction
Copying or moving one or more files
Getting ready
How to do it...
How it works...
There's more...
Moving files
Detecting the existence of the files before copying them
Creating folders
See also
Deleting one or more files
Getting ready
How to do it...
How it works...
There's more...
Figuring out which files have been deleted
See also
Getting files from a remote server
How to do it...
How it works...
There's more...
Specifying files to transfer
Some considerations about connecting to an FTP server
Access via SFTP
Access via FTPS
Getting information about the files being transferred
See also
Putting files on a remote server
Getting ready
How to do it...
How it works...
There's more...
See also
Copying or moving a custom list of files
Getting ready
How to do it...
How it works...
See also
Deleting a custom list of files
Getting ready
How to do it...
How it works...
See also
Comparing files and folders
Getting ready
How to do it...
How it works...
There's more...
Comparing folders
Working with ZIP files
Getting ready
How to do it...
How it works...
There's more...
Avoiding zipping files
Avoiding unzipping files
See also
Encrypting and decrypting files
Getting ready
How to do it...
How it works...
There's more...
See also
6. Looking for Data
Introduction
Looking for values in a database table
Getting ready
How to do it...
How it works...
There's more...
Taking some action when the lookup fails
Taking some action when there are too many results
Looking for non-existent data
See also
Looking for values in a database with complex conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a database with dynamic queries
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a variety of sources
Getting ready
How to do it...
How it works...
There's more...
Looking for alternatives when the Stream Lookup step doesn't meet your needs
Speeding up your transformation
Using the Value Mapper step for looking up from a short list of values
See also
Looking for values by proximity
Getting ready
How to do it...
How it works...
There's more...
Looking for values by using a web service
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values over intranet or the Internet
Getting ready
How to do it...
How it works...
There's more...
See also
Validating data at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
7. Understanding and Optimizing Data Flows
Introduction
Splitting a stream into two or more streams based on a condition
Getting ready
How to do it...
How it works...
There's more...
Avoiding the use of Dummy steps
Comparing against the value of a Kettle variable
Avoiding the use of nested Filter rows steps
Overcoming the difficulties of complex conditions
Merging rows of two streams with the same or different structures
Getting ready
How to do it...
How it works...
There's more...
Making sure that the metadata of the streams is the same
Telling Kettle how to merge the rows of your streams
See also
Adding checksums to verify datasets
Getting ready
How to do it...
How it works...
Comparing two streams and generating differences
Getting ready
How to do it...
How it works...
There's more...
Using the differences to keep a table up-to-date
See also
Generating all possible pairs formed from two datasets
How to do it...
How it works...
There's more...
Getting variables in the middle of the stream
Limiting the number of output rows
See also
Joining two or more streams based on given conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Interspersing new rows between existent rows
Getting ready
How to do it...
How it works...
See also
Executing steps even when your stream is empty
Getting ready
How to do it...
How it works...
There's more...
Processing rows differently based on the row number
Getting ready
How to do it...
How it works...
There's more...
Identifying specific rows
Identifying the last row in the stream
Avoiding using an Add sequence step to enumerate the rows
See also
Processing data into shared transformations via filter criteria and subtransformations
Getting ready
How to do it...
How it works...
See also
Altering a data stream with Select values
How to do it...
How it works...
Processing multiple jobs or transformations in parallel
How to do it...
How it works...
See also
8. Executing and Re-using Jobs and Transformations
Introduction
Sample transformations
Sample transformation – hello
Sample transformation – random list
Sample transformation – sequence
Sample transformation – file list
Launching jobs and transformations
How to do it...
How it works...
Executing a job or a transformation by setting static arguments and parameters
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation whose name is determined at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Executing part of a job once for every row in a dataset
Getting ready
How to do it...
How it works...
There's more...
Accessing the copied rows from jobs, transformations, and other entries
Executing a transformation once for every row in a dataset
Executing a transformation or part of a job once for every file in a list of files
See also
Executing part of a job several times until a condition is true
Getting ready
How to do it...
How it works...
There's more...
Implementing loops in a job
Using the JavaScript step to control the execution of the entries in your job
See also
Creating a process flow
Getting ready
How to do it...
How it works...
There's more...
Serializing/De-serializing data
Other means for transferring or sharing data between transformations
Moving part of a transformation to a subtransformation
Getting ready
How to do it...
How it works...
There's more...
Using Metadata Injection to re-use transformations
Getting ready
How to do it...
How it works...
There's more...
9. Integrating Kettle and the Pentaho Suite
Introduction
A sample transformation
Creating a Pentaho report with data coming from PDI
Getting ready
How to do it...
How it works...
There's more...
Creating a Pentaho report directly from PDI
Getting ready
How to do it...
How it works...
There's more...
See also
Configuring the Pentaho BI Server for running PDI jobs and transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a PDI transformation as part of a Pentaho process
Getting ready
How to do it...
How it works...
There's more...
Specifying the location of the transformation
Supplying values for named parameters, variables and arguments
Keeping things simple when it's time to deliver a plain file
See also
Executing a PDI job from the Pentaho User Console
Getting ready
How to do it...
How it works...
There's more...
See also
Generating files from the PUC with PDI and the CDA plugin
Getting ready
How to do it...
How it works...
There's more...
Populating a CDF dashboard with data coming from a PDI transformation
Getting ready
How to do it...
How it works...
There's more...
10. Getting the Most Out of Kettle
Introduction
Sending e-mails with attached files
Getting ready
How to do it...
How it works...
There's more...
Sending logs through an e-mail
Sending e-mails in a transformation
Generating a custom logfile
Getting ready
How to do it...
How it works...
There's more...
Filtering the logfile
Creating a clean logfile
Isolating logfiles for different jobs or transformations
See also
Running commands on another server
Getting ready
How to do it...
How it works...
See also
Programming custom functionality
Getting ready
How to do it...
How it works...
There's more...
Data type's equivalence
Generalizing your UDJC code
Looking up information with additional steps
Customizing logs
Scripting alternatives to the UDJC step
Generating sample data for testing purposes
How to do it...
How it works...
There's more...
Using a Data grid step to generate specific data
Working with subsets of your data
See also
Working with JSON files
Getting ready
How to do it...
How it works...
There's more...
Reading JSON files dynamically
Writing JSON files
Getting information about transformations and jobs (file-based)
Getting ready
How to do it...
How it works...
There's more...
Job XML nodes
Steps and entries information
See also
Getting information about transformations and jobs (repository-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation tables
Job tables
Database connections tables
Using Spoon's built-in optimization tools
Getting ready
How to do it...
How it works...
There's more...
11. Utilizing Visualization Tools in Kettle
Introduction
Managing plugins with the Marketplace
Getting ready
How to do it...
How it works...
There's more...
See also
Data profiling with DataCleaner
Getting ready
How to do it...
How it works...
There's more...
See also
Visualizing data with AgileBI
Getting ready
How to do it...
How it works...
There's more...
See also
Using Instaview to analyze and visualize data
Getting ready
How to do it...
How it works...
There's more...
See also
12. Data Analytics
Introduction
Reading data from a SAS datafile
Why read a SAS file?
Getting ready
How to do it...
How it works...
See also
Studying data via stream statistics
Getting ready
How to do it...
How it works...
See also
Building a random data sample for Weka
Getting ready
How to do it...
How it works...
There's more...
See also
A. Data Structures
Books data structure
Books
Authors
museums data structure
museums
cities
outdoor data structure
products
categories
Steel Wheels data structure
Lahman Baseball Database
B. References
Books
Online
Index
Pentaho Data Integration Cookbook Second Edition
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2011
Second Edition: November 2013
Production Reference: 1151113
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-067-4
www.packtpub.com
Cover Image by Aniket Sawant (<aniket_sawant_photography@hotmail.com>)
Credits
Author
Alex Meadows
Adrián Sergio Pulvirenti
María Carina Roldán
Reviewers
Wesley Seidel Carvalho
Daniel Lemire
Coty Sutherland
Acquisition Editor
Meeta Rajani
Lead Technical Editor
Arvind Koul
Technical Editors
Dennis John
Adrian Raposo
Gaurav Thingalaya
Project Coordinator
Wendell Palmer
Proofreader
Kevin McGowan
Indexer
Monica Ajmera Mehta
Graphics
Ronak Dhruv
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
About the Author
Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years, in industries ranging from plastics manufacturing to social and e-mail marketing, and most recently in software at Red Hat, Inc. He has been very active in Pentaho and other open source communities to learn, share, and help newcomers with best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.
First and foremost, thank you Christina for being there for me before, during, and after taking on the challenge of writing and revising a book. I know it's not been easy, but thank you for allowing me the opportunity. To my grandmother, thank you for teaching me at a young age to always go for goals that may just be out of reach. Finally, this book would be nowhere without the Pentaho community and the friends I've made over the years as a part of it.
Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.
He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.
I'd like to thank my lovely kids, Camila and Nicolas, who understood that I couldn't share with them the usual video game sessions during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.
María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's degree in Computer Science at UNLP in La Plata; after that she did a postgraduate course in Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has been living since 1994.
She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using Pentaho Suite. Currently, she works for Webdetails, one of the main Pentaho contributors. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide published by Packt Publishing in April 2010.
You can follow her on Twitter at @mariacroldan.
I'd like to thank those who have encouraged me to write this book: on one hand, the Pentaho community, who gave me rewarding feedback after the Beginner's book; on the other, my husband, who, without hesitation, agreed to write the book with me. Without them, I'm not sure I would have embarked on a new book project.
I'd also like to thank the technical reviewers for the time and dedication that they have put in reviewing the book. In particular, thanks to my colleagues at Webdetails; it's a pleasure and a privilege to work with them every day.
About the Reviewers
Wesley Seidel Carvalho got his Master's degree in Computer Science from the Institute of Mathematics and Statistics, University of São Paulo (IME-USP), Brazil, where his dissertation research focused on Natural Language Processing (NLP) for the Portuguese language. He is a Database Specialist from the Federal University of Pará (UFPa) and has a degree in Mathematics from the State University of Pará (Uepa).
Since 2010, he has been working with Pentaho and researching open government data. He is an active member of the Free Software, Open Data, and Pentaho communities and mailing lists in Brazil, contributing to the CoGrOO grammar checker for OpenOffice and to the CoGrOO Community.
He has worked with technology, databases, and systems development since 1997, with Business Intelligence since 2003, and with Pentaho and NLP since 2009. He currently serves his customers through his startups:
http://intelidados.com.br
http://ltasks.com.br
Daniel Lemire has a B.Sc. and an M.Sc. in Mathematics from the University of Toronto, and a Ph.D. in Engineering Mathematics from the École Polytechnique and the Université de Montréal. He is a Computer Science professor at TELUQ (Université du Québec), where he teaches primarily online. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval, and high performance programming. He blogs regularly on computer science at http://lemire.me/blog/.
Coty Sutherland was first introduced to computing around the age of 10. At that time, he was immersed in various aspects of computers and it became apparent that he had a propensity for software manipulation. From then until now, he has stayed involved in learning new things in the software space and adapting to the ever-changing environment of software development. He graduated from Appalachian State University in 2009 with a Bachelor's degree in Computer Science. After graduation, he focused mainly on software application development and support, but recently transitioned to the Business Intelligence field to pursue new and exciting things with data. He is currently employed by the open source company Red Hat as a Business Intelligence Engineer.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
Pentaho Data Integration (also known as Kettle) is one of the leading open source data integration solutions. With Kettle, you can take data from a multitude of sources, transform and conform the data to given requirements, and load the data into just as many target systems. Not only is PDI capable of transforming and cleaning data, it also provides an ever-growing number of plugins to augment what is already a very robust list of features.
Pentaho Data Integration Cookbook Second Edition picks up where the first edition left off, updating the recipes to the latest edition of PDI and diving into new topics such as working with Big Data and cloud sources, data analytics, and more.
Pentaho Data Integration Cookbook Second Edition shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to help you find quick solutions to your needs. The book starts by showing you how to work with data sources such as files and relational databases. It then moves on to working with data streams: merging data from different sources, taking advantage of the different tools to clean up and transform data, and building nested jobs and transformations. More advanced topics are also covered, such as data analytics, data visualization, plugins, and the integration of Kettle with other tools in the Pentaho suite.
Pentaho Data Integration Cookbook Second Edition provides recipes with easy step-by-step instructions to accomplish specific tasks. The code for the recipes can be adapted and built upon to meet individual needs.
What this book covers
Chapter 1, Working with Databases, shows you how to work with relational databases in Kettle. The recipes show you how to create and share database connections, perform typical database operations (select, insert, update, and delete), as well as more advanced tricks such as building and executing queries at runtime.
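The "queries built at runtime" idea from Chapter 1 can be sketched outside Kettle as well. The following Python snippet is illustrative only: the table, columns, and data are made up for the example, and in Kettle you would accomplish this with a Table input step using parameters or variable substitution rather than hand-written code. It assembles a SELECT statement at run time while keeping the value as a bound parameter:

```python
import sqlite3

# Minimal sketch: a query whose WHERE column is chosen at runtime.
# The sample table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Dracula", "horror"), ("Carrie", "horror"), ("Emma", "romance")],
)

def select_by(column, value):
    # The column name is checked against a whitelist before being
    # interpolated; the value always travels as a bound parameter.
    if column not in {"title", "genre"}:
        raise ValueError(column)
    query = f"SELECT title FROM books WHERE {column} = ? ORDER BY title"
    return [row[0] for row in conn.execute(query, (value,))]

print(select_by("genre", "horror"))  # ['Carrie', 'Dracula']
```

The same separation applies in Kettle: variable substitution changes the shape of the SQL, while `?` parameters carry the data values safely.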
Chapter 2, Reading and Writing Files, not only shows you how to read and write files, but also how to work with semi-structured files, and read data from Amazon Web Services.
Chapter 3, Working with Big Data and Cloud Sources, covers how to load and read data from some of the many different NoSQL data sources as well as from Salesforce.com.
Chapter 4, Manipulating XML Structures, shows you how to read, write, and validate XML. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds.
Chapter 5, File Management, demonstrates how to copy, move, transfer, and encrypt files and directories.
Chapter 6, Looking for Data, shows you how to search for information through various methods via databases, web services, files, and more. This chapter also shows you how to validate data with Kettle's built-in validation steps.
Chapter 7, Understanding and Optimizing Data Flows, details how Kettle moves data through jobs and transformations and how to optimize data flows.
Chapter 8, Executing and Re-using Jobs and Transformations, shows you how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.
Chapter 9, Integrating Kettle and the Pentaho Suite, works with some of the other tools in the Pentaho suite to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more.
Chapter 10, Getting the Most Out of Kettle, works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle's database repository.
Chapter 11, Utilizing Visualization Tools in Kettle, explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.
Chapter 12, Data Analytics, shows you how to work with the various analytical tools built into Kettle, focusing on statistics gathering steps and building datasets for Weka.
Appendix A, Data Structures, shows the different data structures used throughout the book.
Appendix B, References, provides a list of books and other resources that will help you connect with the rest of the Pentaho community and learn more about Kettle and the other tools that are part of the Pentaho suite.
What you need for this book
PDI is written in Java. Any operating system that can run a JVM, version 1.5 or higher, should be able to run PDI. Some of the recipes require additional software, as listed below:
Hortonworks Sandbox: This is Hadoop in a box, a great environment to learn how to work with NoSQL solutions without having to install everything.
Web Server with ASP support: This is needed for two recipes to show how to work with web services.
DataCleaner: This is one of the top open source data profiling tools and integrates with Kettle.
MySQL: All the relational database recipes have scripts for MySQL provided. Feel free to use another relational database for those recipes.
In addition, it's recommended to have access to Excel or Calc and a decent text editor (like Notepad++ or gedit).
Having access to an Internet connection will be useful for some of the recipes that use cloud services, as well as making it possible to access the additional links that provide more information about given topics throughout the book.
Who this book is for
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you! This book does not cover the basics of PDI, SQL, database theory, data profiling, and data analytics.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Copy the .jar file containing the driver to the lib directory inside the Kettle installation directory.
A block of code is set as follows:
lastname,firstname,country,birthyear
Larsson,Stieg,Swedish,1954
King,Stephen,American,1947
Hiaasen,Carl,American,1953
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: clicking the Next button moves you to the next screen.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Working with Databases
In this chapter, we will cover:
Connecting to a database
Getting data from a database
Getting data from a database by providing parameters
Getting data from a database by running a query built at runtime
Inserting or updating rows in a table
Inserting new rows when a simple primary key has to be generated
Inserting new rows when the primary key has to be generated based on stored values
Deleting data from a table
Creating or altering a table from PDI (design time)
Creating or altering a table from PDI (runtime)
Inserting, deleting, or updating a table depending on a field
Changing the database connection at runtime
Loading a parent-child table
Building SQL queries via database metadata
Performing repetitive database design tasks from PDI
Introduction
Databases are broadly used by organizations to store and administer transactional data such as customer service history, bank transactions, purchases, sales, and so on. They are also used to store data warehouse data used for Business Intelligence solutions.
In this chapter, you will learn to deal with databases in Kettle. The first recipe tells you how to connect to a database, which is a prerequisite for all the other recipes. The rest of the chapter teaches you how to perform different operations and can be read in any order according to your needs.
Note
The focus of this chapter is on relational databases (RDBMS). Thus, the term database is used as a synonym for relational database throughout the recipes.
Sample databases
Throughout the chapter you will use a couple of sample databases. Those databases can be created and loaded by running the scripts available at the book's website. The scripts are ready to run under MySQL.
Note
If you work with a different DBMS, you may have to modify the scripts slightly.
For more information about the structure of the sample databases and the meaning of the tables and fields, please refer to Appendix A, Data Structures. Feel free to adapt the recipes to different databases. You could try some well-known databases; for example, Foodmart (available as part of the Mondrian distribution at http://sourceforge.net/projects/mondrian/) or the MySQL sample databases (available at http://dev.mysql.com/doc/index-other.html).
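If you do want to adapt the scripts to a different database, the pattern is always the same: create the tables, then load the sample rows. As an illustration only, here is a minimal sketch using SQLite (so it runs with no server installed); the `authors` table and its columns are taken from the sample data shown in the Preface, not from the book's actual MySQL scripts.

```python
import sqlite3

# Open an in-memory SQLite database; for the book's recipes you would
# point your DBMS client at the scripts from the book's website instead.
conn = sqlite3.connect(":memory:")

# Create a table matching the sample data (lastname, firstname,
# country, birthyear) shown earlier in this Preface.
conn.execute("""
    CREATE TABLE authors (
        lastname  TEXT,
        firstname TEXT,
        country   TEXT,
        birthyear INTEGER
    )
""")

# Load the sample rows.
rows = [
    ("Larsson", "Stieg", "Swedish", 1954),
    ("King", "Stephen", "American", 1947),
    ("Hiaasen", "Carl", "American", 1953),
]
conn.executemany("INSERT INTO authors VALUES (?, ?, ?, ?)", rows)

# Verify the load.
count = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
print(count)  # 3
```

The same two steps (DDL, then inserts) are what the provided MySQL scripts perform, which is why porting them to another DBMS usually only requires adjusting data types and quoting.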
Pentaho BI platform databases
As part of the sample databases used in this chapter you will use the Pentaho BI platform Demo databases. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore the capabilities of the Pentaho platform. It relies on the following databases:
By default, all those databases are stored in Hypersonic (HSQLDB). The script for creating the databases in HSQLDB can be found at http://sourceforge.net/projects/pentaho/files. Under Business Intelligence Server | 1.7.1-stable, look for pentaho_sample_data-1.7.1.zip. While there are newer versions of the actual Business Intelligence Server, they all use the same sample dataset.
These databases can be stored in other DBMSs as well. Scripts for creating and loading these databases in other popular DBMSs, for example MySQL or Oracle, can be found in