Pentaho Data Integration Cookbook - Second Edition

About this ebook

Pentaho Data Integration Cookbook Second Edition is written in a cookbook format, presenting examples in the style of recipes. This allows you to go directly to your topic of interest, or follow topics throughout a chapter to gain thorough, in-depth knowledge. The book is designed for developers who are familiar with the basics of Kettle but wish to move up to the next level. It is also aimed at advanced users who want to learn how to use the new features of PDI as well as best practices for working with Kettle.
Language: English
Release date: Dec 2, 2013
ISBN: 9781783280681

    Table of Contents

    Pentaho Data Integration Cookbook Second Edition

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers and more

    Why Subscribe?

    Free Access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Working with Databases

    Introduction

    Sample databases

    Pentaho BI platform databases

    Connecting to a database

    Getting ready

    How to do it...

    How it works...

    There's more...

    Avoiding creating the same database connection over and over again

    Avoiding modifying jobs and transformations every time a connection changes

    Specifying advanced connection properties

    Connecting to a database not supported by Kettle

    Checking the database connection at runtime

    Getting data from a database

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Getting data from a database by providing parameters

    Getting ready

    How to do it...

    How it works...

    There's more...

    Parameters coming in more than one row

    Executing the SELECT statement several times, each for a different set of parameters

    See also

    Getting data from a database by running a query built at runtime

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Inserting or updating rows in a table

    Getting ready

    How to do it...

    How it works...

    There's more...

    Alternative solution if you just want to insert records

    Alternative solution if you just want to update rows

    Alternative way for inserting and updating

    See also

    Inserting new rows where a simple primary key has to be generated

    Getting ready

    How to do it...

    How it works...

    There's more...

    Using the Combination lookup/update for looking up

    See also

    Inserting new rows where the primary key has to be generated based on stored values

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Deleting data from a table

    Getting ready

    How to do it...

    How it works...

    See also

    Creating or altering a database table from PDI (design time)

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Creating or altering a database table from PDI (runtime)

    How to do it...

    How it works...

    There's more...

    See also

    Inserting, deleting, or updating a table depending on a field

    Getting ready

    How to do it...

    How it works...

    There's more...

    Insert, update, and delete all-in-one

    Synchronizing after merge

    See also

    Changing the database connection at runtime

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Loading a parent-child table

    Getting ready

    How to do it...

    How it works...

    See also

    Building SQL queries via database metadata

    Getting ready

    How to do it...

    How it works...

    See also

    Performing repetitive database design tasks from PDI

    Getting ready

    How to do it...

    How it works...

    See also

    2. Reading and Writing Files

    Introduction

    Reading a simple file

    Getting ready

    How to do it...

    How it works...

    There's more...

    Alternative notation for a separator

    About file format and encoding

    About data types and formats

    Altering the names, order, or metadata of the fields coming from the file

    Reading files with fixed width fields

    Reading several files at the same time

    Getting ready

    How to do it...

    How it works...

    There's more...

    Reading semi-structured files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Master/detail files

    Logfiles

    See also

    Reading files having one field per row

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Reading files with some fields occupying two or more rows

    Getting ready

    How to do it...

    How it works...

    See also

    Writing a simple file

    Getting ready

    How to do it...

    How it works...

    There's more...

    Changing headers

    Giving the output fields a format

    Writing a semi-structured file

    Getting ready

    How to do it...

    How it works...

    There's more...

    Providing the name of a file (for reading or writing) dynamically

    Getting ready

    How to do it...

    How it works...

    There's more...

    Get System Info

    Generating several files simultaneously with the same structure, but different names

    Using the name of a file (or part of it) as a field

    Getting ready

    How to do it...

    How it works...

    Reading an Excel file

    Getting ready

    How to do it...

    How it works...

    See also

    Getting the value of specific cells in an Excel file

    Getting ready

    How to do it...

    How it works...

    There's more...

    Looking for a given cell

    Writing an Excel file with several sheets

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Writing an Excel file with a dynamic number of sheets

    Getting ready

    How to do it...

    How it works...

    See also

    Reading data from an AWS S3 Instance

    Getting ready

    How to do it...

    How it works...

    See also

    3. Working with Big Data and Cloud Sources

    Introduction

    Loading data into Salesforce.com

    Getting ready

    How to do it...

    How it works...

    See also

    Getting data from Salesforce.com

    Getting ready

    How to do it...

    How it works...

    See also

    Loading data into Hadoop

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Getting data from Hadoop

    Getting ready

    How to do it...

    How it works...

    See also

    Loading data into HBase

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Getting data from HBase

    Getting ready

    How to do it...

    How it works...

    See also

    Loading data into MongoDB

    Getting ready

    How to do it...

    How it works...

    See also

    Getting data from MongoDB

    Getting ready

    How to do it...

    How it works...

    See also

    4. Manipulating XML Structures

    Introduction

    Reading simple XML files

    Getting ready

    How to do it...

    How it works...

    There's more...

    XML data in a field

    XML file name in a field

    See also

    Specifying fields by using the Path notation

    Getting ready

    How to do it...

    How it works...

    There's more...

    Getting data from a different path

    Getting data selectively

    Getting more than one node when the nodes share their Path notation

    Saving time when specifying Path

    Validating well-formed XML files

    Getting ready

    How to do it...

    How it works...

    See also

    Validating an XML file against DTD definitions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Validating an XML file against an XSD schema

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Generating a simple XML document

    Getting ready

    How to do it...

    How it works...

    There's more...

    Generating fields with XML structures

    See also

    Generating complex XML structures

    Getting ready

    How to do it...

    How it works...

    See also

    Generating an HTML page using XML and XSL transformations

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Reading an RSS Feed

    Getting ready

    How to do it...

    How it works...

    See also

    Generating an RSS Feed

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    5. File Management

    Introduction

    Copying or moving one or more files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Moving files

    Detecting the existence of the files before copying them

    Creating folders

    See also

    Deleting one or more files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Figuring out which files have been deleted

    See also

    Getting files from a remote server

    How to do it...

    How it works...

    There's more...

    Specifying files to transfer

    Some considerations about connecting to an FTP server

    Access via SFTP

    Access via FTPS

    Getting information about the files being transferred

    See also

    Putting files on a remote server

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Copying or moving a custom list of files

    Getting ready

    How to do it...

    How it works...

    See also

    Deleting a custom list of files

    Getting ready

    How to do it...

    How it works...

    See also

    Comparing files and folders

    Getting ready

    How to do it...

    How it works...

    There's more...

    Comparing folders

    Working with ZIP files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Avoiding zipping files

    Avoiding unzipping files

    See also

    Encrypting and decrypting files

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    6. Looking for Data

    Introduction

    Looking for values in a database table

    Getting ready

    How to do it...

    How it works...

    There's more...

    Taking some action when the lookup fails

    Taking some action when there are too many results

    Looking for non-existent data

    See also

    Looking for values in a database with complex conditions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Looking for values in a database with dynamic queries

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Looking for values in a variety of sources

    Getting ready

    How to do it...

    How it works...

    There's more...

    Looking for alternatives when the Stream Lookup step doesn't meet your needs

    Speeding up your transformation

    Using the Value Mapper step for looking up from a short list of values

    See also

    Looking for values by proximity

    Getting ready

    How to do it...

    How it works...

    There's more...

    Looking for values by using a web service

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Looking for values over intranet or the Internet

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Validating data at runtime

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    7. Understanding and Optimizing Data Flows

    Introduction

    Splitting a stream into two or more streams based on a condition

    Getting ready

    How to do it...

    How it works...

    There's more...

    Avoiding the use of Dummy steps

    Comparing against the value of a Kettle variable

    Avoiding the use of nested Filter rows steps

    Overcoming the difficulties of complex conditions

    Merging rows of two streams with the same or different structures

    Getting ready

    How to do it...

    How it works...

    There's more...

    Making sure that the metadata of the streams is the same

    Telling Kettle how to merge the rows of your streams

    See also

    Adding checksums to verify datasets

    Getting ready

    How to do it...

    How it works...

    Comparing two streams and generating differences

    Getting ready

    How to do it...

    How it works...

    There's more...

    Using the differences to keep a table up-to-date

    See also

    Generating all possible pairs formed from two datasets

    How to do it...

    How it works...

    There's more...

    Getting variables in the middle of the stream

    Limiting the number of output rows

    See also

    Joining two or more streams based on given conditions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Interspersing new rows between existent rows

    Getting ready

    How to do it...

    How it works...

    See also

    Executing steps even when your stream is empty

    Getting ready

    How to do it...

    How it works...

    There's more...

    Processing rows differently based on the row number

    Getting ready

    How to do it...

    How it works...

    There's more...

    Identifying specific rows

    Identifying the last row in the stream

    Avoiding using an Add sequence step to enumerate the rows

    See also

    Processing data into shared transformations via filter criteria and subtransformations

    Getting ready

    How to do it...

    How it works...

    See also

    Altering a data stream with Select values

    How to do it...

    How it works...

    Processing multiple jobs or transformations in parallel

    How to do it...

    How it works...

    See also

    8. Executing and Re-using Jobs and Transformations

    Introduction

    Sample transformations

    Sample transformation – hello

    Sample transformation – random list

    Sample transformation – sequence

    Sample transformation – file list

    Launching jobs and transformations

    How to do it...

    How it works...

    Executing a job or a transformation by setting static arguments and parameters

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Executing a job or a transformation from a job by setting arguments and parameters dynamically

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Executing a job or a transformation whose name is determined at runtime

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Executing part of a job once for every row in a dataset

    Getting ready

    How to do it...

    How it works...

    There's more...

    Accessing the copied rows from jobs, transformations, and other entries

    Executing a transformation once for every row in a dataset

    Executing a transformation or part of a job once for every file in a list of files

    See also

    Executing part of a job several times until a condition is true

    Getting ready

    How to do it...

    How it works...

    There's more...

    Implementing loops in a job

    Using the JavaScript step to control the execution of the entries in your job

    See also

    Creating a process flow

    Getting ready

    How to do it...

    How it works...

    There's more...

    Serializing/De-serializing data

    Other means for transferring or sharing data between transformations

    Moving part of a transformation to a subtransformation

    Getting ready

    How to do it...

    How it works...

    There's more...

    Using Metadata Injection to re-use transformations

    Getting ready

    How to do it...

    How it works...

    There's more...

    9. Integrating Kettle and the Pentaho Suite

    Introduction

    A sample transformation

    Creating a Pentaho report with data coming from PDI

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating a Pentaho report directly from PDI

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Configuring the Pentaho BI Server for running PDI jobs and transformations

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Executing a PDI transformation as part of a Pentaho process

    Getting ready

    How to do it...

    How it works...

    There's more...

    Specifying the location of the transformation

    Supplying values for named parameters, variables and arguments

    Keeping things simple when it's time to deliver a plain file

    See also

    Executing a PDI job from the Pentaho User Console

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Generating files from the PUC with PDI and the CDA plugin

    Getting ready

    How to do it...

    How it works...

    There's more...

    Populating a CDF dashboard with data coming from a PDI transformation

    Getting ready

    How to do it...

    How it works...

    There's more...

    10. Getting the Most Out of Kettle

    Introduction

    Sending e-mails with attached files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Sending logs through an e-mail

    Sending e-mails in a transformation

    Generating a custom logfile

    Getting ready

    How to do it...

    How it works...

    There's more...

    Filtering the logfile

    Creating a clean logfile

    Isolating logfiles for different jobs or transformations

    See also

    Running commands on another server

    Getting ready

    How to do it...

    How it works...

    See also

    Programming custom functionality

    Getting ready

    How to do it...

    How it works...

    There's more...

    Data type's equivalence

    Generalizing your UDJC code

    Looking up information with additional steps

    Customizing logs

    Scripting alternatives to the UDJC step

    Generating sample data for testing purposes

    How to do it...

    How it works...

    There's more...

    Using a Data grid step to generate specific data

    Working with subsets of your data

    See also

    Working with JSON files

    Getting ready

    How to do it...

    How it works...

    There's more...

    Reading JSON files dynamically

    Writing JSON files

    Getting information about transformations and jobs (file-based)

    Getting ready

    How to do it...

    How it works...

    There's more...

    Job XML nodes

    Steps and entries information

    See also

    Getting information about transformations and jobs (repository-based)

    Getting ready

    How to do it...

    How it works...

    There's more...

    Transformation tables

    Job tables

    Database connections tables

    Using Spoon's built-in optimization tools

    Getting ready

    How to do it...

    How it works...

    There's more...

    11. Utilizing Visualization Tools in Kettle

    Introduction

    Managing plugins with the Marketplace

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Data profiling with DataCleaner

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Visualizing data with AgileBI

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Using Instaview to analyze and visualize data

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    12. Data Analytics

    Introduction

    Reading data from a SAS datafile

    Why read a SAS file?

    Getting ready

    How to do it...

    How it works...

    See also

    Studying data via stream statistics

    Getting ready

    How to do it...

    How it works...

    See also

    Building a random data sample for Weka

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    A. Data Structures

    Books data structure

    Books

    Authors

    museums data structure

    museums

    cities

    outdoor data structure

    products

    categories

    Steel Wheels data structure

    Lahman Baseball Database

    B. References

    Books

    Online

    Index

    Pentaho Data Integration Cookbook Second Edition

    Copyright © 2013 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: June 2011

    Second Edition: November 2013

    Production Reference: 1151113

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78328-067-4

    www.packtpub.com

    Cover Image by Aniket Sawant (<aniket_sawant_photography@hotmail.com>)

    Credits

    Authors

    Alex Meadows

    Adrián Sergio Pulvirenti

    María Carina Roldán

    Reviewers

    Wesley Seidel Carvalho

    Daniel Lemire

    Coty Sutherland

    Acquisition Editor

    Meeta Rajani

    Lead Technical Editor

    Arvind Koul

    Technical Editors

    Dennis John

    Adrian Raposo

    Gaurav Thingalaya

    Project Coordinator

    Wendell Palmer

    Proofreader

    Kevin McGowan

    Indexer

    Monica Ajmera Mehta

    Graphics

    Ronak Dhruv

    Production Coordinator

    Nilesh R. Mohite

    Cover Work

    Nilesh R. Mohite

    About the Author

    Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years, in industries such as plastics manufacturing, social and e-mail marketing, and, most recently, software at Red Hat, Inc. He has been very active in Pentaho and other open source communities, learning, sharing, and helping newcomers with best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.

    First and foremost, thank you Christina for being there for me before, during, and after taking on the challenge of writing and revising a book. I know it's not been easy, but thank you for allowing me the opportunity. To my grandmother, thank you for teaching me at a young age to always go for goals that may just be out of reach. Finally, this book would be nowhere without the Pentaho community and the friends I've made over the years as a part of it.

    Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Science at UBA, one of the most prestigious universities in South America.

    He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.

    I'd like to thank my lovely kids, Camila and Nicolas, who understood that I couldn't share with them the usual video game sessions during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.

    María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's degree in Computer Science at UNLP in La Plata; after that she did a postgraduate course in Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has been living since 1994.

    She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently, she works for Webdetails, one of the main Pentaho contributors. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide, published by Packt Publishing in April 2010.

    You can follow her on Twitter at @mariacroldan.

    I'd like to thank those who have encouraged me to write this book: on one hand, the Pentaho community, who gave me rewarding feedback after the Beginner's book; on the other hand, my husband, who without hesitation agreed to write the book with me. Without them I'm not sure I would have embarked on a new book project.

    I'd also like to thank the technical reviewers for the time and dedication that they have put in reviewing the book. In particular, thanks to my colleagues at Webdetails; it's a pleasure and a privilege to work with them every day.

    About the Reviewers

    Wesley Seidel Carvalho got his Master's degree in Computer Science from the Institute of Mathematics and Statistics, University of São Paulo (IME-USP), Brazil, where his dissertation research focused on Natural Language Processing (NLP) for the Portuguese language. He is a Database Specialist from the Federal University of Pará (UFPa) and holds a degree in Mathematics from the State University of Pará (Uepa).

    Since 2010, he has been working with Pentaho and researching open government data. He is an active member of the Free Software, Open Data, and Pentaho communities and mailing lists in Brazil, contributing to CoGrOO, the grammar checker for OpenOffice, and the CoGrOO Community.

    He has worked with technology, databases, and systems development since 1997 and with Business Intelligence since 2003, and has been involved with Pentaho and NLP since 2009. He currently serves his customers through his startups:

    http://intelidados.com.br

    http://ltasks.com.br

    Daniel Lemire has a B.Sc. and a M.Sc. in Mathematics from the University of Toronto, and a Ph.D. in Engineering Mathematics from the Ecole Polytechnique and the Université de Montréal. He is a Computer Science professor at TELUQ (Université du Québec), where he teaches primarily online. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles. He has held competitive research grants for the last 15 years. He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval, and high performance programming. He blogs regularly on computer science at http://lemire.me/blog/.

    Coty Sutherland was first introduced to computing around the age of 10. At that time, he was immersed in various aspects of computers and it became apparent that he had a propensity for software manipulation. From then until now, he has stayed involved in learning new things in the software space and adapting to the changing environment that is Software Development. He graduated from Appalachian State University in 2009 with a Bachelor's Degree in Computer Science. After graduation, he focused mainly on software application development and support, but recently transitioned to the Business Intelligence field to pursue new and exciting things with data. He is currently employed by the open source company, Red Hat, as a Business Intelligence Engineer.

    www.PacktPub.com

    Support files, eBooks, discount offers and more

    You might want to visit www.PacktPub.com for support files and downloads related to your book.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    http://PacktLib.PacktPub.com

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. 

    Why Subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print and bookmark content

    On demand and accessible via web browser

    Free Access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

    Preface

    Pentaho Data Integration (also known as Kettle) is one of the leading open source data integration solutions. With Kettle, you can take data from a multitude of sources, transform and conform the data to given requirements, and load the data into just as many target systems. Not only is PDI capable of transforming and cleaning data, it also provides an ever-growing number of plugins to augment what is already a very robust list of features.

    Pentaho Data Integration Cookbook Second Edition picks up where the first edition left off, by updating the recipes to the latest edition of PDI and diving into new topics such as working with Big Data and cloud sources, data analytics, and more.

    Pentaho Data Integration Cookbook Second Edition shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized so that you can quickly find solutions to your needs. The book starts by showing you how to work with data sources such as files, relational databases, and Big Data and cloud sources. It then moves on to working with data streams: merging data from different sources, taking advantage of the different tools to clean up and transform data, and building nested jobs and transformations. More advanced topics are also covered, such as data analytics, data visualization, plugins, and the integration of Kettle with other tools in the Pentaho suite.

    Pentaho Data Integration Cookbook Second Edition provides recipes with easy step-by-step instructions to accomplish specific tasks. The code for the recipes can be adapted and built upon to meet individual needs.

    What this book covers

    Chapter 1, Working with Databases, shows you how to work with relational databases in Kettle. The recipes show you how to create and share database connections and perform typical database functions (select, insert, update, delete), as well as more advanced tricks such as building and executing queries at runtime.

    Chapter 2, Reading and Writing Files, not only shows you how to read and write files, but also how to work with semi-structured files, and read data from Amazon Web Services.

    Chapter 3, Working with Big Data and Cloud Sources, covers how to load and read data from some of the many different NoSQL data sources as well as from Salesforce.com.

    Chapter 4, Manipulating XML Structures, shows you how to read, write, and validate XML. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds.

    Chapter 5, File Management, demonstrates how to copy, move, transfer, and encrypt files and directories.

    Chapter 6, Looking for Data, shows you how to search for information through various methods via databases, web services, files, and more. This chapter also shows you how to validate data with Kettle's built-in validation steps.

    Chapter 7, Understanding and Optimizing Data Flows, details how Kettle moves data through jobs and transformations and how to optimize data flows.

    Chapter 8, Executing and Re-using Jobs and Transformations, shows you how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.

    Chapter 9, Integrating Kettle and the Pentaho Suite, works with some of the other tools in the Pentaho suite to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more.

    Chapter 10, Getting the Most Out of Kettle, works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle's database repository.

    Chapter 11, Utilizing Visualization Tools in Kettle, explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.

    Chapter 12, Data Analytics, shows you how to work with the various analytical tools built into Kettle, focusing on statistics gathering steps and building datasets for Weka.

    Appendix A, Data Structures, shows the different data structures used throughout the book.

    Appendix B, References, provides a list of books and other resources that will help you connect with the rest of the Pentaho community and learn more about Kettle and the other tools that are part of the Pentaho suite.

    What you need for this book

    PDI is written in Java. Any operating system that can run JVM 1.5 or higher should be able to run PDI. Some of the recipes will require other software, as listed:

    Hortonworks Sandbox: This is Hadoop in a box, a great environment to learn how to work with NoSQL solutions without having to install everything.

    Web Server with ASP support: This is needed for two recipes to show how to work with web services.

    DataCleaner: This is one of the top open source data profiling tools and integrates with Kettle.

    MySQL: All the relational database recipes have scripts for MySQL provided. Feel free to use another relational database for those recipes.

    In addition, it's recommended to have access to Excel or Calc and a decent text editor (like Notepad++ or gedit).

    Having access to an Internet connection will be useful for some of the recipes that use cloud services, as well as making it possible to access the additional links that provide more information about given topics throughout the book.

    Who this book is for

    If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you! This book does not cover the basics of PDI, SQL, database theory, data profiling, and data analytics.

    Conventions

    In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Copy the .jar file containing the driver to the lib directory inside the Kettle installation directory.

    A block of code is set as follows:

    lastname,firstname,country,birthyear

    Larsson,Stieg,Swedish,1954

    King,Stephen,American,1947

    Hiaasen,Carl,American,1953

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

        City
        Buenos Aires, Argentina

    New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: clicking the Next button moves you to the next screen.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

    To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

    Piracy

    Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors, and our ability to bring you valuable content.

    Questions

    You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

    Chapter 1. Working with Databases

    In this chapter, we will cover:

    Connecting to a database

    Getting data from a database

    Getting data from a database by providing parameters

    Getting data from a database by running a query built at runtime

    Inserting or updating rows in a table

    Inserting new rows when a simple primary key has to be generated

    Inserting new rows when the primary key has to be generated based on stored values

    Deleting data from a table

    Creating or altering a table from PDI (design time)

    Creating or altering a table from PDI (runtime)

    Inserting, deleting, or updating a table depending on a field

    Changing the database connection at runtime

    Loading a parent-child table

    Building SQL queries via database metadata

    Performing repetitive database design tasks from PDI

    Introduction

    Databases are widely used by organizations to store and administer transactional data such as customer service history, bank transactions, purchases, and sales. They are also used to store the data warehouse data that feeds Business Intelligence solutions.

    In this chapter, you will learn to deal with databases in Kettle. The first recipe tells you how to connect to a database, which is a prerequisite for all the other recipes. The rest of the chapter teaches you how to perform different operations and can be read in any order according to your needs.

    Note

    The focus of this chapter is on relational databases (RDBMS). Thus, the term database is used as a synonym for relational database throughout the recipes.

    Sample databases

    Throughout this chapter you will use a couple of sample databases. These databases can be created and loaded by running the scripts available at the book's website. The scripts are ready to run under MySQL.

    Note

    If you work with a different DBMS, you may have to modify the scripts slightly.

    For more information about the structure of the sample databases and the meaning of the tables and fields, please refer to Appendix A, Data Structures. Feel free to adapt the recipes to different databases. You could try some well-known databases; for example, Foodmart (available as part of the Mondrian distribution at http://sourceforge.net/projects/mondrian/) or the MySQL sample databases (available at http://dev.mysql.com/doc/index-other.html).
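    To give an idea of what those scripts look like, and of the MySQL-specific pieces you would need to adapt for another DBMS, here is a minimal, hypothetical sketch in the spirit of the books database (the table and column names are illustrative only; use the actual scripts from the book's website):

        -- Illustrative DDL only; the real scripts are on the book's website.
        CREATE DATABASE IF NOT EXISTS books;
        USE books;

        -- AUTO_INCREMENT and the ENGINE clause are MySQL-specific;
        -- other DBMSs use sequences or IDENTITY columns instead.
        CREATE TABLE authors (
            id_author INT AUTO_INCREMENT PRIMARY KEY,
            lastname  VARCHAR(64) NOT NULL,
            firstname VARCHAR(64) NOT NULL,
            country   VARCHAR(64),
            birthyear INT
        ) ENGINE=InnoDB;

        CREATE TABLE books (
            id_book   INT AUTO_INCREMENT PRIMARY KEY,
            title     VARCHAR(128) NOT NULL,
            id_author INT NOT NULL,
            price     DECIMAL(8,2),
            FOREIGN KEY (id_author) REFERENCES authors (id_author)
        ) ENGINE=InnoDB;

    Porting a script like this to, say, PostgreSQL mostly means replacing AUTO_INCREMENT with SERIAL (or an IDENTITY column) and dropping the ENGINE clause, which is the kind of slight modification referred to above.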

    Pentaho BI platform databases

    As part of the sample databases used in this chapter you will use the Pentaho BI platform Demo databases. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore the capabilities of the Pentaho platform. It relies on the following databases:

    hibernate: administrative information, including user authentication and authorization data

    quartz: the repository for Quartz, the scheduler used by the Pentaho platform

    sampledata: data for Steel Wheels, a fictional company that sells scale models of vehicles

    By default, all those databases are stored in Hypersonic (HSQLDB). The script for creating the databases in HSQLDB can be found at http://sourceforge.net/projects/pentaho/files. Under Business Intelligence Server | 1.7.1-stable, look for pentaho_sample_data-1.7.1.zip. While there are newer versions of the actual Business Intelligence Server, they all use the same sample dataset.

    These databases can be stored in other DBMSs as well. Scripts for creating and loading these databases in other popular DBMSs, for example, MySQL or Oracle, can be found in
