Thursday, December 20, 2012

PDI: Pass Parameters to Jobs/Transformations

I had been working on trying to get a process to run for each file. I used the Get File Names step followed by the Copy rows to result step. I had placed this in front of my Text file input step, which is where you define the file for further processing.

That method produced a stream (that's what it's called in PDI) with each and every file and each and every record in those files. If I were just loading that into a table, it would have worked. However, I was assigning an identifier to each file using a database sequence. I needed a new sequence value for each file, but I wasn't getting one.
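For reference, the per-file pattern I was after looks roughly like this (a sketch of the idea, not an exact recreation of my job; the names are mine):

```
Job: process_files
 ├─ Transformation 1: get_files
 │     Get File Names → Copy rows to result
 └─ Transformation 2: load_one_file      (executed once per result row)
       Text file input → assign id from DB sequence → load
```

The difference is that Transformation 2 runs once per file, so the sequence fires once per file instead of once for the whole stream.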

With some help and pointers from the ##pentaho IRC channel, I found this post (more on that one in the future), Run Kettle Job for each Row. I downloaded the sample provided to see how it worked.

The calc dates transformation just generates a lot of rows. Not much to see there. The magic, at least for me, was in the run for each row job entry.
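If you peek at the job's XML (the .kjb file), that behavior corresponds to the "Execute for every input row?" option on the transformation job entry. If I'm reading the sample right, it shows up as something like this (trimmed to the relevant bits):

```xml
<entry>
  <name>run for each row</name>
  <type>TRANS</type>
  <!-- the "Execute for every input row?" checkbox -->
  <exec_per_row>Y</exec_per_row>
</entry>
```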

Specifically, the Write to log step. (I have this need to see things since I don't understand everything about the tool yet; Write to log gives me that ability.)

See date, or better, ${date}? That's how you reference parameters and variables.

I ran the job and watched the date scroll by. Nice. Then I tried to plug it into my job.

Zippo. Instead of seeing, "this is my filename: /data/pentaho/blah/test.csv" in the log output, I just saw "this is my filename:" Ugh. I went back to the sample and plugged in my stuff. It worked. Yay. Went back to mine, it didn't. Gah! I tried changing the names, then I'd just see "this is my filename: ${new_parameter_name}" so it wasn't resolving to the value.

Finally...after comparing XML for the sample file and mine and finding no real differences, I just about gave up.

One last gasp, though: I went to the IRC channel and asked if there was some way to see the job or transformation settings. No one was home. I tried right-clicking to bring up the context menu, and there it was: Job Settings.

Job Settings brought up this one:

date is defined there. I checked mine. Nothing defined. Added filename to mine, ran it, Success!
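For the record, defining a parameter in Job Settings adds a block like this to the .kjb XML (filename here stands in for whatever parameter you define; the description is mine):

```xml
<parameters>
  <parameter>
    <name>filename</name>
    <default_value/>
    <description>File currently being processed</description>
  </parameter>
</parameters>
```

Once it's declared, you can also pass a value in from the command line when running the job with Kitchen, e.g. `kitchen.sh -file=myjob.kjb "-param:filename=/data/pentaho/blah/test.csv"` (if I have the -param syntax right).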

Wednesday, December 19, 2012

Learning Pentaho Data Integrator

aka Kettle, aka PDI.

I've recently taken on a data integration gig and I'll be using Pentaho Data Integrator (PDI). I've read about Pentaho for years but never got around to actually using it. I'm excited about the opportunity.

What is PDI?

...delivers powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach. With an intuitive, graphical, drag and drop design environment, and a proven, scalable, standards-based architecture, Pentaho Data Integration is increasingly the choice for organizations over traditional, proprietary ETL or data integration tools.

I'll be using the enterprise edition (EE), which is supported, similar to how Red Hat works...I think.

This post is mainly for me, naturally. I'm going to list out the references I've found so far and add to it over time. Similar to what I do (err, did) for Learning OBIEE.

Actually, I'll just start with the helpful email I received after being added to the account.

I love that you can download and play with the software yourself. Of course the Community Edition (CE) is open source, so that makes sense. I'm not sure if you can get the EE version for free though.

There's a community page as well with links to a lot of great resources. So far, my favorite has to be the IRC channel hosted at freenode. Note that there are two hash signs, as in ##pentaho. I've been lurking there for a few weeks and finally got up the nerve (what? me shy?) last week. HansVA and mburgess_pdi helped get me moving again on a particular problem. Good stuff.

I'm sure I'll add more as time goes on. That's it for now.

Update after original posting...
  • (Kettle, aka PDI) Wiki