Showing posts with label etl. Show all posts
Showing posts with label etl. Show all posts

Tuesday, February 1, 2011

An External Table Definition

Yes, I know these examples are everywhere...but this is:
1. The first time I had used fixed-width column specifications
2. NULLIF

I'll trim it down for the post, but you can find the file here.

I may have mentioned before, but I'm working on putting together a dimensional model for an old Florida Medicaid database. The goal is to provide this to researchers so that they can, umm, research, Autism. Of course this one is near-and-dear to my heart.

So here we go...
CREATE TABLE ext_claim
(
recipient_id VARCHAR2(12),
month_of VARCHAR2(4),
bucket VARCHAR2(2),
date_of_service DATE,
date_of_payment DATE,
claim_count INTEGER,
units INTEGER,
...snip...
ndc_code VARCHAR2(11),
record_type VARCHAR2(1),
county VARCHAR2(2)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY ag_file_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE
FIELDS
(
recipient_id (1:12) CHAR(12),
month_of (13:16) CHAR(4),
bucket (17:18) CHAR(2),
date_of_service (19:26) CHAR(8) DATE_FORMAT DATE mask "YYYYMMDD",
date_of_payment (27:34) CHAR(8) DATE_FORMAT DATE mask "YYYYMMDD",
claim_count (35:35) CHAR(1),
units (36:40) CHAR(5) NULLIF( units = '*****' ),
...snip...
ndc_code (143:153) CHAR(11),
record_type (154:154) CHAR(1),
county (155:156) CHAR(2)
)
)
LOCATION ('CLAIM_200811.txt')
);
As to the fix-width columns, that was relatively easy to figure out, I just followed the docs.

After creating the definition, I would issue a SELECT COUNT(*) FROM ext_claim to see how many records came back (and to see if any of them didn't match my specification). At first I just changed all the INTEGER, DATE and NUMBER fields to VARCHAR2 data types, it was easier. Of course that's cheating and I have about 100 of these files to load.

At first I wondered if I could include an Oracle function like REPLACE, but searching through the docs didn't return anything, nor did The Google Machine. I looked at the list of Reserved Words though, and found NULLIF. Could it be that easy?

Sure enough it was. After the column specification I just added
NULLIF( units = '*****' )
and it worked. Too easy.

Thursday, January 27, 2011

Google Refine

I can't remember exactly where I found this, probably Twitter.

From the website:
Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

I installed it this afternoon and played around with it a little bit. I was hoping to use it more for analysis, but I just didn't understand what it was built for.

Basically, it allows you to clean up sets of data. You know, you get an excel file from a customer and you want to make sure all instances of State match. Usually, you get a few different variations of it.

- FL
- Fl
- Florida
- FLORIDA
- florida
- FLOR.

I am sure I could go on and on. But you can select all those values and then update them with a single value. Yes, I know you could update those values in a single SQL statement. Perhaps you don't have time to create the table and load the data. This is a simple tool to allow you to do some basic data cleansing.

Check out the video to get a better example of how to use it. Cool stuff.