Monday, July 27, 2009

IDs (NUMBER) vs. Codes (VARCHAR2)

A slightly different way of putting that is Surrogate vs. Natural keys.

Not that I really want to get into the entire discussion...but I will...a little.

On Friday having a discussion with a developer friend after I showed him my Party model. It's all about boozing it up. OK, wishful thinking on my part, it's just a database design based on the Party (or Entity) Model. I've pondered it here and here.

I'm not planning on debating the merits of the Party Model. I do like it though, it seems to me a natural end point. It does not take into account testing of the design, nor does it take into account places where you might denormalize...and many choose to denormalize by keeping addresses in line with people or organizations.

On to my point.

To maintain data integrity without using Check constraints, I'll create a small lookup table.
  • ADDRESS_CODES

  • PHONE_CODES

  • EMAIL_CODES

  • STATE_CODES
Perhaps a few others. ADDRESS_CODES, as the name suggests, would use a code as the key, a VARCHAR2. HOME, WORK, OTHER, etc. Same goes for the others.

I posted my query on twitter Friday morning and received a pretty solid response from 4 people.



The response was swift:



Which was exactly what my conversation with my friend had led to.

In this case, why would Natural keys, VARCHAR2s, be bad.

1. If a user enters them in, they could misspell something, CELL would be CLEL or something. Fair enough.
2. If #1 happens, you just fix the name field instead of the key.

So I started to come around to it, until I got to STATES. States have a 2 letter abbreviation. Why not use that? It's not like they'll change (hah!) right?

Perhaps a distinction is in order then. If it's user entered content, then use a surrogate key. If not, use the supplied code.

I don't know. Is this just my personal preference? Is there a standard? What do you do?

8 comments:

jpiwowar said...

>States have a 2 letter abbreviation.
>Why not use that? It's not like
>they'll change (hah!) right?

Depends on the scope of your data model, I guess. German Länder may not conform to your 2-character standard. :-)

I get your point, though, and that's why data modeling can be fun, right up to the point that it makes you want to tear out your hair. I still like the surrogate in this case, but that might be dangerously close to the dreaded Rule Of Thumb...

SydOracle said...

In the case of states, you want each state in there exactly once. So you want a unique constraint on upper(state_name). And maybe some regexp constraint to make sure you don't get 'South Dakota' and 'South-Dakota' in there. The state name may change.

Then I'd have the official US postal service abbreviation as an attribute. There is a good chance you'll end up using it to send out letters, after all.
Checking
http://www.usps.com/ncsc/lookups/abbreviations.html#states
I see that the abbreviation isn't actually unique for the "military states" (armed forces overseas), so that can't be a key.

On the other hand, there's no need for a surrogate key to be numeric (eg GUIDs). If the table has fixed content, there's no requirement for the surrogate to be system generated (eg a sequence).

For an even simpler example, how about the 'codes' M and F for Male and Female. [I've worked with Medical Apps that also had U for Unknown and O for Other, but lets keep it simple.]

Now, say I expect to offer a vaccination to females aged 18. Is it better to code that with
1) a hard-coded value of 'F'
2) a hard-coded value of 'FEMALE', check the surrogate key for that description and use that. Bear in mind, they may change the description to 'Female' and it will break.
3) a hard-coded value of an arbitrary numeric surrogate key.

And we can even argue about the benefit of it being hard-coded as a literal or a parameter/bind variable :)

oraclenerd said...

@jpiwowar

Yeah, tear your hair out is right. How much time do you spend trying to make it bullet-proof? Naturally, obvious things like this...but a better solution might be to get the USPS data and just use that.

oraclenerd said...

@gary

How did I know you were going to say that? I went to that exact link you provided...and conveniently ignored the military codes.

You bring up a very good point though, something that I haven't been quite able to articulate.

Is SELECT * FROM people WHERE gender = 'M' better/worse than SELECT *
FROM people
WHERE gender = ( SELECT gender_id
FROM gender_tables
WHERE gender_desc = 'F' );

????

moleboy said...

I think there's also the issue of consistancy.
So, I have phone types of 1,2,3 instead of Cell, Home, Office.
Cool. Makes sense. I might want to change Office to Work, Cell to Mobile, etc, as you noted in the original post.
In the case of states, yeah, you cna completely use the MA, NH, SD etc, but then you've got some tables designed one way, and others designed another way.
Thats fine if there's a benefit, but there really isn't.

Regarding the gender issue, if you are assuming that you know the 'code' for the gender in the query, then you can just as easily write
SELECT * FROM people WHERE gender = '1'
The question is, do you know the code (F, 1, whatever) or do you only know what it represents?
For example, you might know that, right now, the description for women is FEMALE.
You have no idea what it used to be, (maybe it was W for WOMAN, or, something really strange), but you know now that the description is 'F'or 'FEMALE'.
No matter how off the code itself might be, you know what things are now.
OK, wow, that last bit was a bit unfocused, more coffee please.

oraclenerd said...

@moleboy

I'm not necessarily buying the consistency thing. If it makes sense, the codes are immutable, why not? Makes the query a bit easier to read plus you avoid a possible lookup when inserting records. I think I'm going to have to add a Part II to this one.

Is the devil in the details? I don't know...

Noons said...

Missed this one, but I'll still comment asd it is very pertinent.

Put me squarely on the surrogate key side.

Maybe not for everything. I actually expand the "user-entered=>surrogate, non-user-entered=>natural" selection a bit more:

- displayed in any shape of form, then use surrogate.

- non-displayed (not even in hidden variables), then use natural keys.

Why? Let me give you an example.

An external services org recently made a "confidential" employee survey at our company.

When they published the report with the results, they had a web page with the names of each manager/director as a-links, with the corresponding details page for each person as the resulting page.

What was in the report's page code for each of the hyperlinks? A piece of javascriopt code with the directly coded employee number of each and every director!

Which means now anyone could pick that empno, go to the survey entry screen, type a query in the url with the number and read their survey entry!

Not bad for a "confidential survey", eh?

All because someone forgot to tell coders that employee confidential information used as natural keys and shoved into a public page might be a bad idea...

oraclenerd said...

@noons

Wow.

Wow.

That sounds like a Daily WTF entry to me.

It also sounds like a perfectly reasonable advertisement for surrogate keys.

- non-displayed (not even in hidden variables), then use natural keys.

That's one I was thinking about addressing in a future post. Here's my example, you have a TRANSACTION table, the kind where you store financial transactions (a ledger perhaps?).

You store the absolute value of the currency. But I would want a DEBIT or CREDIT attribute to figure out if it's a postive or negative. Further, is it CASH, CREDIT_CARD, PAYPAL or other? I prefer natural keys on these tables as it makes it easier for the app dev people (PL/SQL or other) to INSERT the data. They don't have to worry about which key it is, they can just put the literal (well, not a magic constant) in there.

These natural keys would be maintained solely by DBA/Developer types, never to be seen by the light of day (minus reporting perhaps).