Monday, December 21, 2009

How Do You Normalize a Tweet?

This is the second post by Mr. Myers; you can read his first one, How To Kill a Code Review, here. I have always liked design topics, and I don't think they are covered enough on the web, which is why I liked this one. I have often wondered what trade-offs designers make for these types of applications (Twitter, Facebook, etc.). Are they even "designed" by a data modeller? Or are they created by application developers? It probably doesn't matter to those companies, as they are successful (in a strange, no-business-model kind of way) and don't, I believe, represent many of the realities that we as Oracle professionals are likely to deal with.

Firstly, I don't tweet. Alex (Gorbachev) mentioned it at a Sydney meetup. I had a look, but didn't get entrenched, and I assume there will be others out there who aren't tweeters. Suffice to say, a 'tweet' is a message broadcast by a Twitter user to Twitter consumers. Tweets are up to 140 characters long.

So what's to normalise? Isn't it just a value? Even Oracle 6 could cope with VARCHAR2(140).

But actually, a tweet isn't just a simple value.
A search for "beer" would turn up all messages that included #beer.
Similarly, the @ sign followed by a username allows users to send messages directly to each other. A message with @example would be directed at the user [example] although it can still be read by anyone.
Source: Wikipedia

First normal form states:
There's no left-to-right ordering to the columns.

Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).

All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].
Source: Wikipedia

The problem is that the tweet "@tom Come for a #beer or #burger. Don't let @harry come" definitely has hidden components, and there is also a sequencing within the message that is just as important.
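As a sketch of what those hidden components look like, a simple regular-expression pass (my own illustration, not Twitter's actual parser) can pull the mentions and tags out of that message:

```python
import re

def parse_tweet(text):
    """Extract a tweet's hidden components: @mentions and #tags, in message order."""
    mentions = re.findall(r'@(\w+)', text)
    tags = re.findall(r'#(\w+)', text)
    return mentions, tags

mentions, tags = parse_tweet(
    "@tom Come for a #beer or #burger. Don't let @harry come")
print(mentions)  # ['tom', 'harry']
print(tags)      # ['beer', 'burger']
```

Note that flattening the message into two lists already loses the sequencing: nothing here records that @harry appears after the "Don't let", which reverses the meaning.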

In a practical implementation, we would probably have the following tables:
TWITTER_USERS      : username (eg @tom), date_joined, email, ...
TAGS               : tag (eg #beer)
TWEETS             : tweet_id (surrogate key), created_by (referencing twitter_user), created_timestamp, tweet_text, ...
TWEET_TAGS         : tweet_id, tag (eg #beer)
TWEET_DESTINATIONS : tweet_id, username (eg @tom)
Our message would have two child tag records (#beer and #burger) and two child destination records (@tom and @harry).
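One possible sketch of that schema, using an in-memory SQLite database purely for illustration (the column types are my assumptions; a real Oracle implementation would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE twitter_users (
    username    TEXT PRIMARY KEY,          -- eg '@tom'
    date_joined TEXT,
    email       TEXT
);
CREATE TABLE tags (
    tag TEXT PRIMARY KEY                   -- eg '#beer'
);
CREATE TABLE tweets (
    tweet_id          INTEGER PRIMARY KEY, -- surrogate key
    created_by        TEXT REFERENCES twitter_users (username),
    created_timestamp TEXT,
    tweet_text        TEXT                 -- up to 140 characters
);
CREATE TABLE tweet_tags (
    tweet_id INTEGER REFERENCES tweets (tweet_id),
    tag      TEXT REFERENCES tags (tag)
);
CREATE TABLE tweet_destinations (
    tweet_id INTEGER REFERENCES tweets (tweet_id),
    username TEXT REFERENCES twitter_users (username)
);
""")

# The example tweet: the text and the child rows carry overlapping information.
# (Sender left NULL; the original example doesn't say who posted it.)
conn.execute("INSERT INTO tweets VALUES (1, NULL, '2009-12-21', "
             "'@tom Come for a #beer or #burger. Don''t let @harry come')")
conn.executemany("INSERT INTO tweet_tags VALUES (?, ?)",
                 [(1, '#beer'), (1, '#burger')])
conn.executemany("INSERT INTO tweet_destinations VALUES (?, ?)",
                 [(1, '@tom'), (1, '@harry')])
conn.commit()
```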

At the logical level, we are not properly normalised, because the tweet_text duplicates information from the child entities, with the potential for inconsistencies between them. The tweet only seems to contain duplicate information; with its ordering and context, it is really something different. Is that just being picky?

I am not suggesting the relational model is wrong, broken, incomplete or inadequate. Quite the reverse, in fact. In this case the value of the model is that it tells us the problems that will arise when we denormalise data.

For example, if @harry deletes his Twitter account (because he was never invited for beers), do we delete the TWEET_DESTINATIONS record that referred to him, or do we keep it and not enforce that referential integrity constraint? If we delete the destination record, we have an inconsistency between the tweet_text attribute and the destination child entities. Or maybe we delete the tweet entity itself and all its children. Those are really choices for the business (possibly with some legal implications, though).
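Each of those business choices maps onto a referential action we would declare on the foreign key. A small sketch (again SQLite for illustration; Oracle likewise offers ON DELETE CASCADE and ON DELETE SET NULL, with no action as the default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE twitter_users (username TEXT PRIMARY KEY);
CREATE TABLE tweet_destinations (
    tweet_id INTEGER,
    username TEXT REFERENCES twitter_users (username)
        ON DELETE CASCADE   -- the 'delete his destination records' choice
);
""")
conn.execute("INSERT INTO twitter_users VALUES ('@harry')")
conn.execute("INSERT INTO tweet_destinations VALUES (1, '@harry')")

# @harry deletes his account; the cascade removes his destination row,
# leaving the (unmodelled here) tweet_text inconsistent with its children.
conn.execute("DELETE FROM twitter_users WHERE username = '@harry'")
print(conn.execute("SELECT count(*) FROM tweet_destinations").fetchone()[0])  # 0
```

Declaring no action instead would simply block the account deletion while the destination row exists, which is the other horn of the dilemma.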

I don't have a solution to the logical model representation, and would be interested in feedback. But not by twitter please :)

4 comments:

mcohen01 said...

How ironic that you chose Twitter for this little exploration. It's ironic because Twitter no longer really uses a database, other than as a backup. Everything is in RAM.

Probably much quicker than many believe, this paradigm of "data in the cloud" will become the norm rather than the exception. And, for better or for worse, the relational model and RDBMS's will become the modern version of COBOL.

SydOracle said...

That was the point. The relational model was devised for repetitively structured data that would be processed automatically. Tweets are less structured and processed 'informally' by people.

While I believe you are correct that, in the future, more data will be held non-relationally, that doesn't mean there will be less data held in relational form. It's just that more data will be held in total.

Formal, automatic, data processing will require more structured data.

PS. If you believe every tweet ever sent is in RAM, I suspect you are very wide of the mark. I'd be surprised if a high percentage of tweets more than a day old are anywhere except on disk.

mcohen01 said...

"While I believe you are correct that, in the future, more data will be held non-relationally, that doesn't mean there will be less data held in relational form. It's just that more data will be held in total."

The total number of bits on disk arranged according to the relational model may be lesser or greater, but the fact is that the relational model is already becoming obsolete. As with all trends in technology, the pace of this retirement will only accelerate.

"Formal, automatic, data processing will require more structured data."

Data processing is perfectly feasible with unstructured data; it's just typically easier with structured data. Of course, "structured data" takes many forms. The point is, the relational model is rapidly giving way to other forms of structuring data.

"PS. If you believe every tweet ever sent is in RAM, I suspect you are very wide of the mark. I'd be surprised if a high percentage of tweets more than a day old are anywhere except on disk."

I'm not sure why you would say that. Saying it's in RAM doesn't mean everything exists in RAM on every node in the cluster. There are plenty of systems with everything in RAM that have a lot more data than every tweet ever sent.

SydOracle said...

"The point is, the relational model is rapidly giving way to other forms of structuring data."
So? Whether you are using Relational, Dimensional, XML, Key-Value or even find a way of getting Object Oriented databases to actually work, the same questions need to be answered. What are the identifiers, what are the attributes, what are the dependencies, what can be changed, how do you search for data, how do you write it? Seen it, done it.

The data-modelling task changes only a small amount for implementation details (though would be a lot harder for key-value because it has to anticipate more at the design stage). If you think any of these issues goes away just because it is or can be held in RAM, you need to read up on the details.