How Do You Normalize a Tweet?
Second post by Mr. Myers, you can read his first one, How To Kill a Code Review here. I have always liked design topics, I don't think they are covered enough on the web, which is why I liked this one. I have often wondered what trade-offs designers make for these types of applications (Twitter, Facebook, etc). Are they even "designed" by a data modeler? Or are they created by application developers? Not really sure it matters to those companies as they are successful (in a strange, no business model kind of way) and, I don't believe, represent many of the realities that we as Oracle professionals are likely to deal with.Firstly, I don't tweet.
Alex (Gorbachev) mentioned it at a Sydney meetup. I had a look, but didn't get entrenched and I assume there will be others out there who aren't tweeters. Suffice to say a 'tweet' is a message broadcast by a twitter user to the twitter consumers. They are up to 140 characters long.
So what's to normalise ? Isn't it just a value ? Even Oracle 6 could cope with VARCHAR2(140).
But actually, a tweet isn't just a simple value.
A search for "beer" would turn up all messages that included #beer.
Similarly, the @ sign followed by a username allows users to send messages directly to each other. A message with @example would be directed at the user [example] although it can still be read by anyone.
Source:
WikipediaFirst normal form states
There's no left-to-right ordering to the columns.
Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].
Source:
WikipediaThe problem is that the tweet "
@tom Come for a #beer or #burger. Don't let @harry come" definitely has hidden components, but there is a sequencing in the message that is just as important.
In a practical implementation, we would probably have the following tables:
TWITTER_USERS : username (eg @tom), date_joined, email....
TAGS : tag (eg #beer)
TWEETS : tweet_id (surrogate key), created_by (referencing twitter_user), created_timestamp, tweet_text...
TWEET_TAGS : tweet_id, tag (eg #beer)
TWEET_DESTINATIONS : tweet_id, username (eg @tom)
Our message would have the two child tag records (#beer and #burger) and two child destination records (@tom and @fred).
At the logical level, we are not properly normalized because we have the tweet_text duplicating information from the child entities and the potential for inconsistencies between them. We can say that the tweet just seems to contain duplicate information but it is really different. Is that just being picky ?
I am not suggesting the relational model is wrong, broken, incomplete or inadequate. Quite the reverse, in fact. In this case the value of the model is that it tells us the problems that will arise when we denormalise data.
For example, if @harry deletes his twitter account (because he was never invited for beers), do we delete the tweet_dest that referred to him or do we keep it and not enforce that referential integrity constraint ? If we delete the tweet_dest, we have an inconsistency between the tweet_text attribute and the tweet_dest child entities. Or maybe we delete the tweet entity itself and all its children. Those are really choices for the business (possibly with some legal implications though).
I don't have a solution to the logical model representation, and would be interested in feedback. But not by twitter please :)
Labels: design, gmyers