Firstly, I don't tweet. Alex (Gorbachev) mentioned it at a Sydney meetup. I had a look, but didn't get entrenched and I assume there will be others out there who aren't tweeters. Suffice to say a 'tweet' is a message broadcast by a twitter user to the twitter consumers. They are up to 140 characters long.
So what's to normalise ? Isn't it just a value ? Even Oracle 6 could cope with VARCHAR2(140).
But actually, a tweet isn't just a simple value.
A search for "beer" would turn up all messages that included #beer.Source: Wikipedia
Similarly, the @ sign followed by a username allows users to send messages directly to each other. A message with @example would be directed at the user [example] although it can still be read by anyone.
First normal form states
There's no left-to-right ordering to the columns.Source: Wikipedia
Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].
The problem is that the tweet "@tom Come for a #beer or #burger. Don't let @harry come" definitely has hidden components, but there is a sequencing in the message that is just as important.
In a practical implementation, we would probably have the following tables:
TWITTER_USERS : username (eg @tom), date_joined, email....Our message would have the two child tag records (#beer and #burger) and two child destination records (@tom and @fred).
TAGS : tag (eg #beer)
TWEETS : tweet_id (surrogate key), created_by (referencing twitter_user), created_timestamp, tweet_text...
TWEET_TAGS : tweet_id, tag (eg #beer)
TWEET_DESTINATIONS : tweet_id, username (eg @tom)
At the logical level, we are not properly normalized because we have the tweet_text duplicating information from the child entities and the potential for inconsistencies between them. We can say that the tweet just seems to contain duplicate information but it is really different. Is that just being picky ?
I am not suggesting the relational model is wrong, broken, incomplete or inadequate. Quite the reverse, in fact. In this case the value of the model is that it tells us the problems that will arise when we denormalise data.
For example, if @harry deletes his twitter account (because he was never invited for beers), do we delete the tweet_dest that referred to him or do we keep it and not enforce that referential integrity constraint ? If we delete the tweet_dest, we have an inconsistency between the tweet_text attribute and the tweet_dest child entities. Or maybe we delete the tweet entity itself and all its children. Those are really choices for the business (possibly with some legal implications though).
I don't have a solution to the logical model representation, and would be interested in feedback. But not by twitter please :)