Friday, March 11, 2011

What is Big Data?

Had a question today about Big Data and I didn't really know how to define it.

Is it defined by volume? Size?

Structured, unstructured (content management systems)?

Naturally, the first thing I did was ask the Twitter Machine. It's like sitting in a giant room and yelling out questions to a wide variety of people. In other words, fun.

From the one year older John Piwowar:

Definitely agree with that sentiment. It is just ones and zeros. Lots of it, apparently.

I think my question goes deeper though. I have a pretty good understanding of structured data, i.e. that used by the majority of business applications. You know, you define a data model, it changes over time, no big deal. But...and it's a giant but, what about all that unstructured stuff?

Next up, future beer drinking buddy and obvious fellow smart-ass.

A couple of non-believers stroll in...

I sometimes wonder, do we, as in database people, who work in the data information business, somehow miss the boat on things like Big Data? I can't imagine that's possible. We know the importance of data. Is there a disconnect? Is it just me? That ain't a swipe at the above people either.

Gary chimes in with a more...philosophical answer? I've read Gary's stuff for a few years and once in a while, he just goes way over my head...

Ted comes in and follows up to me:

There were definitely more, but I think these were the highlights for me.

Then it was off to the Google Machine. I love The Google.

Wikipedia to the rescue:

A Definition:
Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.

An Example:
Examples include web logs, RFID, sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, genomics, astronomy, biological research, military surveillance, medical records, photography archives, video archives, and large scale eCommerce.
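
To put "tolerable elapsed time" in perspective, here is a rough back-of-the-envelope sketch. The 100 MB/s read speed and the 100-machine cluster are my own illustrative assumptions, not numbers from Wikipedia, and `scan_hours` is just a made-up helper name.

```python
# Back-of-the-envelope: how long does it take just to READ a data set once?
# ASSUMPTION: each machine sustains 100 MB/s of sequential reads.
# That figure is illustrative only; real hardware varies a lot.

TB = 10**12  # bytes in a (decimal) terabyte
PB = 10**15  # bytes in a (decimal) petabyte


def scan_hours(size_bytes, throughput_bytes_per_sec=100 * 10**6, workers=1):
    """Hours needed to read size_bytes once, split evenly across workers."""
    seconds = size_bytes / (throughput_bytes_per_sec * workers)
    return seconds / 3600


if __name__ == "__main__":
    for label, size in [("36 TB", 36 * TB), ("1 PB", PB)]:
        one = scan_hours(size)
        many = scan_hours(size, workers=100)
        print(f"{label}: {one:.1f} hours on 1 machine, {many:.1f} hours on 100")
```

Under those assumptions, a few dozen terabytes already takes about 100 hours to read on a single machine, and a petabyte takes roughly 116 days. Spread evenly over 100 machines, that drops to about 1 hour and about 28 hours.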

Definitely coming together now...and we're back to John's point, it's just ones and zeros.

I'll let you use The Google Machine yourself as well. Lots of good articles.

For a really quick (4 minutes) primer, here is some dude from O'Reilly

Thanks to everyone for their responses on Twitter. They definitely helped to clear up some of my confusion.


Martien van den Akker said...

Hi "Oraclenerd",
Some time ago in the Netherlands we had a campaign by a food company that came up with a product with a so-called "longer keeping-date". And I thought, how can a date be "longer"? It's how you define it, but at most 24 hours, 24x60 minutes, ...
Or you have to spell out the months, like 3 August 2011 instead of 3/08/11. But I understand that is not what they meant...

Bradd Piontek said...

I've got your Big Data right here

SydOracle said...

The conciseness of a tweet isn't my medium :)
I was channeling Curt Monash.

And there are more than the big three. As a side thought, Google Search is machine gathered data. GMail will be big on human generated data as well as all that mass generated email.

Wonder how much of Big Data is an aggregation of data held elsewhere.

oraclenerd said...


thanks for the links, good stuff.

definitely an interesting subject.

EscVector said...

In a nutshell it's about distributed processing of very large data sets on commodity hardware, typically with Open Source tools such as Hadoop/Hive. It's about watching and analyzing the flow of statistically relevant data in near real time. It's also about not doing it for millions of dollars.

EscVector said...

OSCon Data is about Big Data. Maybe I'll see you there.

SydOracle said...

EscVector's comment is a bit too solution oriented for me. "Distributed Processing", "Commodity hardware", "Open Source" smack of "This is the solution, now what's the problem."

There are some Big Data problems that fit the Exadata model of specialized, closely integrated hardware and proprietary software. Sure it is expensive, but there are very few people who will opt for an expensive solution in preference to an equally capable cheap solution. Which suggests to me that @EscVector's solution is only applicable to a subset of Big Data problems.