I was recently having a conversation that got me thinking philosophically about data. It reminded me of an article I read by Kevin Shelly about MarkLogic’s NoSQL database.
In the article, Kevin states:

“Today 80% of the world’s data is unstructured. There’s a new generation of enterprise software, which can aggregate all of the data, structured and unstructured, in real time, and deliver the right information, to the right people, at the right time.”
While that is some great marketing speak, I would argue that you cannot aggregate unstructured data without first applying a structure. I believe data itself is not unstructured; it can most definitely be described within a particular context, and much of what we, as data professionals, do is try to determine the appropriate structure based on what we know about the domain (or we build software that tries to structure it for us). In many cases, the decision is really about when to apply that structure to the data: before storing it, in an application, in a reporting layer, during analysis, or perhaps never.
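To make that distinction concrete, here is a minimal sketch in Python contrasting the two extremes: applying structure before storing (schema-on-write) versus storing raw text and structuring it during analysis (schema-on-read). The log format, field names, and regex are hypothetical, chosen purely for illustration:

```python
import json
import re

# A hypothetical "unstructured" record: free text with latent structure.
raw = "2013-05-07 ERROR payment-service: card declined for order 8231"

# The structure we believe the domain implies, expressed as a pattern.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+) (?P<service>[\w-]+): (?P<message>.+)"
)

def structure_before_storing(line: str) -> dict:
    """Schema-on-write: parse into named fields up front; reject misfits."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"line does not fit the schema: {line!r}")
    return match.groupdict()

# Schema-on-read: store the raw text untouched, structure it only when queried.
store = [raw]  # stand-in for a document store holding raw lines

def structure_at_read_time(lines: list[str], level: str) -> list[dict]:
    """Apply the same structure lazily at query time, skipping lines that don't parse."""
    results = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("level") == level:
            results.append(match.groupdict())
    return results

print(json.dumps(structure_before_storing(raw), indent=2))
print(structure_at_read_time(store, "ERROR"))
```

Either way, the same parsing logic has to live somewhere; the choice is only about when it runs and who deals with the records that don’t fit.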
There are a variety of considerations when choosing a NoSQL database, mostly revolving around throughput and how strictly you need ACID transactional consistency.
At the end of the day, in order to turn data into useful information, someone still has to do the work of mining and “structuring” it. The question is whether that structuring and cleanup should occur before storing the data or afterward.
What do you think?
Let us know in the comments below.