The name “Unstructured Data” is misleading: it does not really describe the type of data it refers to.
Most of the systems and applications that organisations use have a database in the back end, usually in “Relational” format. In a “Relational” database, data is stored in tables that relate to each other according to certain rules, called normal forms. This database design model lets the users of the corresponding systems, such as ERP systems, insert, amend and delete data as quickly as possible. It is all about the performance of the applications and their screens.
But normalised data in relational databases is not well suited to querying. To work around this, relational database designers use indexes and other methods for querying and displaying the data, but every extra index slows down inserts and updates, so this is not an effective approach when reports are required on historical data.
To solve this problem, the data is often remodelled as dimensional and saved into another database, usually a data warehouse or a data mart.
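To make the difference concrete, here is a minimal sketch in Python using the built-in `sqlite3` module. The table names (`customer`, `product`, `order_line`, `sales_fact`), the sample rows and the helper functions are all made up for illustration, and the single flattened reporting table is a simplification of a real dimensional model, which would normally have a fact table surrounded by separate dimension tables.

```python
import sqlite3


def build_demo():
    """Create a tiny normalised schema and a flattened reporting copy of it."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Normalised (transactional) design: each piece of information is stored
    # once, so inserts, amendments and deletions touch very few rows.
    cur.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE order_line (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(id),
            product_id  INTEGER REFERENCES product(id),
            order_date  TEXT,
            amount      REAL
        );
    """)
    cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                    [(1, "Acme", "North"), (2, "Globex", "South")])
    cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                    [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
    cur.executemany("INSERT INTO order_line VALUES (?, ?, ?, ?, ?)",
                    [(1, 1, 1, "2013-01-15", 100.0),
                     (2, 1, 2, "2013-02-10", 50.0),
                     (3, 2, 1, "2013-02-20", 75.0)])

    # Reporting copy (hypothetical): the joins are done once, up front, so
    # that reports can aggregate without touching the transactional tables.
    cur.execute("""
        CREATE TABLE sales_fact AS
        SELECT c.region                   AS region,
               p.category                 AS category,
               substr(o.order_date, 1, 7) AS month,
               o.amount                   AS amount
        FROM order_line o
        JOIN customer c ON c.id = o.customer_id
        JOIN product  p ON p.id = o.product_id
    """)
    conn.commit()
    return conn


def sales_by_region(conn):
    """Total sales per region, read from the reporting table only."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales_fact "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)
```

Calling `sales_by_region(build_demo())` gives the totals per region without any join at query time, which is the point of keeping a separate copy for reporting.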
All that said, the data saved in these systems and relational databases is only a fraction of the information held in an organisation. Any data that is not saved in a relational or dimensional database is referred to as “Unstructured Data”, despite the fact that such data may well have a structure of its own!
Two examples of unstructured data that still have a related structure are documents in a file system and the bodies of emails. A file system certainly has a structure, and the body of an email comes with data that describes it, so neither can be considered data with no structure, yet both are still categorised as unstructured data. Other examples of unstructured data are Microsoft Office files, such as Word documents, Excel spreadsheets and Visio diagrams, as well as pictures, scans, videos, webcasts, web data including social networks such as Facebook and Twitter, wikis, web blogs, and any text or picture data saved in any format, such as logs.
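As a small illustration of the email case, the sketch below uses Python's standard `email` module to split a message into its structured parts (the headers) and its free-text body. The message itself, the addresses and the `describe_email` helper are invented for this example.

```python
from email import message_from_string

# A made-up message: the headers are structured, the body is free text.
RAW = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 04 Mar 2013 09:30:00 +0000

Hi Bob,
the report is attached.
"""


def describe_email(raw):
    """Return the structured header fields alongside the free-text body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        # For a simple, non-multipart message the payload is the body text.
        "body": msg.get_payload().strip(),
    }
```

`describe_email(RAW)` returns the sender, recipient and subject as clean fields, while the body stays a block of text that no table design describes. That is why email is filed under unstructured data even though part of it is clearly structured.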
Statistics suggest that less than 20% of the data in organisations is relational, so the rest is kept outside relational databases and is considered unstructured data. Unstructured data also grows faster, and its variety and volume are far greater than those of relational data.
Until recently, it was impractical to run any sort of analysis on unstructured data because of its volume and variety. This is now becoming less of a problem, thanks to new advanced methods in distributed computing.
In my next post, I will explain Apache Hadoop and how it can come to the rescue by creating amazing ways to analyse “Big Data”.