
Big Data and BI

Posted by Anahita | Posted in Big Data, Business Intelligence | Posted on 05-10-2014



The most important characteristics of big data are commonly summarised in the three Vs: Velocity, Volume and Variety. It is data whose volume grows quickly and which arrives in a variety of contexts, some of which cannot be explored easily in the world of relational databases.

Business Intelligence is about providing data to the business so that actionable insight can be achieved in a timely manner. Considering the 3V nature of Big Data as explained above, it is crucial to ask the right questions and to find the correct way to collect and cleanse the data and make it available for further discovery.

Apache Hadoop is an ecosystem in which distributed commodity hardware, combined with the processing model of MapReduce and the resource management of YARN, provides the essential ingredients for working with Big Data. MapReduce provides the computational power for analysing unstructured data. It takes datasets of key-value pairs as input and produces key-value pairs as output.
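To make the key-value idea concrete, here is a minimal single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are assumptions made up for illustration. On a real cluster the map and reduce steps would run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map: take one (key, value) input pair and emit (word, 1) pairs."""
    _offset, line = record  # input key is a line offset, value is the text
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single output pair."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence (illustrative only)."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: (line offset, line text) key-value pairs.
lines = [(0, "big data big insight"), (1, "data in motion")]
counts = run_job(lines)
print(counts)
```

Both the input (`lines`) and the output (`counts`) are collections of key-value pairs, which is what allows one job's output to feed the next job.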

Azure HDInsight is a cloud service that provides the Hadoop framework, combining components and related Apache projects such as HDFS, MapReduce, Hive, Pig and Oozie.

The storage used by Azure HDInsight is Azure Blob storage. HDInsight clusters can be created when required and dropped after the computational tasks are completed, while Blob storage keeps the data after the clusters are dropped. Blob storage exposes an interface to the HDFS file system. The Sqoop connectors can be used to import data from an Azure SQL database to HDFS, or to export data from HDFS to an Azure SQL database.



For Business Intelligence, Microsoft Power Query for Excel provides the ability to import data from Azure HDInsight or any HDFS into Excel. This enhances data discovery and blending by enabling access to a wider range of data sources.

In addition to Excel Power Query, the Microsoft Hive ODBC Driver can be used with other Microsoft Business Intelligence products such as Excel, SSIS and SSRS to provide an integrated solution.


Apache Tez

Posted by Anahita | Posted in Big Data | Posted on 28-12-2013



Apache Tez, part of the Stinger Initiative, is a Hadoop framework for near real-time big data processing. Unlike MapReduce, which was designed for bulk data processing, Tez provides a powerful framework for running interactive queries in Apache Hive and Apache Pig, with faster response times and higher throughput.

Apache Tez is a Hadoop data processing framework that uses a DAG (Directed Acyclic Graph) to execute complex tasks. This means Tez models a data processing job as a data flow graph, much as Pig Latin scripts do: the edges of the graph represent data flows, and the vertices hold the operators, that is, the logic that modifies or moves the data. At execution time on the cluster, Tez expands the logical graph into a physical one, applying parallelism at the vertices to scale to the amount of data to be processed.
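The data flow graph idea can be sketched in a few lines of plain Python. This is not the Tez API; the vertex names, the operators and the `run_dag` helper are hypothetical, and the sketch runs each vertex once, in sequence, without the parallelism that Tez applies at each vertex. It assumes Python 3.9 or later for the standard library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Vertices: each operator receives a list of its upstream outputs.
vertices = {
    "load": lambda inputs: [3, 1, 4, 1, 5, 9, 2, 6],
    "filter_even": lambda inputs: [x for x in inputs[0] if x % 2 == 0],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "join": lambda inputs: sorted(inputs[0] + inputs[1]),
}

# Edges: each vertex maps to the upstream vertices whose data flows into it.
edges = {
    "load": [],
    "filter_even": ["load"],
    "square": ["load"],
    "join": ["filter_even", "square"],
}

def run_dag(vertices, edges):
    """Run every vertex after all of its upstream vertices have finished."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = [results[parent] for parent in edges[name]]
        results[name] = vertices[name](upstream)
    return results

results = run_dag(vertices, edges)
print(results["join"])
```

Because `filter_even` and `square` depend only on `load` and not on each other, a scheduler is free to run them at the same time. That independence is what a DAG makes explicit.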