Featured Post

ERP, BI and UML 2.0

Enterprise Resource Planning (ERP) systems are organisational platforms for coordinating business processes and their supporting data, integrating HR, Finance, Manufacturing, Supply Chain and Customer Services as core activities in order to provide cohesive and timely services....


Apache Tez

Posted by Anahita | Posted in Big Data | Posted on 28-12-2013



Apache Tez, part of the Stinger Initiative, is a Hadoop framework for near-real-time big data processing. Where MapReduce was built for bulk batch processing, Tez provides a powerful interactive framework for running queries in Apache Hive and Apache Pig, giving faster response times and higher throughput.

In fact, Apache Tez is a Hadoop data processing framework that uses a DAG (Directed Acyclic Graph) to execute complex tasks. This means Tez models a data processing job as a dataflow graph, much like a Pig Latin script: the vertices are operators holding the logic that modifies or moves the data, and the edges represent the data flows between them. At execution time Tez realises the logical graph into a physical plan on the cluster, applying parallelism at the vertices to scale to the volume of data being processed.
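To make the dataflow-graph idea concrete, here is a minimal single-process sketch (not Tez's actual API; vertex names and functions are invented for illustration): vertices hold processing functions, edges declare data flows, and the job runs in topological order.

```python
from collections import defaultdict, deque

# Illustrative sketch of a Tez-style job modelled as a DAG: vertices are
# operators, edges are data flows. Real Tez parallelises vertices on a cluster.
class DataflowDAG:
    def __init__(self):
        self.ops = {}                    # vertex name -> processing function
        self.edges = defaultdict(list)   # upstream vertex -> downstream vertices
        self.indegree = defaultdict(int)

    def add_vertex(self, name, fn):
        self.ops[name] = fn
        self.indegree.setdefault(name, 0)

    def add_edge(self, src, dst):
        self.edges[src].append(dst)
        self.indegree[dst] += 1

    def run(self, source_data):
        # Kahn's algorithm: compute a topological order of the vertices.
        indeg = dict(self.indegree)
        ready = deque(v for v in self.ops if indeg[v] == 0)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for w in self.edges[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
        # Execute each vertex on the outputs of its upstream vertices;
        # source vertices read from the supplied input data.
        results = {}
        for v in order:
            upstream = [results[u] for u in order if v in self.edges[u]]
            results[v] = self.ops[v](upstream if upstream else [source_data[v]])
        return results

# Hypothetical three-vertex job: load -> filter -> sum
dag = DataflowDAG()
dag.add_vertex("load", lambda xs: xs[0])
dag.add_vertex("filter", lambda xs: [r for r in xs[0] if r > 10])
dag.add_vertex("sum", lambda xs: sum(xs[0]))
dag.add_edge("load", "filter")
dag.add_edge("filter", "sum")
out = dag.run({"load": [5, 20, 30]})
```

This is the same shape as a Pig Latin script: each statement becomes a vertex, and Tez decides how to map the logical graph onto cluster tasks.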



Apache Sqoop

Posted by Anahita | Posted in Business Intelligence | Posted on 24-11-2013




Apache Sqoop transfers bulk data between Apache Hadoop and relational datastores. Sqoop is used for importing data into HDFS, or into related stores such as HBase or Hive. It is also used for bulk export of data from HDFS, Hive or HBase back into relational databases such as HSQLDB.

By moving data in bulk rather than row by row, Sqoop provides a far more efficient path for getting operational data into Hadoop for analysis.
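The shape of a Sqoop-style import can be sketched as follows. This is a conceptual illustration only: sqlite3 stands in for the relational source and an in-memory buffer stands in for an HDFS file, and the table name is invented; real Sqoop runs the copy as parallel MapReduce tasks.

```python
import csv
import io
import sqlite3

# Conceptual sketch of a table import: read every row of a relational table
# and write it out as delimited text (Sqoop's default HDFS file format).
def import_table(conn, table, target):
    cur = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(target)
    for row in cur:
        writer.writerow(row)

# Stand-in relational source with a hypothetical 'customers' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Linus")])

buf = io.StringIO()                 # stand-in for an HDFS output file
import_table(conn, "customers", buf)
lines = buf.getvalue().splitlines()
```

An export is the mirror image: read the delimited files and issue batched INSERT statements against the target database.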


Hadoop Explained!

Posted by Anahita | Posted in Business Analytics, Technology | Posted on 09-12-2012



Big Data Solution: IBM InfoSphere BigInsights

Posted by Anahita | Posted in Technology | Posted on 20-01-2012



Big Data again, as this subject fascinates me. Looking into tools and technologies, I already posted about open source Apache Hadoop projects, HDFS and MapReduce.


IBM offers two editions of InfoSphere BigInsights, both compatible with the Apache Hadoop ecosystem for handling big data.

The Basic Edition of InfoSphere BigInsights is a free download that includes a fully integrated, compatible version of Apache Hadoop and related components. It comes with a web-based management console and the ability to integrate with IBM InfoSphere Warehouse, IBM Smart Analytics System and DB2 for Windows, UNIX and Linux, and it is completed by Jaql, a SQL-like query language for both structured and non-traditional data types.

The Enterprise Edition supports structured, semi-structured and unstructured data, and massive scale-out, while running on commonly available hardware. It adds enterprise-class management, including job management, and security features such as Active Directory/LDAP authentication.

More details on editions and pricing are available from the IBM website.






Big EDW!

Posted by Anahita | Posted in Agile, Business Intelligence, Data Warehouse, Technology | Posted on 09-01-2012



Big Data is changing the way we need to look at Enterprise Data Warehousing. I previously posted about big data in Big Data – Volume, Variety and Velocity!, and about the supporting Apache Hadoop projects, such as HBase and Hive, in Big Data, Hadoop and Business Intelligence. Today I want to introduce a new concept, or better say an original idea: Big EDW! Yes, Business Intelligence and Data Warehousing will also have to turn into Big BI and Big EDW!

So what makes the fabric of Big EDW and Big BI Analytics? The answer is the ability to analyse and make sense of Big Data, which covers not only the roughly 20% of structured data that organisations keep in their relational and dimensional databases, but also the vast remaining 80% of unstructured data scattered across digital and web documents such as Microsoft Word, Excel, PowerPoint, Visio and Project files, web data such as social media, wikis and web sites, and other formats such as pictures, videos and log files. I have previously posted about the meaning of unstructured data in On Unstructured Data.

Traditionally, an Enterprise Data Warehouse is a centralised Business Intelligence system, containing the ETL programs required to access the various data sources, transform the data and load it into a well-designed dimensional model. Front-end BI tools such as reporting, analytical and dashboard applications are then used on their own, or integrated with the organisation's intranet, to give the right users timely access to relevant information for analysis and decision making.

Big Data does not quite fit into this model, for three main reasons: the volume, variety and velocity of its change and growth. Big EDW will need to break some of the traditional data warehousing conventions, but once that is done it will create value many times over.

Big EDW should have the ability to be quick and agile in dealing with Big Data: it has to provide fast access to many new data sources at high volume. Enhanced design patterns, or new use cases, will have to emerge to make this possible, making use of more intelligent and faster methods of providing the relevant data when required. This could be achieved by many methods: dimensional modelling; advanced mathematical/statistical techniques such as bootstrap and jackknife resampling, which give accurate approximations of the mean, median, variance, percentiles and standard deviation of big data; and the Apache Hadoop stack, where projects such as MapReduce, HDFS, Hive (with its HiveQL query language) and HBase play an essential role. New central monitoring tools should be developed and embedded within the Big EDW to handle big data metadata such as social media sources, text analysis, sensor analysis, search ranking and so on. Parallel machine learning and data mining, being explored via projects such as Apache Mahout and Hadoop-ML, combined with Complex Event Processing (CEP) and faster SDLC and project methodologies such as agile Scrum for the Big EDW life cycle, are also becoming standard in the realm of Big EDW.
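The bootstrap idea mentioned above can be sketched in a few lines. This is a minimal illustration with invented sample data: instead of repeatedly scanning the full big data set, we estimate a statistic and its variability by resampling, with replacement, from a sample.

```python
import random
import statistics

# Bootstrap sketch: resample (with replacement) from a sample many times,
# compute the mean of each resample, and use the spread of those means to
# gauge the accuracy of the estimate.
def bootstrap_means(sample, n_resamples=1000, rng=None):
    rng = rng or random.Random(42)   # fixed seed so the sketch is reproducible
    k = len(sample)
    return [statistics.mean(rng.choices(sample, k=k)) for _ in range(n_resamples)]

sample = [12, 15, 9, 20, 14, 11, 18, 16]       # hypothetical sampled values
means = bootstrap_means(sample)
estimate = statistics.mean(means)              # bootstrap estimate of the mean
spread = statistics.stdev(means)               # standard error of that estimate
```

The same resampling loop works for the median, variance or percentiles; the jackknife variant leaves one observation out at a time instead of resampling.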

Note that the phrase “Big EDW” is not used anywhere else; it is the name I thought could fit the growth of EDW into a system that can also accommodate and manage Big Data!








Big Data, Hadoop and Business Intelligence

Posted by Anahita | Posted in Business Intelligence | Posted on 17-12-2011



I consider Hadoop one of the technologies that links Big Data Analytics and Business Intelligence. In my previous posts I explained what Big Data means and what Unstructured Data is. In this post I would like to introduce Hadoop, which makes it possible to gain business value from Big Data.

Apache Hadoop is an open source project providing software for reliable, scalable distributed computing. Its simple programming model enables the distributed processing of large data sets across a cluster of machines that pool their processing and storage, which allows Hadoop to scale out easily as required. Hadoop consists of three subprojects: Hadoop Common, the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The Hadoop ecosystem also includes derived technologies that can be used on their own or together to achieve the desired outcome, such as Hive, HBase and ZooKeeper. For more details on each of these projects, please visit http://hadoop.apache.org/

At the core of Hadoop are HDFS and MapReduce.

HDFS, the Hadoop Distributed File System, distributes a file's data blocks across the nodes of a cluster, so that computation can run locally on each node and storage scales with the cluster, resulting in very fast processing.
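A toy sketch of that block placement (block sizes, node names and the round-robin policy are all simplifications; real HDFS uses 64/128 MB blocks and rack-aware placement):

```python
# Simplified HDFS sketch: a file is split into fixed-size blocks, and each
# block is stored on several nodes (replication) so computation can run
# where the data lives and the data survives node failure.
def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=3):
    placement = {}
    for i, _ in enumerate(blocks):
        # Round-robin placement across nodes; real HDFS is rack-aware.
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 300, block_size=128)   # toy 300-byte "file"
placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
```

A NameNode keeps this block-to-node map; the DataNodes hold the blocks themselves.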

MapReduce is a programming model that makes it possible to perform parallel computation across the nodes of a cluster.
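The model is easiest to see in the classic word-count example. This is a minimal single-process sketch of the three phases; on a real cluster the map and reduce tasks run in parallel on different nodes:

```python
from collections import defaultdict

# Map phase: emit a (key, value) pair for every word seen.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle: the framework groups all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: fold each key's values into a single result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big edw"])))
```

Because each map call and each per-key reduce is independent, the framework can scatter them across the cluster and combine the results.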

For Business Intelligence, one of the Hadoop projects, Hive, provides a data warehouse system on Hadoop-compatible storage (such as Apache HDFS or Apache HBase) and allows querying, analysing and summarising big data using a dedicated query language called HiveQL.
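A HiveQL summary query looks just like SQL; Hive compiles it into MapReduce jobs behind the scenes. The query below uses an invented `web_logs` table, and sqlite3 is used here only as a local stand-in to run an equivalent aggregation:

```python
import sqlite3

# Hypothetical HiveQL: count page hits per page. In Hive this GROUP BY
# would be compiled into map (emit page, 1) and reduce (sum) tasks.
HIVEQL = """
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
"""

# Stand-in data store so the aggregation can run locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (page TEXT, ts TEXT)")
conn.executemany("INSERT INTO web_logs VALUES (?, ?)",
                 [("/home", "t1"), ("/home", "t2"), ("/about", "t3")])

rows = sorted(conn.execute(HIVEQL).fetchall())
```

This familiar SQL surface is exactly what makes Hive attractive for BI teams: the MapReduce plumbing stays hidden.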

Data is growing faster than ever; at the moment it doubles every year! This will soon become astronomical and hard to manage, as around 80% of this data is Unstructured Data. Projects like Apache Hadoop make it possible to analyse Big Data, and related projects such as Hive provide the equivalent data warehousing for further storage and analysis of the relevant data.



On Unstructured Data

Posted by Anahita | Posted in Business Intelligence | Posted on 14-12-2011



The name “Unstructured Data” does not, in itself, describe the type of data it refers to.

Generally, when organisations use systems and applications, there is a database at the back end, mainly in “relational” format. In a relational database, data is designed to be saved in tables that relate to each other according to certain rules, called normal forms. This design model guarantees that users of the corresponding systems, such as ERP systems, can insert, amend and delete data as quickly as possible. It is all about the performance of the applications and their screens.

But normalised data in relational databases is not well suited to querying. To work around this, relational data designers use indexes and other methods for querying and displaying the data, but too many indexes reduce the performance of the system, so this is not an effective approach when reports on historical data are required.

To solve this problem, the data is often remodelled dimensionally and saved into another database, usually a data warehouse or a data mart.
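A tiny sketch of that reshaping, with invented tables and figures: normalised rows become a fact table keyed to dimension tables (a minimal star schema), which makes historical summary queries cheap.

```python
# Hypothetical minimal star schema: one dimension table and one fact table.
date_dim = {
    1: {"date": "2011-12-01", "quarter": "Q4"},
    2: {"date": "2012-01-05", "quarter": "Q1"},
}
sales_fact = [
    {"date_key": 1, "amount": 100},
    {"date_key": 1, "amount": 250},
    {"date_key": 2, "amount": 75},
]

# A typical warehouse query: roll facts up along a dimension attribute.
def sales_by_quarter(facts, dates):
    totals = {}
    for row in facts:
        quarter = dates[row["date_key"]]["quarter"]
        totals[quarter] = totals.get(quarter, 0) + row["amount"]
    return totals

totals = sales_by_quarter(sales_fact, date_dim)
```

Queries like this read the fact table once and aggregate, instead of joining across many normalised tables.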

All said, the data saved in these systems and relational databases is only a fraction of the information held in an organisation. Any data that is not saved in a relational or dimensional database is referred to as “Unstructured Data”, despite the fact that such data may well have structure associated with it!

Two examples of unstructured data that still carry structure are documents in a file system and the bodies of emails. There is certainly structure to a file system, and to the information in an email body, so these cannot be considered data with no structure at all, yet they are still categorised as unstructured. Other examples of unstructured data are Microsoft Office files, such as Word documents, Excel spreadsheets and Visio diagrams; pictures, scans, videos and webcasts; web data including social networks such as Facebook and Twitter, wikis and blogs; and any text or image data saved in any format, such as logs.

Statistics show that less than 20% of the data in organisations is relational, so the remaining data is saved and kept outside relational databases and considered unstructured. Unstructured data grows faster, and its variety and volume are also far higher, than relational data.

Until now it was practically impossible to apply any sort of analysis to unstructured data because of its volume and variety. This is becoming less of a problem with new advances in distributed computing.

In my next post I will explain Apache Hadoop and how it can come to the rescue, creating amazing ways to analyse “Big Data”.