The theory of Data Warehousing

Data Warehouse in general

How the Business Dimensional Lifecycle can support the development of the Corporate Information Factory

Developing a data warehousing solution like Ralph Kimbal’s Corporate Information Factory (CIF) will, in most cases, be a windy road. In these situations the Business Dimensional Lifecycle (BDL) will support the development of the data warehouse solution by incorporating iterative development. This is one of the, if not the most important, strategy when developing a data warehouse solution for a company. The main reasons for this are that a company’s requirements will change over time, as business changes because of increased insights which make developing a data warehouse more of a process than a project with an end-date.

Another reason why iterative development is beneficial for data warehouse solutions is the shear size of a full-blown data warehouse. In most companies there will be multiple source systems and a variety of stakeholders with different report requirements. Using a traditional waterfall method would make development of a data warehouse a hard, if not impossible task to manage. Next to this the ROI date will be much later in time compared to an iterative approach as an iterative approach will start delivering functionality early on in the development.

A third reason for an iterative approach is that increased insights on the data warehouse project team side, apposed to the business side as mentioned in the former section, can be used in every iteration. These insights can for instance come from performance monitoring or usability feedback from users. Increased insight can also lead to better development or improvement support of metadata in the data warehouse solution.

A second reason why the BDL will support the development of the CIF, apart from the iterative approach argument, is that it is tailored for data warehouse development. The BDL knows not timelines, is knowledge-driven in stead of data-driven and makes the project team focus on the monitoring and tuning performance and usability.

Data warehousing is more a process than a technology

For some of the same reasons why an iterative approach for developing a data warehouse solution will be beneficial, as argued in the former paragraph, it can be argued that data warehousing is more a process than a technology. A data warehouse should be continuously tailored to the changing needs of a company by delivering accurate and reliable information in support of business decisions. Changed reporting requirements can result in the addition of new source systems, the alteration of the subject orientation and the addition/alteration of reports created by the data warehouse. This continues shaping of the data warehouse solution has more similarities with a (continues) process than a sheer technology that is implemented once and changes little over time (like many IT infrastructure components).

A second argument is that a data warehouse is made up of continuously changing operational data that is transformed into a subject oriented, non-volatile and integrated dataset. Because the sources containing the operational data are continuously changing, the data warehouse needs to be loaded periodically which changes the content of the data warehouse. Every load of a data warehouse should incorporate checks on different aspects of data quality and make changes to the load process, the source systems and/or the data warehouse when needed. The periodical and therefore continues nature of the loading and altering the full business intelligence solution also has more similarities with process than a technology.

Why business-ICT alignment is crucial for successful data warehousing

For most businesses is IT not core-business, but a mean that supports their business activities. Data warehousing is no exception, I would even argue that data warehousing can even be one of the most powerful support-means to a business. A remaining issue with the use of IT within businesses is the alignment of IT with the business needs. Both environments are highly changing; within IT new possibilities, techniques and standards appear and disappear rapid; the business environment needs to be able to react quickly to market changes. Because both environments are continuously changing in mostly different directions problems will appear with; tension between coherence and agility; mutual understanding and a common language; sustaining alignment. Most of these problems are summarised by Luftman (2003):

“Achieving and sustaining IT-business alignment continues to be a major issue. Experience shows that no single activity will enable a firm to attain and sustain alignment. There are too many variables. The technology and business environments are too dynamic.” (p.30)

Especially a data warehouse, supporting a business with an aggregated view on the business, needs to be alignment with the business needs. Requirements should be well translated to solutions the data warehouse can offer to the business.  The strength of a data warehouse is to offer information to the business that will provide the business with improved insight in their business doing, with the potential to spot new business opportunities. This can only be achieved if business and IT, the data warehouse specialists, are in continues agreement on where to improve and/or expand the data warehouse to support the ever-changing business needs.

Data Warehouse Architecture

Three important differences and similarities between Bill Inmon’s Corporate Information Factory and Ralph Kimball’s Data Warehouse Bus Architecture

In the field of data warehousing two thought leaders can be distinguished; Bill Inmon and Ralph Kimball. Both Bill Inmon’s Corporate Information Factory (CIF) and Ralph Kimball’s Data Warehouse Bus Architecture (DWBA) have similarities and differences. All aspect of both approaches should be considered to determine the best approach for each organisation.

Differences between both approaches are that Inmon believes in a single normalised data warehouse, where dimensions are an option. This CIF can have data marts, but these are derived from the ‘main’ data warehouse. The granularity to the stored data should be as low as possible to be able to offer every level of detail in the future, when business requirements may change drastically. Ralph Kimball however believes that a data warehouse consist of multiple data marts, say a ‘virtual’ data warehouse. The data marts are mainly dimensional of nature with a granularity that is considered enough for the current and expected business needs.

Similarities in the views of Inmon and Kimball are that both agree on the need for metadata in a data warehouse solution. Meta data can be considered ‘data on data’, which in short is a definition of terms for concepts a data warehouse reports on. They also agree on that a data warehouse should be build using an iterative approach. A data warehouse should be considered as a continues process, the iterative approach supports this by arguing for short term releases and continues evaluation of how the data warehouse fulfils the business requirements. Another similarity is that both agree on the need for data warehouse architecture.

An example in which case the CIF Architecture is a better solution an in which case the Data Warehouse Bus Architecture is a better solution and explain why these are good examples

The Corporate Information Factory (CIF) architecture works best for larger organisations because these organisations tend to consist of many organisational units and therefore many different data sources. In this case will the CIF architecture, with a single ‘main’ data warehouse, offer single place (ETL) where the difficult integration of the many different data sources will be done.

The Data Warehouse Bus Architecture (DWBA) will, in contrast, be a better solution for smaller organisations. These organisations tend to have predictable measuring requirements which are less likely to change often.

Data Warehouse Modeling and ETL

Advantages of a snowflake schema with respect to a star schema

Data warehousing in general knows two schema types; snowflake and star. Two advantages of the snowflake schema will be described in the following paragraphs.

One advantage of the snowflake schema is that this schema contains relatively much less duplicate data because this schema is modelled to store the dimensional tables normalised. This is an inherent issue of the star schema because the central fact table can and mostly will grow very large. A side effect of less duplicate data is that less data needs to be stored as the data volume is smaller.

A second advantage of the snowflake schema is that normalisation improves performance of frequent data warehouse updates like (rapid) insertions. The less beneficial counter effect of normalisation is that query performance will be less because of the higher number of joins.

The difference between an ETL-engine and an ETL-generator

Most data warehouses are dependent on an ETL (extract, transfer and load) to fill the data warehouse with data from source systems. Current ETL’s have become much more that just extracting, transforming and loading data. More advanced tasks like data cleansing, data integration and aggregation have also become the responsibility of the ETL.

In the early days of data warehousing the only way of building an ETL was coding it by hand. As data warehousing grew towards a mature field in IT, software vendors started to supply ETL code generators. This type of ETL-tool is capable of generating ETL scripts based on parameters and configuration. Although this ETL-tool does reduce the time to build and/or change an ETL, this generation is not specifically build for optimisation. Performance of the ETL can therefore be less than the self build ELT´s.

The latest generation of ETL-tools is an EL-engine. This type of ETL-tool offers a configurable ETL framework that allows the data warehouse specialist to make use of predefined functions and methods. The main benefit of this type of ELT-tool is that the software manufacturer has already built the ETL framework, based on best practices taken from data warehouse experts, incorporating performance solutions as well. The inevitable changes to the ETL no longer need long development time on the in-house build framework, but will just be configuration changes most of the time. This ETL-tool is metadata driven and highly optimised.

Why metadata is considered the glue of a data warehouse

Many things can be considered as metadata, from informal metadata like documentation to technical metadata like ETL rules and entity relationships. Metadata can be considered as ‘data that describes data’. Metadata therefore provides contextual information that allows data users or agent to understand information.

Without metadata providing context, information will remain just data which can hardly be used in the decision–making process. This is the exact reason why metadata can be considered the glue of a data warehouse; it provides the context that changes data into valuable information. Information that will be interpreted the same way by all users of the data warehouse.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s