What is a Data Warehouse ?

Essentially a Data Warehouse is:

  1. A database that exists to support reporting and analysis in an organisation.
  2. It contains historical, clean and integrated data from multiple source systems.
  3. The data is structured in a “star schema”, that makes it easy to consume by Business Intelligence tools for analysis and reporting purposes.
  4. It is optimized to respond to queries over large data sets quickly.

At the heart of a data warehouse is the ability to organize data so it can be easily consumed by report writers. What do I mean by easily consumed? Essentially the ability for business users to drag and drop/slice and dice data attributes and measures in a pivot table or graph. No code required.

Other use cases include Data analysts and scientists interacting with a data warehouse in the same way or more directly via code (SQL, R, Python etc).

Recently the term “Data Warehouse” has become blurred. With the advent of big data, data lakes, data swamps etc, vendors will often refer to these different data management architectures as a “Data Warehouse”. For the purposes of this course, the term Data Warehouse means a Kimball Multi-Dimensional Data Warehouse. That is, a database containing Facts and Dimensions organised in a Star Schema. That is the core of the Data Warehouse, but there are other data layers and processes that accompany a Kimball Data Warehouse and are vital to its operation, including:

  • The intermediary data layers (like a Persist layer or a Staging layer)
  • The Extract Transform and Load (ETL) processes required to load data into the Data Warehouse.
  • A semantic layer like an OLAP Cube or Tabular model that provides high performance aggregated query performance for end users.