Metadata Driven Data Warehouse (MDW) vs Traditional ETL tools

Here at Dimodelo Solutions, we are passionate about data warehouses and the benefits they can bring an organisation. But we are equally passionate about the art and practice (the how) of building these critical information assets. This is why we developed our Metadata Driven Data Warehouse tool “Dimodelo Data Warehouse Studio” (formally Dimodelo Architect).

What follows is a comparison of the Metadata Driven approach to Data Warehouse development vs the traditional approach taken with an ETL tool.

Introduction to Metadata Driven Data Warehousing

A Metadata Driven Data Warehouse (MDW) tool (Like Dimodelo Data Warehouse Studio) is a new class of ETL Tool, designed to improve developer productivity (amongst other things). It’s a specialized tool dedicated to Data Warehouse development that uses metadata and patterns to deliver data warehouses, combining modelling, ETL and Orchestration. It works at a logical design level, rather than a physical code level. It captures the design of both the Data Warehouse schema and ETL. It combines the captured design with a set of generation templates to create/sync the data warehouse schema and generate ETL code. The generation templates are pre-defined and tested to create consistent, best practice, quality, and comprehensive code.

An ETL tool is a general tool that can be applied to different data scenarios like data Warehouse development, Data Integration, System Interoperability, Data Migration. Being generalist makes it flexible, but this flexibility comes at a cost. It requires more architectural input and time consuming manual development from developers. While many ETL tools have GUI front ends, they are essentially the equivalent of programming code for a specific ETL runtime. They work at the detailed code level rather than a logical design level. Working at this low level is time-consuming. They rely on the skills and experience of the developer to ensure best practice, quality, tested and comprehensive code is developed. A considerable effort needs to be expended on defining coding standards, practices, and patterns and ensuring conformance to standards. Most ETL tools only work with the ETL and not the database schema. This means a coordinated deployment is required to manage the schema and ETL.

Change Propagation

Take a simple example of just changing a column name or data type in a Staging table. This will affect the staging table, the load process that loads the staging table, and any Dimension or Fact transform process that uses the staging table as it’s source.

Using a traditional ETL tool, you need to change each of those code artifacts individually, and then coordinate their deployment to multiple environments. The ETL changes need to be coordinated with the schema changes. This process introduces a lot of effort and potential human error.

With MDW, the modeling and ETL tools are integrated, and the solution is captured in metadata rather than code, any change that is made to the design propagates throughout the solution. Just change the name/type of the column, and regenerate. All code artifacts are simply regenerated to reflect the change and the tool handles the deployment of both the schema change and ETL code eliminating errors.

Portability

Data platforms are continually changing. Change is happening at an ever-increasing pace. Keeping up with that change is difficult. ETL code written today is obsolete in 6 months’ time. ETL gets stuck at a point in time, and usually, the only option is to rewrite it to take advantage of the latest technologies.

Metadata Driven Data Warehousing is different. Because it is not aligned to any one technology, but rather, is captured design at a logical level, moving from one data platform or technology is as simple as taking an existing project and re-targeting to a different platform. The code is then re-generating with a new set of generation templates that target that new platform. The ETL patterns remain largely logically the same, but their physical implementation can be very different, utilizing the strengths of different technologies.

Productivity/Speed

ETL can get complicated quick. Teams can get bogged down in implementation details, and as the body of code grows, having to maintain and refactor existing code becomes a hinderance to progress. Defining ETL at a logical mapping level combined with code generation and dedicated data warehousing tooling that lets you import schema and automates mapping increases productivity by a factor of 3. The technical complexity is encapsulated in the generation templates. Developers work at a higher logical level and don’t get bogged down in writing, deploying, changing and testing ETL code. Change propagation and coordinated deployment mean developers spend more time doing development rather than the admin overhead of releasing code.

Cost

According to research at Wharton University the optimum team size for a complex task requiring coordination is 6

“if companies are dealing with coordination tasks and motivational issues, and you ask, ‘What is your team size and what is optimal?’ that correlates to a team of six. “Above and beyond five, and you begin to see diminishing motivation,” says Mueller. “After the fifth person, you look for cliques. And the number of people who speak at any one time? That’s harder to manage in a group of five or more.”

It’s a bit more nuanced than that. You can read an article about the research here – https://knowledge.wharton.upenn.edu/article/is-your-team-too-big-too-small-whats-the-right-number-2/

Mostly smaller teams are more productive and, of course, cost less.

With MDW you empower each team member with tools that make each individual far more productive. Combine that tool productivity with the inherent productivity benefits of small teams (a luxury MDW enables) and you have the recipe for a highly productive team.

Consistency

Over time data warehouse teams change. Each developer brings their unique approach to ETL code to solve data problems. This can present challenges when the developer leaves or moves on to other tasks. It becomes difficult for others to interpret and understand their code. In the worst cases, it becomes necessary to re-implement.

With MDW the design is captured in a consistent way, prescribed by the tool and architecture adopted. The entire solution is encapsulated in a single solution, easy to find and follow. Even across projects, there is consistency in the approach.

Not to mention the consistency of the code generated by the generation templates that ensure that a small stable of code patterns are judiciously applied to every generated artifact.

Quality

Code quality is essential to a sustainable data warehouse solution. Does your code cater for unexpected events? Does it take an oversimplified approach (e.g. truncate and reload) because other approaches just take too much effort? Is it tested to ensure that every permutation of insert, update, delete and reinstate are catered for? Is it robust? Can it handle duplicates or intra batch changes for example? Can it handle table changes like new columns, or does this mean a reload is required? Is it keeping history? Does it perform? Can it cater for complex transformations? There is a perception that ETL is easy. Just extract data from one place and load it into another. This is what a lot of vendors would like you to believe. The truth is Data Warehouse ETL is complex and requires careful thought. When you examine all the possible use cases, it becomes even more complicated. This complexity also means manually writing robust ETL can be time-consuming and error-prone, even with well thought out patterns.

The MDW approach is to solve all those complex issues once in the generation template and then apply that to many entities in your project. The template is thoroughly tested to ensure it performs and correctly handles the many use cases. You can have confidence that every generated code artifact has the same high levels of quality.

Responsiveness to Business

“The only constant in life is change”

That is true for data and reporting. Business requirements and source systems are continually changing, coming, going and upgrading. All the benefits of MDW make your Data Warehouse team much more responsive to a changing environment. Without it, the team can be very resistant to change which can result in the business looking for other solutions to their immediate problems.

Focus

What does this all boil down too? Is your Data Warehouse team lost in the technology or focusing on solving business problems? Being business-focussed is key to building a powerful information asset like a data warehouse. Businesses that focus on core business concepts, processes, terms and measures are building a consensus that aids understanding, communication and insight throughout the business. Too often this understanding is concentrated with a few employees and can walk out the door when key employees leave. Having your Data Warehouse team lost in the technical detail, drudgery and churn the ETL, means they have already lost the battle to deliver a long term business-focussed information asset. An MDW lifts the team out of the drudgery and focuses them on the business at hand.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.