New managed Incremental Extract Pattern

[et_pb_section admin_label=”section”][et_pb_row admin_label=”row”][et_pb_column type=”4_4″][et_pb_text admin_label=”Text” background_layout=”light” text_orientation=”left” text_font_size=”14″ use_border_color=”off” border_color=”#ffffff” border_style=”solid”] Dimodelo Data Warehouse Studio now allows you to extract only changed data from your source system, reducing load on your source systems and overall ETL duration. I.e. To do change data capture (CDC) at the source. Dimodelo Data Warehouse Studio now includes a managed Incremental Extract pattern. The pattern generates code that extract only changed data from your source databases. E.g. you can use this pattern to extract only transactions from a large transaction table that have been inserted or updated since the last time the extract was executed. For the pattern to work you need to identify a column or multiple columns that the source system uses to track change. These columns are called ‘Change Data Identifier’ columns.  They would usually be a modified date or sequential identifier. Each time the Extract is executed Dimodelo saves in the control Parameters table the maximum value that was extracted for the ‘Change Data Identifier’ column(s). Subsequent executions of the Extract only retrieve data from the source where the ‘Change Data Identifier’ column(s) value in the source is greater than the maximum value stored by the previous invocation of the Extract. This is the mechanism Dimodelo uses to manage your incremental extracts.

How to use this pattern

For a Staging table, in the ETL dialog, select the ‘Incremental Extract’ pattern for the Extract. See below (click to enlarge ). [/et_pb_text][et_pb_image admin_label=”Incremental Extract Pattern” src=”https://www.dimodelo.com/wp-content/uploads/Incremental-Extract-Pattern.jpg” alt=”Incremental Extract Pattern” show_in_lightbox=”on” url_new_window=”off” animation=”off” sticky=”off” align=”center” force_fullwidth=”off” always_center_on_mobile=”on” use_border_color=”off” border_color=”#ffffff” border_style=”solid” title_text=”Incremental Extract Pattern”] [/et_pb_image][et_pb_text admin_label=”Text” background_layout=”light” text_orientation=”left” text_font_size=”14″ use_border_color=”off” border_color=”#ffffff” border_style=”solid”] It is possible to define the source of the extract as either a table/view or as a query. To define the incremental extract for a table/view, first select the table/view in the ‘Table or View’ drop down. See below (click to enlarge ). [/et_pb_text][et_pb_image admin_label=”Incremental Extract Table Method” src=”https://www.dimodelo.com/wp-content/uploads/Incremental-Extract-Table-Method.jpg” show_in_lightbox=”on” url_new_window=”off” animation=”off” sticky=”off” align=”center” force_fullwidth=”off” always_center_on_mobile=”on” use_border_color=”off” border_color=”#ffffff” border_style=”solid” alt=”Incremental Extract Table Method” title_text=”Incremental Extract Table Method”] [/et_pb_image][et_pb_text admin_label=”Text” background_layout=”light” text_orientation=”left” text_font_size=”14″ use_border_color=”off” border_color=”#ffffff” border_style=”solid”] Then in ‘Change Data Identifier’ table, select the column(s) that are the ‘Change Data Identifier’ columns. You can select any column in the table/view. It’s also necessary to supply an ‘Initial Load Value’. The ‘Initial Load Value’ is used in the Extract the first time it runs. The idea is to set this to a very low value, so that the initial run of Extract does a Full Extract, selecting all rows in the source. You can tailor this to your needs. For example you may only want rows for the past 3 years. If you have more complex Extract requirements, it may be necessary to write a query. In this case it is still possible to do an incremental Extract. If you have more complex Extract requirements, it may be necessary to write a query. In this case it is still possible to do an incremental Extract. See below (click to enlarge ). [/et_pb_text][et_pb_image admin_label=”Incremental Extract Query Method” src=”https://www.dimodelo.com/wp-content/uploads/Incremental-Extract-Query-Method.jpg” alt=”Incremental Extract Query Method” title_text=”Incremental Extract Query Method” show_in_lightbox=”on” url_new_window=”off” animation=”off” sticky=”off” align=”center” force_fullwidth=”off” always_center_on_mobile=”on” use_border_color=”off” border_color=”#ffffff” border_style=”solid”] [/et_pb_image][et_pb_text admin_label=”Text” background_layout=”light” text_orientation=”left” text_font_size=”14″ use_border_color=”off” border_color=”#ffffff” border_style=”solid”]

In the query you must refer to your ‘Change Data Identifier’ columns as @ChangeDataIdentifierColumnName where @ChangeDataIdentifierColumnName  is any name you choose. The query should be something like:

SELECT …
FROM …
WHERE ChangeDataIdentifierColumnName > @ChangeDataIdentifierColumnName

However you can write much more sophisticated queries if necessary. Then in ‘Change Data Identifier’ table, select the column(s) that are defined as @ChangeDataIdentifierColumnName in the query and supply an ‘Initial Load Value’ value. The ‘Initial Load Value’ is used in the Extract the first time it runs. The idea is to set this to a very low value, so that the initial run of Extract does a Full Extract, selecting all rows in the source. You can tailor this to your needs. For example you may only want rows for the past 3 years.

How the generated code works

As stated previously, each time the Extract is executed Dimodelo saves in the control Parameters table the maximum value that was extracted for each of the ‘Change Data Identifier’ column(s). When you define your source as a table/view, Dimodelo generates a query with the appropriate @ChangeDataIdentifierColumnName placeholders for each ‘Change Data Identifier’. The query would be something like

SELECT *
FROM Table
WHERE ChangeDataIdenitifierColumnName > @ChangeDataIdentifierColumnName

This query is used as the query for the source component of the SSIS package data flow. When you define your source as a query, Dimodelo just sets the query in the source component of the SSIS package data flow to the provided query. Before SSIS package data flow is executed, the package retrieves the maximum value of each ‘Change Data Identifier’ column(s) that was stored in the Control Parameters table by the previous invocation of the Extract.  The maximum value is substituted for the @ChangeDataIdentifierColumnName in the source query prior to the data flow being executed. It is also possible to include Dimodelo’s standard batch control properties (like batch_effective_date, batch_execution_id) as parameters in the query. Just include them as @controlpropertyname. Read the Managing and Monitoring ETL batches Guide for more information about batch control properties. This works for both ODBC sources (which can take a query) and OLE DB sources. This doesn’t work for excel and file sources, by their very nature usually represent a batch of records, or are small in size. If you need to do this with a file source, then connect to it as an ODBC source.

[/et_pb_text][/et_pb_column][/et_pb_row][/et_pb_section]

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.