Fact Performance and Partitions

Use the defaultBufferSize setting to improve the throughput of SSIS package.

Key things you need to know:

  1. Each component in an SSIS data flow is assigned a memory buffer. The default is about 10MB.
  2. When your source is a wide table (i.e. has many or large columns), each row takes up a large amount of memory.
  3. If you have a wide source table then the throughput is constrained to the number of rows that SSIS can fit into the buffer.
  4. If you increase the buffer size you improve throughput.

In SSIS packages generated by Dimodelo Data Warehouse Studio ( A Data Warehouse Automation tool for the Microsoft SQL Server platform), DefaultBufferMaxRows is set 30000 and DefaultBufferSize to 31457380.

To change the defaults on each SSIS package Dimodelo Data Warehouse Studio generates, do the following:

On the:

  •  Advanced tab in the Extract dialog of the Staging table editor
  • or ETL dialog of the Dimension/Fact editors

Add the tags shown in the image below (click to expand)

maxbuffersize

In this example the DefaultBufferMaxRows is set 30000 (same as default for Dimodelo generated packages), and the DefaultBufferSize is increased to 94372140 (about 90MB).

Reference: Data Flow Performance

Active Partition

One of the benefits of Data Warehousing is to remove the burden of keeping historical data from operational systems. However if an operational system does not keep all history, this presents an issue for this type of pattern which implements internal change data capture. How does the pattern differentiate between a legitimate delete in the source vs records that are just no longer being kept? To address this issue, the pattern introduces the concept of an active partition period. The active partition period is a time period (starting with the current batch effective date and working back a number of days) where data in the source is still considered to be active, and can be changed or deleted. Only changes and deletes in this time period would be acted upon. Data that appears to be deleted in the source system prior to this period will not be touched, thus maintaining the history. Deletes are implemented as ‘soft deletes’ simply having their current flag set to ‘not current’. We recommend that you select as large a period as you can to ensure all changes are captured. The patterns perform fast enough to handle bulk data. This pattern can be used with the Date Range Extract Pattern to also limit the data extracted from the Source System for greater efficiency.

The pattern will work with or without an active partition. The active partition is only necessary if the source system is not keeping all historical data. To enable an active partition add the following custom meta data tag ‘DIL_Active_Partition_Period_Days’ to the tag table of the advanced tab of the Fact transformation dialog. Set the value of the tag to the number of days in the active partition. E.g. 365.

Also, the date column of the Fact that is used to determine if a row exist within the active partition must also be tagged. To tag the date column, add the tag ‘DIL_Date_Attribute’ to the Fact Editor>ETL Dialog>Transforms>Edit Mapping>Tag Table. Set both the Tag Name and Value to ‘DIL_Date_Attribute’.

The above setting generate an SQL statement in the SSIS component that selects records from the fact for comparison to the incoming staging table rows:

DECLARE @DIL_Active_Partition_Start_Date datetime = DATEADD(DAY,60* -1, ?)
SELECT
— surrogate key
fact_TimeSheet_sKey,
— business key
[TimeRecordId] as [TimeRecordId_For_Merge_Join],

— checksum management column
[Checksum] as fact_Checksum,
— row is current
Row_Is_Current
FROM fact.[fact_TimeSheet] WHERE
Date >= @DIL_Active_Partition_Start_Date

ORDER BY
[TimeRecordId]

The ? is replaced with the Current_batch_Effective_Date stored in the Control_Properties table in the data warehouse. Current_batch_Effective_Date is updated every time the batch runs to midnight of the day before the current date.

So the query extracts all data in the fact >=  Current_batch_Effective_Date – 60 days.

This pattern is designed to be used in conjunction with the staging table Date Range Extract Pattern to improve performance, or cater for sources that don’t keep all history. Ensure that the ‘Extract_Period_Days’ tag in the Staging table is set to a value equal the Active Partition Period Days.