Tuesday, August 31, 2010

Optimizing the Slowly Changing Dimension Wizard

http://blogs.msdn.com/b/mattm/archive/2010/08/05/optimizing-the-slowly-changing-dimension-wizard.aspx

As a follow-up to my previous post about SCD processing in SSIS, I thought I’d go deeper into using the built-in Slowly Changing Dimension Wizard. While there are definitely more efficient ways to do SCD processing in SSIS, there are some optimizations you can apply to the components that the wizard outputs that might make it more acceptable for your environment.
First, let’s take a look at the Wizard’s strengths. Besides the advantage of not having to write any SQL by hand, there are two key scenarios the SCD Wizard is optimized for:
  1. Small number of change rows - Most ETL best practices say that you should perform change data capture (CDC) at the source (or as close to the source as possible). The Wizard was not designed to process your entire source table (like the Kimball Method SCD component) – for example, it doesn’t detect “deletes”.
  2. Large dimensions – If you are dealing with large dimensions, SCD approaches that read the entire reference table may not be the best fit. For example, with 5-10K change rows coming in against a 2 million-row dimension, the majority of the data flow time will be spent doing a full scan of the dimension table.
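The "small number of change rows" scenario assumes change detection happens upstream of the data flow. As a minimal sketch, a watermark query can extract only the changed rows before they ever reach the package (all table, column, and variable names here are hypothetical):

```sql
-- Pull only rows modified since the last successful load.
-- @LastLoadDate would come from an ETL control table; names are illustrative.
SELECT CustomerBusinessKey, CustomerName, AddressLine1, ModifiedDate
FROM   dbo.SourceCustomer
WHERE  ModifiedDate > @LastLoadDate;   -- watermark captured by the previous run
```

This keeps the row count flowing into the SCD components small, which is exactly the case the Wizard is designed for.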
If your scenario doesn’t match the above, you might want to consider just using one of the alternate approaches directly. If it does, or if you don’t want to use any custom components (or hand craft the SQL required for a MERGE statement, for example), consider making the following optimizations:
SCD Wizard output

Slowly Changing Dimension transform (Red)

The first transform does not cache any lookup results from the reference dimension, so every incoming row results in a query against the database. By default, the wizard will open a new connection to the database on each query. For a performance gain (and less resource usage), you can set the RetainSameConnection property of the wizard’s connection manager to True to re-use the same connection on each query.

OLE DB Command transforms (Green)

The wizard will output three separate OLE DB Command transforms (which perform row-by-row updates). You will get a big performance boost by placing the rows in a staging table and performing the updates in a single batch once the data flow completes. Another option is to use the Batch Destination component, which is available on CodePlex.
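For example, instead of letting an OLE DB Command update each row individually, you could land the Type 1 change rows in a staging table and run one set-based UPDATE from an Execute SQL Task after the data flow finishes. This is only a sketch; the table and column names are hypothetical:

```sql
-- Single batch update from the staging table; replaces thousands of
-- row-by-row UPDATE statements issued by the OLE DB Command transform.
UPDATE dim
SET    dim.CustomerName = stg.CustomerName,
       dim.AddressLine1 = stg.AddressLine1
FROM   dbo.DimCustomer AS dim
JOIN   dbo.Staging_Type1Updates AS stg
       ON dim.CustomerBusinessKey = stg.CustomerBusinessKey;

TRUNCATE TABLE dbo.Staging_Type1Updates;  -- reset the staging table for the next run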

OLE DB Destination (Blue)

The default destination that the Wizard outputs will have Fast Load disabled (to avoid locking issues). In many cases (mostly depending on the number of rows you’re processing), you can enable Fast Load for an immediate performance gain. To avoid potential deadlocking issues, another option is to use a staging table, and move the data over to the final destination table once the data flow is complete using an INSERT INTO … SELECT statement.
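The staging variant might look like this: the data flow bulk loads into the staging table with Fast Load, and an Execute SQL Task moves the rows afterwards in one set-based statement. Names below are illustrative:

```sql
-- Move the new rows into the dimension in a single statement,
-- keeping lock contention out of the data flow itself.
INSERT INTO dbo.DimCustomer (CustomerBusinessKey, CustomerName, StartDate, EndDate)
SELECT CustomerBusinessKey, CustomerName, StartDate, EndDate
FROM   dbo.Staging_NewRows;

TRUNCATE TABLE dbo.Staging_NewRows;
```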
Using the above optimizations, I was able to bring down the processing time of a 200k row change set (against a 100k row dimension table) from 60 minutes to 14 minutes. Note however, that processing a similar change set using the SQL MERGE statement took under 3 minutes to complete.
I go into more detail about these optimizations (and the difference between SCD processing approaches) in my Performance Design Patterns talk.

Handling Slowly Changing Dimensions in SSIS

http://blogs.msdn.com/b/mattm/archive/2009/11/13/handling-slowly-changing-dimensions-in-ssis.aspx

I had a great time at PASS last week, and had a chance to talk to a lot of different SSIS users. One of the big topics seemed to be Slowly Changing Dimensions – I had a number of people ask for various improvements to the current Slowly Changing Dimension Transform in SSIS, and also ask for recommended alternatives in the meantime. I thought I’d summarize some of the more popular approaches I’ve seen, and see if anyone else has some alternatives.

Slowly Changing Dimension Wizard

You might have already tried the Slowly Changing Dimension Wizard that comes with SSIS 2005 and 2008 (and there are a number of good tutorials out there if you haven’t).
Outputs from the Slowly Changing Dimension Wizard
The SCD Wizard has a few things going for it - it’s quick and easy to implement, it handles most SCD scenarios out of the box, and its multi-component approach means you can customize it with the functionality you need.
It does have some pretty big limitations, however, which end up being a deal breaker for a lot of people.
A major inhibitor is the performance of the transform. It doesn’t perform that well for a few different reasons:
  1. The data lookups are not cached – each row results in a SQL Query
  2. OLE DB Command does row by row updates
  3. OLE DB Destination added by the wizard doesn’t use FastLoad by default (which can be easily changed in the OLE DB Destination editor UI)
Another downside to the transform is the “one way” nature of the wizard – running it again (to change columns, for example) means you’ll lose any customizations you might have made to the other transforms.
I recommend using the wizard for simple dimensions, where you’re not processing a lot of data. If performance is a concern, consider one of the following approaches.

Using MERGE

I came across this tip from the Kimball Group when I was putting together my Merge & CDC talk last year.
Using the SQL MERGE Statement for Slowly Changing Dimension Processing
In this approach, you write all of your incoming data to a staging table, and then use Execute SQL Tasks to run MERGE statements (you actually have to do two passes – one for Type 1 changes, and one for Type 2 – see the details in the tip above). I posted the sample packages and code I used in a previous blog post.
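As a rough sketch of the two passes (loosely following the Kimball Group tip linked above; all table and column names are hypothetical, and which columns are Type 1 versus Type 2 depends on your dimension design): the Type 2 pass expires the current row and inserts a new version using the "INSERT over MERGE" pattern, and the Type 1 pass overwrites in place.

```sql
-- Pass 1: Type 2 changes - expire the current row, then insert a new version
-- from the MERGE's OUTPUT clause (the "INSERT over MERGE" pattern).
INSERT INTO dbo.DimCustomer (CustomerBusinessKey, AddressLine1, StartDate, EndDate)
SELECT CustomerBusinessKey, AddressLine1, GETDATE(), NULL
FROM (
    MERGE dbo.DimCustomer AS dim
    USING dbo.Staging_Changes AS stg
        ON dim.CustomerBusinessKey = stg.CustomerBusinessKey
       AND dim.EndDate IS NULL                              -- current row only
    WHEN MATCHED AND dim.AddressLine1 <> stg.AddressLine1   -- a Type 2 column changed
        THEN UPDATE SET dim.EndDate = GETDATE()             -- expire the current row
    OUTPUT stg.CustomerBusinessKey, stg.AddressLine1, $action AS MergeAction
) AS changes
WHERE MergeAction = 'UPDATE';

-- Pass 2: Type 1 changes - overwrite the attribute on the current row in place.
MERGE dbo.DimCustomer AS dim
USING dbo.Staging_Changes AS stg
    ON dim.CustomerBusinessKey = stg.CustomerBusinessKey
   AND dim.EndDate IS NULL
WHEN MATCHED AND dim.CustomerName <> stg.CustomerName       -- a Type 1 column changed
    THEN UPDATE SET dim.CustomerName = stg.CustomerName;
```

Both statements run against the staged change set, so the data flow itself only has to bulk load dbo.Staging_Changes.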
The performance of this approach is very good (although it moves the bulk of the work to the database machine, which might not be what you want). I recommend it if you don’t mind staging the data and writing custom SQL, or if you can’t use a third-party component in your environment.

Kimball Method SSIS Slowly Changing Dimension Component

I’ve heard great things about Todd McDermid’s custom SCD Transform. Instead of doing row by row lookups, this transform takes in the dimension as an input. This makes the comparison much faster than the stock SSIS version. It wraps up all of the functionality into a single transform, which is great if you’re following the Kimball methodology.

Table Difference Component

I had the chance to meet with the SQLBI.EU guys at PASS, and they mentioned their Table Difference component. I haven’t tried it out myself, but I remembered an email from one of the SQL Rangers (Binh Cao) that suggested this component for SCD processing. I’ve included his write-up here:
Table Difference is an SSIS custom component designed to simplify the management of slowly changing dimensions and, in general, to check the differences between two tables or data flows in SSIS.
The component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.
  • Unchanged rows: rows that are identical in both inputs
  • Deleted rows: rows that appear in the old input but not in the new one
  • New rows: rows that appear in the new input but not in the old one
  • Updated rows: rows that appear in both inputs but with changed values
The inputs MUST be sorted and must share a set of key fields that let the component decide when two rows from the inputs represent the same row. Sorting is easily accomplished in SQL Server with a simple ORDER BY backed by a suitable index; moreover, an SCD table normally maintains an index on the business key, so the sorting requirement is easily met and does not represent a problem.
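A minimal sketch of a sorted input query (illustrative names; the same ORDER BY must be used for both the old and new inputs):

```sql
-- Both source queries must return rows in the same business-key order.
SELECT CustomerBusinessKey, CustomerName, AddressLine1
FROM   dbo.DimCustomer          -- the "old" input; the "new" input is sorted identically
ORDER BY CustomerBusinessKey;
```

Note that SSIS does not infer sortedness from the query: you also need to set IsSorted to True on the source component’s output and assign SortKeyPosition on the key columns so downstream components treat the flow as sorted.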
Clicking on the TableDifference component opens its editor window.
TableDifference analyzes the columns of both inputs and compares them by name. If two columns have identical names and types, TableDifference adds them to the list of columns to manage.
If the flows are sorted, the sort columns will be marked as key fields, in the same order in which they appear in the sort.
All other columns are assigned a default Update ID of 10 and are managed as comparison columns.
Using the component editor, you need to provide the following information for each column:
Check Option: choose how the column is treated:
  • Key field: the column is used to detect when two rows from the inputs represent the same row. Beware that the inputs must be sorted by these columns
  • Compare: the columns are compared one by one to detect differences
  • Prefer NEW: the column is copied from the NEW input directly into the output, with no check
  • Prefer OLD: the column is copied from the OLD input directly into the output, with no check
KeyOrder: for a column of type “Key field”, the position of the field in the ORDER BY clause of your query. Beware that the component does not check the sorting sequence; it is up to you to provide the correct information.
Update ID: each distinct Update ID creates a separate output flow. If you need to detect when a change appears in a particular set of columns, assign them different Update IDs. Beware that the lowest Update ID wins: if AccountNumber has an Update ID of 10 and AddressLine1 has an Update ID of 20, then AccountNumber is checked first, and if a change is detected the row goes to update output 10, regardless of whether AddressLine1 also differs.
The Outputs panel lets you choose which outputs to enable, and name and describe each one.
The Output Details panel allows selecting the columns for each output; columns that are not needed for an output can be disabled. For example, a DELETED output might carry only the Customer Key column. The fewer columns in an output, the better the performance of the component.
The Misc Options tab defines how string comparisons are performed:
  • The culture ID used to perform the comparison. If not specified, TableDifference uses the culture ID of the running task. The default is empty.
  • Whether to ignore case during string comparisons. The default is unchecked, so TableDifference performs case-sensitive comparisons.
The Warnings panel lists any unused columns from the inputs. As you might recall, if two columns do not match in name and type, TableDifference ignores them. This might be an error, but the decision is up to you; by checking the Warnings panel you can confirm that TableDifference is working with all the columns you need it to compare.

BI Study Video

SSIS Performance Design Patterns (video) 

http://blogs.msdn.com/b/mattm/archive/2010/06/29/ssis-performance-design-patterns-video.aspx

Project REAL—Business Intelligence in Practice

https://www.microsoft.com/sqlserver/2005/en/us/project-real.aspx#documentation

Project REAL Documentation

The Project REAL team has provided content to help explain the parameters of the project. The following is a list of the documents available for Project REAL: