Tuesday, August 31, 2010

Handling Slowly Changing Dimensions in SSIS

http://blogs.msdn.com/b/mattm/archive/2009/11/13/handling-slowly-changing-dimensions-in-ssis.aspx

I had a great time at PASS last week, and had a chance to talk to a lot of different SSIS users. One of the big topics seemed to be Slowly Changing Dimensions – I had a number of people ask for various improvements to the current Slowly Changing Dimension Transform in SSIS, and also ask for recommended alternatives in the meantime. I thought I’d summarize some of the more popular approaches I’ve seen, and see if anyone else has some alternatives.

Slowly Changing Dimension Wizard

You might have already tried the Slowly Changing Dimension Wizard that comes with SSIS 2005 and 2008 (and there are a number of good tutorials out there if you haven’t).
Outputs from the Slowly Changing Dimension Wizard
The SCD Wizard has a few things going for it - it’s quick and easy to implement, it handles most SCD scenarios out of the box, and its multi-component approach means you can customize it with the functionality you need.
It does have some pretty big limitations, however, which end up being a deal breaker for a lot of people.
A major inhibitor is the performance of the transform. It doesn’t perform that well for a couple of different reasons:
  1. The data lookups are not cached – each row results in a SQL Query
  2. OLE DB Command does row by row updates
  3. OLE DB Destination added by the wizard doesn’t use FastLoad by default (which can be easily changed in the OLE DB Destination editor UI)
Another downside to the transform is the “one way” nature of the wizard – running it again (to change columns, for example) means you’ll lose any customizations you might have made to the other transforms.
I recommend using the wizard for simple dimensions, where you’re not processing a lot of data. If performance is a concern, consider one of the following approaches.

Using MERGE

I came across this tip from the Kimball Group when I was putting together my Merge & CDC talk last year.
Using the SQL MERGE Statement for Slowly Changing Dimension Processing
In this approach, you write all of your incoming data to a staging table, and then use Execute SQL Tasks to run MERGE statements (you actually have to do two passes – one for Type 1 changes, and one for Type 2 – see the details in the tip above). I posted the sample packages and code I used in a previous blog post.
The performance in this approach is very good (although it moves the bulk of the work to the database machine, which might not be what you want). I recommend it if you don’t mind staging the data, writing custom SQL, or can’t use a 3rd party component in your environment.

Kimball Method SSIS Slowly Changing Dimension Component

I’ve heard great things about Todd McDermid’s custom SCD Transform. Instead of doing row by row lookups, this transform takes in the dimension as an input. This makes the comparison much faster than the stock SSIS version. It wraps up all of the functionality into a single transform, which is great if you’re following the Kimball methodology.

Table Difference Component

I had the chance to meet with the SQLBI.EU guys at PASS, and they mentioned their Table Difference component. I haven’t tried it out myself, but I remembered an email from one of the SQL Rangers (Binh Cao) that suggested this component for SCD processing. I’ve included his write-up here:
Table difference is an SSIS custom component designed to simplify the management of slowly changing dimensions and – in general – to check the differences between two tables or data flow with SSIS.
The component receives input from two sorted sources and generates different outputs for unchanged, new, deleted or updated rows.
clip_image001
  • Unchanged rows (are the data rows that are the same in both inputs)
  • Deleted rows (are the data rows that appear in old source but not in new source)
  • New rows (are the data rows that appear in new source but not in old source)
  • Updated rows (are the data rows that appear in both flows but something is changed)
The inputs MUST be sorted and have a collection of fields (keys) that let the component decide when two rows from the inputs represent the same row, but this is easily accomplished by SQL Server with a simple “order by” and a convenient index; moreover the SCD do normally maintain an index by the business key, so the sorting requirement is easily accomplished and do not represent a problem.
Clicking on TableDifference control gives the following window.
clip_image002
TableDifference analyzes all the columns in both inputs and compares their names. If the name of two columns and their corresponding types are identical, TableDifference adds them to the available columns to manage.
If the flows are sorted, their sort columns will be marked as key fields, using the same order in which they appear in the sort.
All other columns are assigned a standard Update ID of 10 and are managed as comparable columns.
Using the component editor, you need to provide the following information for the columns:
Check Option: you can choose the column type between:
  • Key field:  column will be used to detect when two rows from the inputs represent the same row. Beware that the inputs must be sorted by those columns
  • Compare:  column will be compared one by one to detect differences
  • Prefer NEW: columns will be copied from the NEW input directly into the output, no check
  • Prefer OLD: columns will be copied from the OLD input directly into the output, no check
KeyOrder: If a column is of type “Key Field” it is the order under which the field appear under the “order by” clause of your query. Beware that the component do not check for the correct sorting sequence, it is up to you to provide this information.
Update ID: Each different UpdateID creates a different output flow. If you need to detect when a change appears in some column you can use different update ID. Beware that the lowest update ID wins, i.e. if AccountNumber has update id of 10 and AddressLine1 has update id of 20, then Accountnumber will be checked first and if a change is detected, the row will go to update output 10, no matter if AddressLine has a difference.
clip_image003
Outputs Panel gives option to choose which output to enable as well as to name and describe the output.
clip_image004
Output Details allows selection of the columns for each output.  Here columns that are not needed for an output can be disabled.  The picture shows an example of the DELETED output which only have the Customer Key column in its output.  The less columns in the output, the better the performance of the component.
clip_image005
In the Misc Options tab, string comparisons definition can be defined:
  • The culture ID to use to perform the comparison. If not specified TableDifference will use the culture ID of the running task. The default is “empty”.
  • If you want it to ignore casing during string comparisons. The default is unchecked so TableDifference will perform comparison considering case.
In the Warnings panel, it will list any unused column from the input. As you might recall, if two columns are not identical regarding name and type, TableDifference will ignore them. This might be an error but the decision is up to you. By checking the warnings panel you can see if TableDifference is working with all the columns you need it to compare.

1 comment:

James Zicrov said...

Thank you so much for providing information about SSIS and related tools and technicalities which solve a large set of complex problems in one go.

SSIS Postgresql Read