Saturday, May 17, 2014

SSIS Lookups with Integers vs. non-integers


One of the most common SSIS Data Flow objects that a SSIS developer will use is the Lookup Data Flow task. When I’m developing for your typical data warehouse project where we need to populate the fact tables by joining the primary surrogate keys to foreign surrogate keys, or primary keys to foreign keys, we typically (or should always) use integers as surrogate or primarily key. Now this may be obviously since primary and surrogate keys often are identity type columns and the required data type is integer. However, there is nothing stopping you from using columns that are not integer data types as lookup relationships. In the image below, you can see the typical mapping relationship for SKU (In this example SKU was a Varchar data type)



Naturally, you would then create another data flow task that will sent matches columns to a slowing changing dimensions (SCD) data flow task, and if you’re using the SCD data flow task provided by SQL 2012, then your likely going to use the same column as your business key for the SCD wizard. This would subsequently create the OLE DB Command that uses the selected business key as the joining column. This is where the problem starts because you are now joining non-integer types in SSIS. This will become a major problem when you start to use lots of data as the updates will be extremely slow. The image below points to the Where statement that should only use integers



Now I’m not saying using non-integer data types won’t work, I’m, saying that it will work SLOWLY compared to integers. For example, just updating 65K rows had taking about 6 minutes compared to less than 10 seconds in my labs. Yes, it’s really a night and day difference!
Basically, avoid using alphanumeric/strings (or essentially any non-integer data type) for lookups relationships when possible. This is particularly true when using OLE DB Command (for your SCD operations) to update tables that match when doing the lookup.  You'll find DBAs that will say it makes no difference. However, when you'll joining fact tables that could contain millions of rows, using integers makes the big difference.

No comments: