One of the most common SSIS Data Flow objects that a SSIS
developer will use is the Lookup Data Flow task. When I’m developing for your
typical data warehouse project where we need to populate the fact tables by
joining the primary surrogate keys to foreign surrogate keys, or primary keys to foreign keys,
we typically (or should always) use integers as surrogate or primarily key. Now
this may be obviously since primary and surrogate keys often are identity type columns and the required data
type is integer. However, there is nothing stopping you from using columns that
are not integer data types as lookup relationships. In the image below, you can see the
typical mapping relationship for SKU (In this example SKU was a Varchar data type)
Naturally, you would then create another data flow task that
will sent matches columns to a slowing changing dimensions (SCD) data flow task, and if you’re
using the SCD data flow task provided by SQL 2012, then your likely going to
use the same column as your business key for the SCD wizard. This would subsequently create the OLE DB Command that uses the selected business key as the joining
column. This is where the problem starts because you are now joining non-integer
types in SSIS. This will become a major problem when you start to use lots of
data as the updates will be extremely slow. The image below points to the Where statement that should only use integers
Now I’m not saying using non-integer data types won’t work,
I’m, saying that it will work SLOWLY compared to integers. For example, just
updating 65K rows had taking about 6 minutes compared to less than 10 seconds in
my labs. Yes, it’s really a night and day difference!
Basically, avoid using alphanumeric/strings (or essentially any
non-integer data type) for lookups relationships when possible. This is particularly true
when using OLE DB Command (for your SCD operations) to update tables that match
when doing the lookup. You'll find DBAs that will say it makes no difference. However, when you'll joining fact tables that could contain millions of rows, using integers makes the big difference.
No comments:
Post a Comment