Surrogate key is term mostly used in data warehouse environments. We might not come by this term in application production databases simply because there is no need to generate these keys.
We can do just fine with primary keys. It is much needed to create surrogate keys in data warehouses for many reasons, read further.
What is Surrogate Key?
Surrogate keys are primary and foreign keys generated on the data warehouse level instead of being taken out of the source system (business key). Primary (business) key on the source can be e.g. customer number, number of order or product etc.
There are many options how to generate surrogate key:
- IDENTITY – option where the key is generated automatically. It is numeric format starting from zero. Data types can be INTEGER (TINYINT, SMALLINT, INT, BIGINT) or NUMERIC, respectively DECIMAL
- SEQUENCE – another way of generating primary/surrogate key is by creating sequence-type object in a database. As with IDENTITY, SEQUENCE can use data type INT, NUMERIC or DECIMAL. This type opf object can provide sequential numbers on demand.
- NONSEQUENTIAL GRID – yet another option is to generate unique identifier of nonsequential GUID type of UNIQUEIDENTIFIER data type using function NEWID()
- SEQUENTIAL GUID – sequential GUID can be generated using T-SQL function NEWSEQUENTIALID()
- Other solution – last possible way is to use some primary key generator or create your own logic
Why is it useful to use surrogate keys in data warehouse?
1) Surrogate keys are independent on changes in source systems
When building a data warehouse it is good to know that it is long-term and strategical matter. It can and most probably will overpass lifetime of some source system. It is also not good to create dependencies between data warehouse and primary system.
2) Surrogate keys have lower storage requirements
Regarding surogate keys, I mostly use IDENTITY AND SEQUENCE. I am not a fan of GUID surrogate keys for performance reasons:
- UNIQUEIDENTIFIER = 16 bytes
- INT = 4 bytes
- BIGINT = 8 bytes
3) JOIN optimization using surrogate keys
It is crucial for data warehouse to have their architecture optimized mainly for queries. We use multiple techniques as schematic type to achieve that: (star scheme, snowflake scheme), structure of fact and dimensional tables etc. One of the influencers of data warehouse performance is also structure of referential integrity – e.g. primary and foreign keys. Surrogate keys of IDENTITY AND SEQUENCE type are extremely worth using.
4) Possibility of using SCD (Slowly changing dimensions)
If we use natural or business key from primary system as a primary key we complicate possibility of record historization application = application SCD (type 2 and higher), so-called slowly changing dimension. The principle of historization is creation of a new record in data wareouse in case when change of some attribute is made on the source system.
5) Unknown value and Surrogate key
You will probably find unknown value as “NULL” in the primary system. It however should not appear in precisely designed data warehouses and data marts. It is suitable to replace NULL values via Surrogate keys. Using surrogate keys in data warehouse comes with yet another advantage – NULL values can be further divided into:
- Empty value being error condition (-1)
- Empty value being allowed value (-2)
Interesting article on Surrogate keys written by Ralph Kimball can be found on Kimball Group web, here