In the previous article, Slowly Changing Dimension (SCD 2) – Snapshots, I demonstrated how to implement historical tracking of dimensional data in the Silver layer in Fabric using dbt snapshots. These snapshots are already prepared and provide a complete history of dimensional changes (via the valid_from and valid_to attributes). In the Gold layer, however, we need to process this data further so that:
- Dimension tables are fully prepared for reporting purposes
- Processing is fast and reliable
- The data model is resilient to human error when working with it
What Is Addressed in the Gold Layer, and What Are Our Priorities?
In the Gold layer, we create the dimensional tables – a relatively intensive process, but the reward is a robust and reliable data model. Our priorities are:
- Creating well-formatted SQL scripts
- Generating unique surrogate keys deterministically – using the HASHBYTES('SHA2_256') function
- Creating dummy records (SK -1) so that INNER JOINs can be used safely across the model, as we consistently implement fallback -1 values
- Setting sentinel values for the temporal validity of dimensions
- Converting raw Silver data into the final format (casts, transformations)
- Managing incremental recalculation of data during normal updates and supporting full-refresh recalculation
This process produces robust dimensional tables that can be operated safely. Thanks to the dummy records, junior colleagues and power users do not need to worry about join types (LEFT/RIGHT/…) and can join tables in the Gold model using INNER JOINs without issues.
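To illustrate the benefit of the fallback records: in the sketch below, f_aw_sales and its sk_aw_address column are hypothetical names (a fact table is not part of this article). Because every fact row carries either a real hash key or the fallback -1, the INNER JOIN never silently drops rows with a missing address:

```sql
-- Hypothetical consumer query; f_aw_sales is an illustrative fact table.
-- Rows without a known address carry the SK -1 foreign key and therefore
-- match the 'N/A' dummy record instead of disappearing from the result.
SELECT
    f.sales_amount,
    d.city,
    d.country_region
FROM f_aw_sales AS f
INNER JOIN d_aw_address AS d
    ON f.sk_aw_address = d.sk_aw_address;
```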
Implementation in dbt – Incremental and Full Refresh Processing
The script below illustrates the implementation of:
- dimension d_aw_address (destination)
- from the historized snapshot st_aw_address (source)
It uses an incremental strategy in delete+insert mode, meaning that on each run, the records corresponding to changes are first deleted and the new versions are then inserted. When dbt runs normally, the incremental code in the WHERE clause is active – only records that have changed since the last Gold table update (dbt_updated_at) are loaded from Silver.
The delete+insert strategy seems safer and more resilient to data integrity collisions compared to the merge strategy.
Code below:
{{ config(
    materialized='incremental',
    unique_key='sk_aw_address',
    incremental_strategy='delete+insert',
    schema='03_gold',
    alias='d_aw_address'
) }}

WITH snapshot_table AS (
    SELECT
        {{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system', 'driving.dbt_valid_from']) }} AS sk_aw_address,
        {{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system']) }} AS sk_aw_address_master,
        driving.source_system AS dbt_id_sourcesystem,
        driving.source_businesskey AS dbt_id_business_key,
        CAST(
            CASE
                WHEN driving.dbt_valid_to IS NULL THEN 1
                ELSE 0
            END AS BIT) AS dbt_current_flag,
        CAST(driving.dbt_valid_from AS DATE) AS dbt_valid_from,
        CAST(COALESCE(driving.dbt_valid_to, '3000-12-31') AS DATE) AS dbt_valid_to,
        CAST(driving.dbt_valid_from AS DATE) AS mod_valid_from,
        CAST(
            CASE
                WHEN driving.dbt_valid_to IS NULL THEN '3000-12-31'
                ELSE DATEADD(DAY, -1, driving.dbt_valid_to)
            END AS DATE) AS mod_valid_to,
        CAST(driving.dbt_updated_at AS DATE) AS dbt_updated_at,
        CAST(ModifiedDate AS DATE) AS sk_modifieddate,
        City AS city,
        AddressLine1 AS address_line_1,
        AddressLine2 AS address_line_2,
        StateProvince AS state_province,
        PostalCode AS postal_code,
        CountryRegion AS country_region
    FROM {{ ref('st_aw_address') }} driving
    WHERE 1=1
    {% if is_incremental() %}
        AND driving.dbt_updated_at > (
            SELECT MAX(dbt_updated_at)
            FROM {{ this }}
            WHERE dbt_id_business_key != 'N/A'
        )
    {% endif %}

    UNION ALL

    SELECT
        -1 AS sk_aw_address,
        -1 AS sk_aw_address_master,
        'N/A' AS dbt_id_sourcesystem,
        'N/A' AS dbt_id_business_key,
        CAST(1 AS BIT) AS dbt_current_flag,
        CAST('1900-01-01' AS DATE) AS dbt_valid_from,
        CAST('3000-12-31' AS DATE) AS dbt_valid_to,
        CAST('1900-01-01' AS DATE) AS mod_valid_from,
        CAST('3000-12-31' AS DATE) AS mod_valid_to,
        CAST('1900-01-01' AS DATE) AS dbt_updated_at,
        CAST('1900-01-01' AS DATE) AS sk_modifieddate,
        'N/A' AS city,
        'N/A' AS address_line_1,
        'N/A' AS address_line_2,
        'N/A' AS state_province,
        'N/A' AS postal_code,
        'N/A' AS country_region
)

SELECT
    sk_aw_address,
    sk_aw_address_master,
    dbt_id_sourcesystem,
    dbt_id_business_key,
    dbt_current_flag,
    dbt_valid_from,
    dbt_valid_to,
    mod_valid_from,
    mod_valid_to,
    dbt_updated_at,
    sk_modifieddate,
    city,
    address_line_1,
    address_line_2,
    state_province,
    postal_code,
    country_region
FROM snapshot_table
WHERE 1=1
Dimensional Table Model Logic
The d_aw_address model creates the final address dimension from historical snapshots stored in the Silver layer. The main implementation principles are:
- Surrogate keys – The primary key sk_aw_address is generated from the combination of the business key, source system, and valid_from date. This triple combination ensures uniqueness for each historical version in any table – because the metadata is correctly prepared (see Fabric – Bronze: Data ingestion into Delta tables using pipeline (notebook)).
- Current record – Marked with dbt_current_flag = 1. The flag is set based on whether dbt_valid_to is NULL – indicating the latest known version of the record.
- Sentinel values – For records with open intervals, dbt_valid_to is replaced with 3000-12-31. Similarly, 1900-01-01 is used as a technical start date.
- Incremental mode – Using is_incremental(), only new or changed records are loaded based on dbt_updated_at, ensuring that only updated Silver records are processed.
- Full refresh mode – During a full refresh (dbt run --full-refresh), the model is fully regenerated, eliminating inconsistencies after major source changes.
- Dummy record "N/A" – At the end of the script, a record with "N/A" values is added as a technical placeholder for artificial foreign keys.
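For clarity, on an incremental run the is_incremental() block compiles into a plain SQL filter along these lines (the relation name shown is illustrative – {{ this }} resolves to the deployed d_aw_address table):

```sql
-- Sketch of the WHERE clause after Jinja compilation on an incremental run.
-- Only Silver rows newer than the Gold table's watermark are selected;
-- the 'N/A' dummy record is excluded from the watermark lookup.
WHERE 1=1
    AND driving.dbt_updated_at > (
        SELECT MAX(dbt_updated_at)
        FROM [03_gold].[d_aw_address]
        WHERE dbt_id_business_key != 'N/A'
    )
```

On a full refresh (dbt run --full-refresh), is_incremental() evaluates to false, the filter is omitted, and the table is rebuilt from the entire snapshot history.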
Generating Surrogate Keys Using Custom Macro – SHA2_256
To ensure uniqueness for each historical version, we use a custom macro to generate surrogate keys. This macro allows flexible combination of any list of columns into a hash used as a primary and master key.
{% macro generate_custom_surrogate_key(list_of_columns) %}
    {%- set concatenated_string -%}
        {%- for column in list_of_columns -%}
            CAST({{ column }} AS VARCHAR(MAX))
            {%- if not loop.last -%} + '_' + {%- endif -%}
        {%- endfor -%}
    {%- endset -%}
    HASHBYTES('SHA2_256', {{ concatenated_string }})
{% endmacro %}
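For the sk_aw_address call used in the model above, the macro compiles to roughly the following T-SQL:

```sql
-- Compiled output of the macro for the primary-key call:
-- generate_custom_surrogate_key(['driving.source_businesskey',
--                                'driving.source_system',
--                                'driving.dbt_valid_from'])
HASHBYTES('SHA2_256',
    CAST(driving.source_businesskey AS VARCHAR(MAX))
    + '_' +
    CAST(driving.source_system AS VARCHAR(MAX))
    + '_' +
    CAST(driving.dbt_valid_from AS VARCHAR(MAX))
) AS sk_aw_address
```

Note that in T-SQL, concatenating a NULL with + yields NULL; this is safe here because the key columns come from Silver metadata and are never NULL.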
Macro Description
- Flexible input – Accepts a list of columns to include in the surrogate key
- Text conversion – Each column is cast to VARCHAR(MAX) to safely combine values of different data types (Bronze and Silver types are already converted to STRING to prevent type conversion issues)
- Value separation – Values are concatenated with underscores to avoid collisions between different combinations
- Hashing – The concatenated string is processed with HASHBYTES('SHA2_256') to generate a 256-bit hash used as the surrogate key
In the d_aw_address model, the macro is used to generate surrogate keys for historical and current records:
{{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system', 'driving.dbt_valid_from']) }} AS sk_aw_address
This ensures each address version is uniquely identified by the combination of business key, source system, and record start validity. This triple combination prevents duplicates in any table across the model.
The second macro call generates a stable master key (shared by all historical versions with the same business key), which is important for SCD 1 models:
{{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system']) }} AS sk_aw_address_master
Macro Benefits
- Consistent keys across all models
- Minimized risk of historical record collisions
- Flexible input columns for different dimensions
- Easy maintenance – updating key generation logic requires modifying only the macro
Key advantage for SCD 1 – Deterministic foreign key generation eliminates the need for physical joins to dimensions when building fact tables (or other dimensional tables with foreign keys). This removes data dependencies between dimensions and facts, allowing models to be processed asynchronously or in parallel without strict orchestration in dbt runs. The dbt dependency graph (ref()) can therefore be limited to logical relationships rather than execution order.
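A minimal sketch of this pattern, assuming a hypothetical fact staging model st_aw_sales whose address_businesskey and source_system columns are illustrative names: the fact model derives the foreign key with the same macro and the same column combination as sk_aw_address_master, so it matches the dimension deterministically without any join.

```sql
-- Hypothetical fact model; st_aw_sales and its columns are assumptions.
-- The foreign key is computed from the same inputs as the dimension's
-- master key, so no physical join to d_aw_address is needed at build time.
SELECT
    {{ generate_custom_surrogate_key(['s.address_businesskey', 's.source_system']) }} AS sk_aw_address_master,
    s.sales_amount,
    s.order_date
FROM {{ ref('st_aw_sales') }} s
```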
Conclusion
Implementing SCD Type 2 using Silver layer snapshots and extending them to the Gold layer via an incremental dbt model is a proven and stable method for managing historical dimensions. This approach ensures full control over temporal validity, unique identification of historical versions, and efficient incremental reloads without needing to rebuild the entire warehouse on each change.
Surrogate key generation using hashes is a proven approach, providing additional benefits for SCD 1 dimensions, where dependency orchestration is entirely eliminated and artificial foreign keys can be generated independently of the source table.