In the previous article, Slowly Changing Dimension (SCD 2) – Snapshots, I demonstrated how to implement historical tracking of dimensional data in the Silver layer in Fabric using dbt snapshots. These snapshots are already prepared and provide a complete history of dimensional changes (via the valid_from and valid_to attributes). In the Gold layer, however, we need to process this data further so that:
- Dimension tables are fully prepared for reporting purposes
- Processing is fast and reliable
- The data model is resilient to human error when working with it
What Is Addressed in the Gold Layer, and What Are Our Priorities?
In the Gold layer, we create the dimensional tables – a relatively intensive process, but the reward is a robust and reliable data model. Our priorities are:
- Creating well-formatted SQL scripts
- Generating unique surrogate keys deterministically – using the HASHBYTES('SHA2_256') function
- Creating dummy records (SK -1) so that INNER JOINs can be used safely across the model, as we consistently implement fallback -1 values
- Setting sentinel values for the temporal validity of dimensions
- Converting raw Silver data into the final format (casts, transformations)
- Managing incremental recalculation of data during normal updates and supporting full-refresh recalculation
This process produces robust dimensional tables that can be operated safely. Thanks to the dummy records, junior colleagues and power users do not need to worry about join types (LEFT/RIGHT/…) and can join tables in the Gold model using INNER JOINs without issues.
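To illustrate the benefit of the fallback records: in the sketch below, f_aw_sales and its sk_aw_address column are hypothetical names (a fact table is not part of this article). Because every fact row carries either a real hash key or the fallback -1, the INNER JOIN never silently drops rows with a missing address:

```sql
-- Hypothetical consumer query; f_aw_sales is an illustrative fact table.
-- Rows without a known address carry the SK -1 foreign key and therefore
-- match the 'N/A' dummy record instead of disappearing from the result.
SELECT
    f.sales_amount,
    d.city,
    d.country_region
FROM f_aw_sales AS f
INNER JOIN d_aw_address AS d
    ON f.sk_aw_address = d.sk_aw_address;
```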
Implementation in dbt – Incremental and Full Refresh Processing
The script below illustrates the implementation of:
- dimension d_aw_address (destination)
- from the historized snapshot st_aw_address (source)
It uses an incremental strategy in delete+insert mode, meaning that on each run, the records corresponding to changes are first deleted and the new versions are then inserted. When dbt runs normally, the incremental code in the WHERE clause is active – only records that have changed since the last Gold table update (dbt_updated_at) are loaded from Silver.
The delete+insert strategy seems safer and more resilient to data integrity collisions compared to the merge strategy.
Code below:
{{ config(
    materialized='incremental',
    unique_key='sk_aw_address',
    incremental_strategy='delete+insert',
    schema='03_gold',
    alias='d_aw_address'
) }}

WITH snapshot_table AS (
    SELECT
        {{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system', 'driving.dbt_valid_from']) }} AS sk_aw_address,
        {{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system']) }} AS sk_aw_address_master,
        driving.source_system AS dbt_id_sourcesystem,
        driving.source_businesskey AS dbt_id_business_key,
        CAST(
            CASE
                WHEN driving.dbt_valid_to IS NULL THEN 1
                ELSE 0
            END AS BIT) AS dbt_current_flag,
        CAST(driving.dbt_valid_from AS DATE) AS dbt_valid_from,
        CAST(COALESCE(driving.dbt_valid_to, '3000-12-31') AS DATE) AS dbt_valid_to,
        CAST(driving.dbt_valid_from AS DATE) AS mod_valid_from,
        CAST(
            CASE
                WHEN driving.dbt_valid_to IS NULL THEN '3000-12-31'
                ELSE DATEADD(DAY, -1, driving.dbt_valid_to)
            END AS DATE) AS mod_valid_to,
        CAST(driving.dbt_updated_at AS DATE) AS dbt_updated_at,
        CAST(ModifiedDate AS DATE) AS sk_modifieddate,
        City AS city,
        AddressLine1 AS address_line_1,
        AddressLine2 AS address_line_2,
        StateProvince AS state_province,
        PostalCode AS postal_code,
        CountryRegion AS country_region
    FROM {{ ref('st_aw_address') }} driving
    WHERE 1=1
    {% if is_incremental() %}
        AND driving.dbt_updated_at > (
            SELECT MAX(dbt_updated_at)
            FROM {{ this }}
            WHERE dbt_id_business_key != 'N/A'
        )
    {% endif %}

    UNION ALL

    SELECT
        -1 AS sk_aw_address,
        -1 AS sk_aw_address_master,
        'N/A' AS dbt_id_sourcesystem,
        'N/A' AS dbt_id_business_key,
        CAST(1 AS BIT) AS dbt_current_flag,
        CAST('1900-01-01' AS DATE) AS dbt_valid_from,
        CAST('3000-12-31' AS DATE) AS dbt_valid_to,
        CAST('1900-01-01' AS DATE) AS mod_valid_from,
        CAST('3000-12-31' AS DATE) AS mod_valid_to,
        CAST('1900-01-01' AS DATE) AS dbt_updated_at,
        CAST('1900-01-01' AS DATE) AS sk_modifieddate,
        'N/A' AS city,
        'N/A' AS address_line_1,
        'N/A' AS address_line_2,
        'N/A' AS state_province,
        'N/A' AS postal_code,
        'N/A' AS country_region
)

SELECT
    sk_aw_address,
    sk_aw_address_master,
    dbt_id_sourcesystem,
    dbt_id_business_key,
    dbt_current_flag,
    dbt_valid_from,
    dbt_valid_to,
    mod_valid_from,
    mod_valid_to,
    dbt_updated_at,
    sk_modifieddate,
    city,
    address_line_1,
    address_line_2,
    state_province,
    postal_code,
    country_region
FROM snapshot_table
WHERE 1=1
Dimensional Table Model Logic
The d_aw_address model creates the final address dimension from historical snapshots stored in the Silver layer. The main implementation principles are:
- Surrogate keys – The primary key sk_aw_address is generated from the combination of the business key, source system, and valid_from date. This triple combination ensures uniqueness for each historical version in any table – because the metadata is correctly prepared (see Fabric – Bronze: Data ingestion into Delta tables using pipeline (notebook)).
- Current record – Marked with dbt_current_flag = 1. The flag is set based on whether dbt_valid_to is NULL – indicating the latest known version of the record.
- Sentinel values – For records with open intervals, dbt_valid_to is replaced with 3000-12-31. Similarly, 1900-01-01 is used as a technical start date.
- Incremental mode – Using is_incremental(), only new or changed records are loaded based on dbt_updated_at, ensuring that only updated Silver records are processed.
- Full refresh mode – During a full refresh (dbt run --full-refresh), the model is fully regenerated, eliminating inconsistencies after major source changes.
- Dummy record "N/A" – At the end of the script, a record with "N/A" values is added as a technical placeholder for artificial foreign keys.
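For clarity, on an incremental run the is_incremental() block compiles into a plain SQL filter along these lines (the relation name shown is illustrative – {{ this }} resolves to the deployed d_aw_address table):

```sql
-- Sketch of the WHERE clause after Jinja compilation on an incremental run.
-- Only Silver rows newer than the Gold table's watermark are selected;
-- the 'N/A' dummy record is excluded from the watermark lookup.
WHERE 1=1
    AND driving.dbt_updated_at > (
        SELECT MAX(dbt_updated_at)
        FROM [03_gold].[d_aw_address]
        WHERE dbt_id_business_key != 'N/A'
    )
```

On a full refresh (dbt run --full-refresh), is_incremental() evaluates to false, the filter is omitted, and the table is rebuilt from the entire snapshot history.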
Generating Surrogate Keys Using Custom Macro – SHA2_256
To ensure uniqueness for each historical version, we use a custom macro to generate surrogate keys. This macro allows flexible combination of any list of columns into a hash used as a primary and master key.
{% macro generate_custom_surrogate_key(list_of_columns) %}
    {%- set concatenated_string -%}
        {%- for column in list_of_columns -%}
            CAST({{ column }} AS VARCHAR(MAX))
            {%- if not loop.last -%} + '_' + {%- endif -%}
        {%- endfor -%}
    {%- endset -%}
    HASHBYTES('SHA2_256', {{ concatenated_string }})
{% endmacro %}
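For the sk_aw_address call used in the model above, the macro compiles to roughly the following T-SQL:

```sql
-- Compiled output of the macro for the primary-key call:
-- generate_custom_surrogate_key(['driving.source_businesskey',
--                                'driving.source_system',
--                                'driving.dbt_valid_from'])
HASHBYTES('SHA2_256',
    CAST(driving.source_businesskey AS VARCHAR(MAX))
    + '_' +
    CAST(driving.source_system AS VARCHAR(MAX))
    + '_' +
    CAST(driving.dbt_valid_from AS VARCHAR(MAX))
) AS sk_aw_address
```

Note that in T-SQL, concatenating a NULL with + yields NULL; this is safe here because the key columns come from Silver metadata and are never NULL.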
Macro Description
- Flexible input – Accepts a list of columns to include in the surrogate key
- Text conversion – Each column is cast to VARCHAR(MAX) to safely combine values of different data types (Bronze and Silver types are already converted to STRING to prevent type conversion issues)
- Value separation – Values are concatenated with underscores to avoid collisions between different combinations
- Hashing – The concatenated string is processed with HASHBYTES('SHA2_256') to generate a 256-bit hash used as the surrogate key
In the d_aw_address model, the macro is used to generate surrogate keys for historical and current records:
{{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system', 'driving.dbt_valid_from']) }} AS sk_aw_address
This ensures each address version is uniquely identified by the combination of business key, source system, and record start validity. This triple combination prevents duplicates in any table across the model.
The second macro call generates a stable master key (shared by all historical versions with the same business key), which is important for SCD 1 models:
{{ generate_custom_surrogate_key(['driving.source_businesskey', 'driving.source_system']) }} AS sk_aw_address_master
Macro Benefits
- Consistent keys across all models
- Minimized risk of historical record collisions
- Flexible input columns for different dimensions
- Easy maintenance – updating key generation logic requires modifying only the macro
Key advantage for SCD 1 – Deterministic foreign key generation eliminates the need for physical joins to dimensions when building fact tables (or other dimensional tables with foreign keys). This removes data dependencies between dimensions and facts, allowing models to be processed asynchronously or in parallel without strict orchestration in dbt runs. The dbt dependency graph (ref()) can therefore be limited to logical relationships rather than execution order.
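A minimal sketch of this pattern, assuming a hypothetical fact staging model st_aw_sales whose address_businesskey and source_system columns are illustrative names: the fact model derives the foreign key with the same macro and the same column combination as sk_aw_address_master, so it matches the dimension deterministically without any join.

```sql
-- Hypothetical fact model; st_aw_sales and its columns are assumptions.
-- The foreign key is computed from the same inputs as the dimension's
-- master key, so no physical join to d_aw_address is needed at build time.
SELECT
    {{ generate_custom_surrogate_key(['s.address_businesskey', 's.source_system']) }} AS sk_aw_address_master,
    s.sales_amount,
    s.order_date
FROM {{ ref('st_aw_sales') }} s
```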
Conclusion
Implementing SCD Type 2 using Silver layer snapshots and extending them to the Gold layer via an incremental dbt model is a proven and stable method for managing historical dimensions. This approach ensures full control over temporal validity, unique identification of historical versions, and efficient incremental reloads without needing to rebuild the entire warehouse on each change.
Surrogate key generation using hashes is a proven approach, providing additional benefits for SCD 1 dimensions, where dependency orchestration is entirely eliminated and artificial foreign keys can be generated independently of the source table.