Tag: dbt

  • dbt – A quick fix to Unit Test feature with models having CTEs in DBT Fabric adapter

    dbt – A quick fix to Unit Test feature with models having CTEs in DBT Fabric adapter

    TL;DR

    DBT Fabric supports the unit test feature but doesn’t support models containing Common Table Expressions (CTEs). I set up a test case and found the issue. My solution uses two regular expressions: the first detects whether a model has CTEs, while the second extracts the final SELECT statement so that the CTE definitions can be placed on top of the generated SQL in two internal macros related to the unit test feature.

    You can find my code in this zip file on GitHub at this link:

    https://github.com/user-attachments/files/16408951/macros.zip

    The issue is raised in the DBT Fabric adapter repository on GitHub at this link:

    https://github.com/microsoft/dbt-fabric/issues/185

    The details of my findings are presented in the sections below.

    DBT Unit Test

    DBT has had a test feature for a long time; from version 1.8 onwards it is classified as a data test. This is sensible, since a data test aims to check the quality and uniqueness of columns in a model, or statistics of the whole model. Unit tests were adopted for these reasons:

    • They allow testing complex transformation logic in a data model against multiple scenarios without actually deploying it to the warehouse.
    • A failure in a data test could be the result of a bug in an internal CTE or in the transformation logic.

    Given these benefits, I was super eager to get hands-on with the feature using a test scenario like the one below.

    Note: if you are not yet familiar with unit testing, I suggest having a look at PyTest at this link: https://docs.pytest.org/en/stable/. You will find that some of its terms are also used by DBT’s unit test feature.

    Scenario

    Test Model

    I developed a conform model for Vacancy Referrals with 4 CTEs to find out which staff member referred which vacancy to which customer at what time. You can see from the screenshot that the SQL code is pretty long and has some complex logic in the collapsed sections.

    Vacancy Referral Model Structure

    Fixtures

    Fixtures are pre-prepared resources for unit tests. They can be configurations, environment variables, or mock data. In DBT unit tests, they are mock data. DBT supports mock data declared inline, in a CSV file, or in a SQL file. However, if you wish to test a model that does not yet exist in the warehouse and whose related models/views/seeds do not exist either, you have to choose SQL.

    According to the DBT documentation site, all fixtures must be placed in the fixtures folder inside the tests folder; the location is configurable if you don’t want to stick with that standard name.

    Unit Test YML

    I created a YML file declaring all unit tests that apply to the model above. All relation macros, such as source and ref, used in the model need to be overridden with mock data. Even the expectation, which is the set of expected rows, is declared in a SQL file.

    My tip is to prepare mock data in CSV files in a user-friendly editor, e.g., Excel, then convert them to SQL SELECT statements with an online tool, of which there are heaps on the internet, or with a small script like the one sketched below.

    A fixture’s mock data in SQL format
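
    For illustration, this is a minimal Python sketch of such a converter. The file name is a placeholder and every value is treated as a string; a real converter would need type-aware quoting for numbers, dates, and NULLs.

    import csv

    # Sketch: turn a CSV of mock rows into a SQL fixture built from UNION ALL selects.
    def csv_to_sql_fixture(csv_path: str) -> str:
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        selects = []
        for row in rows:
            columns = ", ".join(
                "'{}' AS {}".format(str(value).replace("'", "''"), column)
                for column, value in row.items()
            )
            selects.append("SELECT " + columns)
        return "\nUNION ALL\n".join(selects)

    # Hypothetical file name for illustration only.
    print(csv_to_sql_fixture("vacancy_referrals_fixture.csv"))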

    Execution

    Now run this command to execute the test.

    dbt test -s Stg_Vacancy_Referrals test_type:unit

    Issue

    The DBT log states that there is an error around the WITH statement, but I was pretty sure there was nothing wrong with it. After checking the log for a while and looking around the implementation of the unit test feature in DBT Fabric, I found this.

    Path: .venv/Lib/site-packages/dbt/include/fabric/macros/materializations/tests/helpers.sql

    Unit test SQL code generator

    The Fabric adapter puts main_sql, which is the actual code of my model, inside a subquery. Since my model uses CTEs, the generated SQL won’t work: T-SQL does not allow a WITH clause inside a subquery.

    Solution

    My idea is to use a regular expression to extract the final SELECT statement in my model and put all the CTEs on top of that macro’s SQL query. Fortunately, the idea is feasible because DBT exposes some Python modules, including re for regular expressions, in its Jinja template engine. Technically, I can compose an expression and test it in Python before putting it into DBT’s Jinja macros.

    Disclaimer: I am not a Regex Expert, so please do not expect my expressions would work with every coding style. You can enforce the style by a SQL lint to minimize the effort of maintaining your expressions.

    My regular expressions look something like the ones below. First, I use a regex to check whether the main_query has any CTEs. If it doesn’t have any, I just stick with the default code.

    CTE Regex Pattern
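
    As a rough illustration, the detection idea can be prototyped in plain Python before being moved into the Jinja macro. This sketch is deliberately simplified and is not the exact expression used in the macro (that one is in the screenshot above):

    import re

    # Sketch: detect whether the compiled model SQL starts with a CTE block.
    # Simplified on purpose: it will not match models that start with comments,
    # bracketed CTE names, or a CTE column list.
    CTE_PATTERN = re.compile(r"^\s*with\s+\w+\s+as\s*\(", re.IGNORECASE)

    main_query = """
    with referrals as (
        select * from stg_referrals
    )
    select * from referrals
    """

    print(bool(CTE_PATTERN.search(main_query)))  # True -> the model has CTEs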

    When the main_query has CTEs, another regex is used to extract the final SELECT statement. The final SELECT is removed from the main_query, and what remains is the CTE definition part. Next, the CTE definition part is placed on top, followed by the internal unit test CTEs. The implementation is below.

    The fix’s actual code
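
    For illustration, here is a minimal Python sketch of the same splitting idea, which can be tested before porting it into the Jinja macro. Instead of a single regular expression it uses a small parenthesis-depth scan to find the first SELECT that sits outside any parentheses, which is where the final statement of a WITH query begins; string literals and comments containing parentheses are not handled.

    import re

    # Sketch: split a CTE-based model into its CTE definitions and final SELECT
    # so the CTE block can be hoisted above the SQL generated by the macro.
    def split_ctes(main_query: str):
        depth = 0
        for match in re.finditer(r"[()]|\bselect\b", main_query, re.IGNORECASE):
            token = match.group()
            if token == "(":
                depth += 1
            elif token == ")":
                depth -= 1
            elif depth == 0:
                return main_query[:match.start()].rstrip(), main_query[match.start():]
        return "", main_query  # no CTEs: the whole query is the final SELECT

    ctes, final_select = split_ctes(
        "with referrals as (select * from stg_referrals) select * from referrals"
    )
    print(ctes)          # with referrals as (select * from stg_referrals)
    print(final_select)  # select * from referrals

    Roughly, the fixed macro then emits the CTE definition part first, followed by the internal unit test CTEs, with the extracted final SELECT used in place of the full main_sql.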

    I thought it was over and I would be able to run the unit test again without any issues. I was wrong. Another part of the process had a similar issue: the get_columns_in_query macro, which is used to find the column names of any SQL query.

    Path: .venv/Lib/site-packages/dbt/include/fabric/macros/adapters/columns.sql

    get_columns_in_query macro

    I can see the same issue here. The fix is the same as the one applied to the unit test code generator.

    The fixed version

    Result

    After applying those fixes, the unit test can now run against my model. The result is shown below.

    Unit Test result

    My model was not deployed, but the SQL was actually executed in the Fabric Warehouse.

    Summary

    I find the unit test feature super helpful. During physical data modelling, I can quickly test models without creating or deploying all the related models, using only mock data. This becomes handy as the DBT project gradually grows over time, with models built on top of other models, and it can save a lot of time when measuring the impact of a change in a very complex model on its downstream models.

    Hope you find this article helpful and informative.

  • How I put PowerBI Semantic Models in DBT Documentation Site

    How I put PowerBI Semantic Models in DBT Documentation Site

    DBT is very handy when it comes to generating a data documentation site packed with data model lineage. It is, however, disconnected from exposures, which can be any application using the models documented in DBT. Adding exposures to DBT requires a little effort: either manually writing yml files or generating them automatically, depending on what the data-consuming app or BI tool supports.

    At Workskil, we use PowerBI as the main BI tool, besides a very few Excel reports. Our team’s very first approach was to use the default format, PBIX, and deploy semantic datasets and reports separately. We do have deployment pipelines and can see historical deployments, but a question echoed in my mind: how can our team track changes happening in both datasets and reports? The Git Integration option in Workspace Settings was the only way, but it requires some basic Git and a bit of knowledge of developing SSRS reports, as well as a Premium licence.

    PowerBI Project (a preview feature) is the game changer that not only helps me bridge the gap between PowerBI and DBT but also provides a simpler way to version control PowerBI datasets and reports. Furthermore, Business Intelligence Analysts only need to learn basic Git commands, or just use Visual Studio Code or any Git GUI tool, to submit changes. Since a PowerBI Project for a semantic model stores information about data connections and queries, it sparked a new idea in me: extract table names from its model file and craft exposure yml files for DBT. Not to mention that using PowerBI Project saves us a Premium licence if we want to move away from Premium Capacity.

    graph LR
        bi_desktop[PowerBI Desktop]
    
        subgraph git_bi[PowerBI Repo]
            file[.PBIP]@{shape: doc}
        end
    
        subgraph artifact[Artifact]
            exposure[DBT exposure YML]
            doc[DBT Doc Site]
        end
    
        subgraph devops[Azure DevOps]
            bi_devops[PowerBI Pipeline]
            dbt_devops[DBT Pipeline]
        end
    
        bi_desktop l1@-->|1.push| file
        git_bi l2@-->|2.trigger| bi_devops
        bi_devops l3@-->|3.generate| exposure
        dbt_devops -.->|use| exposure
        dbt_devops l4@-->|4.generate| doc
    
        l1@{animate: true}
        l2@{animate: true}
        l3@{animate: true}
        l4@{animate: true}
    
        linkStyle 0,1,2,4 stroke: blue;

    Save A Dataset As A PowerBI Project

    The PowerBI Project save option in PowerBI Desktop can be turned on under Options/Preview Features. Note: please don’t turn on Store semantic model using TMDL format, for the following reasons:

    • TMDL is a new format and the only parser I could find is in a C# package, which is subject to change. I would need to wait for an official parser in Python, unless I were willing to implement one myself, which would cost a lot of time.
    • In contrast, the default format for every file in a PowerBI Project is JSON, which is common and has heaps of parsers in every programming language.
    PowerBI Project save option in PowerBI Desktop.

    Once a semantic model is saved as a PowerBI Project, the screenshot below shows what the folder structure looks like. The .pbip file is the main entry point for opening the dataset in PowerBI Desktop. The .Report folder is where the configuration of all pages, visuals, and everything else you see in PowerBI Desktop is stored, while all tables, references, and Power Query transformation steps are stored in the .SemanticModel folder.

    A semantic model’s file structure after it is stored as a PowerBI Project.
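
    Before parsing, the script needs to locate every semantic model definition in the PowerBI repository. A minimal sketch, assuming the default (non-TMDL) layout described above, where each .SemanticModel folder contains a JSON file named model.bim, and with a placeholder repository path:

    import json
    from pathlib import Path

    # Sketch: collect every model.bim in the repo, keyed by its semantic model folder.
    repo = Path("powerbi-repo")  # placeholder path
    models = {
        bim_file.parent.name: json.loads(bim_file.read_text(encoding="utf-8-sig"))
        for bim_file in repo.glob("**/*.SemanticModel/model.bim")
    }
    print(list(models))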

    Parse model.bim File

    Sql.Database is the main Power Query M function used to connect to and extract data from a target database. It takes three parameters when a SQL query is used; otherwise, it takes two parameters to create a connection first, before a view/table is selected in a second step. Remember these patterns, as I need them later when parsing model.bim. My approach is to build a Python script with a handful of regular expressions that extract the parameters used in a Sql.Database call, and then use a SQL parser to parse the SQL query in the Query option.
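
    As a rough illustration (not my production script), the two call patterns can be captured with regular expressions like the ones below. The sample M expression and the exact patterns are simplified assumptions; real expressions in model.bim may format the arguments differently.

    import re

    # Sketch: pull the arguments out of a Sql.Database(...) call found in model.bim.
    QUERY_CALL = re.compile(
        r'Sql\.Database\("(?P<server>[^"]+)",\s*"(?P<database>[^"]+)"'
        r'(?:,\s*\[Query\s*=\s*"(?P<query>.*?)"\])?\)',
        re.DOTALL,
    )
    TABLE_SELECTION = re.compile(r'\[Schema\s*=\s*"(?P<schema>[^"]+)",\s*Item\s*=\s*"(?P<item>[^"]+)"\]')

    m_expression = (
        'let Source = Sql.Database("sql.contoso.com", "Warehouse"), '
        'Referrals = Source{[Schema="dbo",Item="Stg_Vacancy_Referrals"]}[Data] in Referrals'
    )

    call = QUERY_CALL.search(m_expression)
    if call and call.group("query"):
        print("SQL query to hand to the SQL parser:", call.group("query"))
    else:
        selection = TABLE_SELECTION.search(m_expression)
        print("Referenced view/table:", f'{selection.group("schema")}.{selection.group("item")}')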

    There are many SQL parser packages that could do the job on the SQL queries in model.bim. Nevertheless, some of my queries use OPENROWSET, which led to syntax errors in those packages. I finally found sql-metadata by asking Copilot in Bing, which surprised me because the package is exactly what I needed in my situation. The package is handy and saved me a lot of time, as it doesn’t validate the syntax of a SQL query but aims to extract table names only.

    Import Mode: use SQL query.
    Import Mode: view/table selection.
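
    Usage of sql-metadata is tiny. A minimal sketch with a made-up query in the style of my semantic models; the Parser class simply returns the table names it finds:

    # pip install sql-metadata
    from sql_metadata import Parser

    query = """
    select r.ReferralId, r.CustomerId, v.VacancyTitle
    from dbo.Stg_Vacancy_Referrals as r
    join dbo.Stg_Vacancies as v on v.VacancyId = r.VacancyId
    """

    # Aliases are excluded; only the referenced tables come back.
    print(Parser(query).tables)  # ['dbo.Stg_Vacancy_Referrals', 'dbo.Stg_Vacancies']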

    CI/CD Pipeline & Publish Exposure YML Files

    After extracting the table/view names from all semantic models, prepare an exposure yml template containing Python string format placeholders like the one in the screenshot below. It should have at least two placeholders: one for the dataset name and one for the list of used table/view names wrapped in the ref function, as I assume none of my team members connects directly to a raw (source in DBT) table. In a newer version of my parser, the template also has the url parameterised, as I use the PowerBI REST API to extract a dataset’s url in the PowerBI Service.

    Exposure template with Python string format placeholders.
    The final exposure file.
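
    For illustration, a minimal sketch of the rendering step. The template text, owner, and url below are simplified placeholders rather than my actual template file, but the mechanism is the same: Python string formatting fills in the dataset name, the url, and the list of ref() dependencies.

    import textwrap

    # Sketch: fill an exposure template with str.format; values are placeholders.
    EXPOSURE_TEMPLATE = textwrap.dedent("""\
        version: 2

        exposures:
          - name: {dataset_name}
            type: dashboard
            url: {dataset_url}
            owner:
              name: BI Team
            depends_on:
        {depends_on}
        """)

    def render_exposure(dataset_name: str, dataset_url: str, tables: list[str]) -> str:
        depends_on = "\n".join("      - ref('{}')".format(t) for t in tables)
        return EXPOSURE_TEMPLATE.format(
            dataset_name=dataset_name, dataset_url=dataset_url, depends_on=depends_on
        )

    print(render_exposure(
        "vacancy_referrals",
        "https://app.powerbi.com/groups/.../datasets/...",
        ["Stg_Vacancy_Referrals", "Stg_Vacancies"],
    ))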

    We have two separate repositories for the DBT project and the PowerBI Projects, so we want all generated YML files packaged as an artifact in Azure DevOps. This artifact is promoted to @Release and used in the DBT project’s pipeline that generates the documentation site.

    CI/CD pipeline running the parse script, packing and publishing generated files as an artifact.
    The final result in the DBT documentation site.

    Conclusion

    Saving semantic models as PowerBI Projects gave us a simple way to version control datasets and reports, and the JSON files in the .SemanticModel folder are easy to parse. With a small Python script, a few regular expressions, and sql-metadata, the tables used by each semantic model are extracted and turned into exposure yml files, which the DBT pipeline picks up as an artifact so that our PowerBI semantic models finally show up in the DBT documentation site.