Enforce YAML documentation in dbt with GitHub Actions

Introduction

Below you will find a Python script that uses the pydantic library. It scans all of your models' YAML definitions and checks them against constraints on the fields you define. The structure proposed here is the one I found to enforce a minimum amount of documentation while keeping the rules loose enough to allow quick development when needed. You can adjust these rules to your own organisational standards.

Necessary imports

Here is the list of libraries you will be using to achieve this task:

import glob # to iterate over your files
import pprint # to pretty-print the expected schema
import re # to match with regular expressions
from pathlib import Path # to scan your files' paths
from typing import Any, Dict, Optional # to standardise checks input

import yaml # to evaluate your YAML files
from pydantic import ( # to execute model testing based on predefined schema
    BaseModel,
    Field,
    ValidationError,
    conlist,
    field_validator,
)

Build base model

The first step is to give the script a ‘schema’ to look for when scanning YAML files. This schema defines the minimum set of fields that each model definition must contain.

Here is an example:

class DBTModel(BaseModel):
    name: str = Field(min_length=3) # setting min length of name to 3 chars
    description: str = Field(min_length=10) # setting min length of description to 10 chars
    meta: DBTModelMeta
    additional_args: Optional[Dict[str, Any]] = {}

You might have noticed the meta: DBTModelMeta definition, which refers to a separately built class. This is a great way to standardise the meta section of your YAML documentation. It allows you to enforce model ownership, tagging or any other custom dimensions you want to add there. It is extremely helpful when using dbt docs later or propagating this meta information to a tool such as Atlan. The definition follows below.

Build meta constraints

Here is a simple example that standardises how model owners are declared and restricts the possible values. You can build on it and extend it to any further fields you would like to check.

We provide a list of allowed values: each owner set in your YAML must be a value from that list.
We create a separate class for meta validation, where conlist requires at least one owner and a custom Python function checks the owners against the allowed list. Notice the field_validator decorator, which tells pydantic to run this check.

ALLOWED_DATA_DOMAINS_LIST = [
    "data-engineering",
    "data-science",
    "analytics-engineering",
    "business-analytics",
    "product-analytics",
]


# Define a Pydantic model for the meta section in schema .yml files
class DBTModelMeta(BaseModel):
    owners: conlist(str, min_length=1)  # at least one owner is required

    @field_validator("owners")
    @classmethod
    def validate_owner_team(cls, value):
        # Every owner must be one of the allowed values
        invalid = [owner for owner in value if owner not in ALLOWED_DATA_DOMAINS_LIST]
        if invalid:
            raise ValueError(
                f"Owner team should be one of the following: {ALLOWED_DATA_DOMAINS_LIST}"
            )
        return value

You can follow the same approach of creating classes and keep extending your checks as much as you want (e.g. build a separate class for column documentation validation). A word of caution though: if your checks are extremely thorough and you need to ship quick fixes or urgent changes, you will end up with a lot of failed CI checks and possibly a very frustrated team, especially the members who do not work with dbt that often.
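
As one possible way to extend the checks to columns, the sketch below nests a column class inside a model class. DBTColumn and DBTModelWithColumns are hypothetical names, the minimum lengths are arbitrary, and the sample model is invented; treat it as a starting point rather than a recommended rule set.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class DBTColumn(BaseModel):
    # Hypothetical rule: every column needs a name and a short description
    name: str = Field(min_length=1)
    description: str = Field(min_length=5)


class DBTModelWithColumns(BaseModel):
    name: str = Field(min_length=3)
    # Defaulting to an empty list keeps column documentation optional
    columns: List[DBTColumn] = []


def column_errors(model: dict) -> list:
    """Return the location of each validation error, or [] if the model is valid."""
    try:
        DBTModelWithColumns(**model)
        return []
    except ValidationError as e:
        return [err["loc"] for err in e.errors()]


# Hypothetical model entry where the second column has no description
model = {
    "name": "dim_customers",
    "columns": [
        {"name": "customer_id", "description": "Primary key of the customer."},
        {"name": "email"},
    ],
}
print(column_errors(model))
```

Each error location is a tuple that includes the index of the offending column, which makes it easy to point the author at the exact entry in the YAML file.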

Check if all models have corresponding YAML files

This check only works when you follow the practice of creating a separate YAML file for every model, with the same file name but a different extension. If you keep all definitions in one file or follow any other approach, feel free to adjust this part or skip it completely.

When writing this part, I took into account the scenario where you use dbt's model versions feature. Versioned models share a single YAML file, so the _v<n> postfix is removed from each SQL file name before looking for the matching YAML file.

Here are the steps:

Set the errors variable to False; it is flipped to True whenever a check fails, so the script can raise once at the end.
Set the postfix pattern to account for versioned models.
Iterate over all .sql files in your models directory and remove the postfix from each file name.
Check whether the YAML file exists for each .sql file found.
Raise an error if any YAML definition is missing.

errors = False

# Regular expression to match postfixes like _v1, _v2, etc.
postfix_pattern = re.compile(r"_v\d+$")

# Glob through all SQL files in the models directory
for sql_file in glob.glob(
    "<dbt_root_directory>/models/**/*.sql",
    recursive=True,
):
    sql_file_path = Path(sql_file)
    # Strip the postfix from the file name
    base_name = sql_file_path.stem
    base_name_no_postfix = postfix_pattern.sub("", base_name)
    yml_file_path = sql_file_path.with_name(base_name_no_postfix).with_suffix(
        ".yml"
    )
    if not yml_file_path.exists():
        errors = True
        print(f"Corresponding YAML file not found for: {sql_file_path}")

# If any SQL file is missing its corresponding YAML file
if errors:
    raise Exception(
        "Please ensure corresponding YAML files exist for all SQL files. They must have the exact same name as the SQL model file name."
    )

There are multiple components to these checks. Bear in mind that this is a working script for the layout and conventions described above. If your project is structured differently, you will need to modify the script accordingly.
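
To make the file name mapping concrete, here is a small sketch that applies the same postfix pattern to a few sample paths without touching the file system. The function name expected_yaml_path and the model paths are hypothetical; PurePosixPath is used only so the output looks the same on every operating system.

```python
import re
from pathlib import PurePosixPath

# Same pattern as in the script: matches _v1, _v2, ... at the end of the name
postfix_pattern = re.compile(r"_v\d+$")


def expected_yaml_path(sql_file: str) -> PurePosixPath:
    """Map a model's .sql path to the .yml path the check looks for."""
    sql_path = PurePosixPath(sql_file)
    base_name_no_postfix = postfix_pattern.sub("", sql_path.stem)
    return sql_path.with_name(base_name_no_postfix + ".yml")


# Hypothetical model paths
for path in [
    "models/marts/fct_orders.sql",
    "models/marts/dim_customers_v2.sql",
    "models/staging/stg_events_v10.sql",
]:
    print(path, "->", expected_yaml_path(path))
```

Both versioned files map to a YAML file without the postfix, while the unversioned model keeps its name unchanged.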

Iterate over all of your YAML models and validate the fields present

And now, to the grand finale: the actual loop which evaluates the checks you set up with the pydantic library. It is built to work whether one or several models are defined in a single YAML file. It also skips the YAML definitions for sources or other objects and checks models only. The errors are formatted so that you can quickly identify where the issue is coming from. Here are the steps:

Iterate through all YAML files in your models directory.
Load each YAML file from a stream.
Iterate over each model definition, checking it against the schema defined with the pydantic classes.
Report an error if any of the fields does not match.

# Glob through all schema .yml files in the models directory
for filename in glob.glob(
    "<dbt_root_directory>/models/**/*.yml", recursive=True
):
    with open(filename, "r") as stream:
        try:
            yaml_file = yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            # An unparsable file counts as a failure; move on to the next file
            errors = True
            print(f"Could not parse {filename}: {exc}")
            continue

        # Skip empty files and files that define only sources or other objects
        if not isinstance(yaml_file, dict):
            continue
        yaml_models = yaml_file.get("models") or []

        # A single file may define more than one model
        for model in yaml_models:
            try:
                schema = DBTModel(**model)
            except ValidationError as e:
                errors = True
                print("##############################")
                print(
                    f"ERROR in model: {model.get('name')} in file: {filename}"
                )
                for x in e.errors():
                    print(f"- Field \"{x['loc'][0]}\": {x['type']}")
                    print(f"\t - {x['msg']}")

# If any model is missing any required field
if errors:
    print("Expected Yaml schema for models:")
    pprint.pprint(DBTModel.model_fields, indent=3)
    raise Exception("Please update the models specified above.")

Full executable script

Below are all of the above steps merged into a single executable script. In the next section I also provide a snippet for a GitHub Actions job that automates its execution. Feel free to copy and paste it into your own deployment. If you have any suggestions for improvement, leave a comment below!

import glob
import pprint
import re
from pathlib import Path
from typing import Any, Dict, Optional

import yaml
from pydantic import (
    BaseModel,
    Field,
    ValidationError,
    conlist,
    field_validator,
)

ALLOWED_DATA_DOMAINS_LIST = [
    "data-engineering",
    "data-science",
    "analytics-engineering",
    "business-analytics",
    "product-analytics",
]


# Define a Pydantic model for the meta section in schema .yml files
class DBTModelMeta(BaseModel):
    owners: conlist(str, min_length=1)  # at least one owner is required

    @field_validator("owners")
    @classmethod
    def validate_owner_team(cls, value):
        # Every owner must be one of the allowed values
        invalid = [owner for owner in value if owner not in ALLOWED_DATA_DOMAINS_LIST]
        if invalid:
            raise ValueError(
                f"Owner team should be one of the following: {ALLOWED_DATA_DOMAINS_LIST}"
            )
        return value

class DBTModel(BaseModel):
    name: str = Field(min_length=3)
    description: str = Field(min_length=10)
    meta: DBTModelMeta
    additional_args: Optional[Dict[str, Any]] = {}

errors = False

# Regular expression to match postfixes like _v1, _v2, etc.
postfix_pattern = re.compile(r"_v\d+$")

# Glob through all SQL files in the models directory
for sql_file in glob.glob(
    "<dbt_root_directory>/models/**/*.sql",
    recursive=True,
):
    sql_file_path = Path(sql_file)
    # Strip the postfix from the file name
    base_name = sql_file_path.stem
    base_name_no_postfix = postfix_pattern.sub("", base_name)
    yml_file_path = sql_file_path.with_name(base_name_no_postfix).with_suffix(
        ".yml"
    )
    if not yml_file_path.exists():
        errors = True
        print(f"Corresponding YAML file not found for: {sql_file_path}")

# If any SQL file is missing its corresponding YAML file
if errors:
    raise Exception(
        "Please ensure corresponding YAML files exist for all SQL files. They must have the exact same name as the SQL model file name."
    )


# Glob through all schema .yml files in the models directory
for filename in glob.glob(
    "<dbt_root_directory>/models/**/*.yml", recursive=True
):
    with open(filename, "r") as stream:
        try:
            yaml_file = yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            # An unparsable file counts as a failure; move on to the next file
            errors = True
            print(f"Could not parse {filename}: {exc}")
            continue

        # Skip empty files and files that define only sources or other objects
        if not isinstance(yaml_file, dict):
            continue
        yaml_models = yaml_file.get("models") or []

        # A single file may define more than one model
        for model in yaml_models:
            try:
                schema = DBTModel(**model)
            except ValidationError as e:
                errors = True
                print("##############################")
                print(
                    f"ERROR in model: {model.get('name')} in file: {filename}"
                )
                for x in e.errors():
                    print(f"- Field \"{x['loc'][0]}\": {x['type']}")
                    print(f"\t - {x['msg']}")

# If any model is missing any required field
if errors:
    print("Expected Yaml schema for models:")
    pprint.pprint(DBTModel.model_fields, indent=3)
    raise Exception("Please update the models specified above.")

Implement in GitHub Action

Here is a simple definition of how the above checks can be implemented in your GitHub Actions pipeline. Some of the steps are not mandatory, but they make your deployment leaner and save GitHub Actions minutes by running the job only when it is actually needed. Here are the steps:

(Optional) add workflow dispatch to trigger the job manually at any point.
(Optional) specify pull request types on which this workflow will run.
(Optional) specify branches on which this workflow will run.
(Optional) specify paths to scan for changes and run the workflow only if a change happened within the paths specified.
(Optional) specify concurrency: what happens to this job when you push another commit while the job is still running. In this case we cancel the run in progress and start a new one.
Specify where the job should run; ubuntu-latest in my case.
Check out the code using the official actions/checkout action.
Set up Python using the official actions/setup-python action so that the script can run.
Install the required dependencies.
Run the executable script we defined earlier.

name: Validate model metadata

on:
 workflow_dispatch:
 pull_request:
   types:
     - opened
     - reopened
     - synchronize
   branches:
     - main
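   paths:
     - '<your_dbt_path>/models/**/*.yml'
     - '<your_dbt_path>/models/**/*.sql'  # a new model without a YAML file should also trigger the check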

concurrency:
  group: "validate-model-metadata-pr-${{ github.event.pull_request.number }}"
  cancel-in-progress: true

jobs:
 run-python-script:
   runs-on: ubuntu-latest

   steps:
   - name: Check out code
     uses: actions/checkout@v4

   - name: Set up Python
     uses: actions/setup-python@v5
     with:
       python-version: '3.x'

   - name: Install required dependencies
     run: |
       python -m pip install --upgrade pip
       pip install -U pyyaml
       pip install -U "pydantic>=2"  # field_validator requires pydantic v2

   - name: Validate model metadata in schema.yml
     run: python <path_to_your_executable_file>
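
GitHub Actions marks a run step as failed when its command exits with a non-zero code. The script above gets there by raising an Exception, which also prints a Python traceback into the job log. If you prefer a cleaner log, one alternative is to collect the messages and end the script with an explicit exit code. The sketch below is only an illustration of that idea: exit_code_for and the hard-coded problem list are hypothetical and not part of the script above.

```python
import sys
from typing import List


def exit_code_for(problems: List[str]) -> int:
    """Print every problem to stderr and return the exit code to finish with."""
    for problem in problems:
        print(problem, file=sys.stderr)
    # Any non-zero exit code makes the workflow step fail
    return 1 if problems else 0


# Hypothetical result of the checks, hard-coded so the sketch runs on its own
problems = ["Corresponding YAML file not found for: models/marts/fct_orders.sql"]
code = exit_code_for(problems)
print(f"would exit with code {code}")
# In a real script the last line would be: sys.exit(code)
```

Either approach fails the pull request check; the difference is only in how the output reads in the job log.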

Contact Me

Thanks for reading. Did you find this useful but lack the time or skills to get your analytics engineering sorted? Check out my contact details.

Enforce YAML documentation in dbt with GitHub Actions was originally published in Lortech Solutions Blog on Medium.