From Zero to Hero: dbt with Ibis Framework

A complete guide from dbt fundamentals to Python-powered transformations


Introduction

After completing the Complete dbt Data Build Tool Bootcamp, I realized dbt brings software engineering best practices to data transformation. But what if you could combine dbt’s structure with Python’s flexibility? That’s where dbt-ibis comes in.

This guide will take you from zero to hero – starting with dbt fundamentals, then showing you how to supercharge your transformations with the Ibis framework. We’ll use DuckDB as our data warehouse.

Part 1: Understanding dbt Basics

What is dbt?

dbt (data build tool) is an open-source command-line tool that enables analysts and engineers to:

  • Transform data using SQL (or Python with Ibis)
  • Manage data models with version control
  • Test data quality automatically
  • Document your data pipeline
  • Deploy changes safely

Think of dbt as “software engineering practices for data” – it brings Git workflows, testing, and modularity to your data transformations.

dbt Core Concepts

1. Models

Models are SQL or Python files that transform raw data into useful tables/views.

SQL Model (models/customer_orders.sql):

SELECT
    customer_id,
    COUNT(*) as order_count
FROM {{ ref('orders') }}
GROUP BY customer_id

2. Sources

Define where your raw data lives (Snowflake, BigQuery, DuckDB, etc.).

models/sources.yml:

version: 2

sources:
  - name: raw_data
    schema: main
    tables:
      - name: orders
      - name: customers

3. Seeds

CSV files stored in your project that dbt loads as tables. Perfect for small reference data.

seeds/customer_segments.csv:

customer_id,segment_name,segment_priority
1,Enterprise,High
2,SMB,Medium
3,Startup,Low

Load seeds with:

dbt seed

Reference in models:

SELECT
    c.*,
    s.segment_name
FROM {{ ref('customers') }} c
LEFT JOIN {{ ref('customer_segments') }} s ON c.customer_id = s.customer_id

4. Snapshots

Track changes to your data over time (Type 2 Slowly Changing Dimensions).

snapshots/customer_snapshot.sql:

{% snapshot customer_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

SELECT * FROM {{ source('raw_data', 'customers') }}

{% endsnapshot %}

5. Refs and Sources

  • {{ ref('model_name') }} – Reference another model
  • {{ source('source_name', 'table_name') }} – Reference raw data
  • {{ ref('seed_name') }} – Reference a seed

6. Tests

Ensure data quality with built-in and custom tests.

models/schema.yml:

version: 2

models:
  - name: customer_orders
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

dbt Project Structure

my_dbt_project/
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Database connections (usually in ~/.dbt/)
├── models/
│   ├── sources.yml          # Raw data definitions
│   ├── schema.yml           # Tests and docs
│   ├── staging/             # Cleaned raw data
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   └── marts/               # Business logic
│       └── customer_metrics.sql
├── seeds/                   # CSV files
├── snapshots/               # Historical tracking
└── tests/                   # Custom tests

Essential dbt Commands

# Setup and validation
dbt debug                    # Test connection
dbt deps                     # Install dependencies

# Development
dbt run                      # Run all models
dbt run --select stg_orders  # Run specific model
dbt run --select +customer_orders  # Run model + upstream

# Testing
dbt test                     # Run all tests
dbt test --select customer_orders  # Test specific model

# Documentation
dbt docs generate            # Generate docs
dbt docs serve               # View docs locally

# Other
dbt seed                     # Load CSV files
dbt snapshot                 # Capture historical data
dbt compile                  # Compile without running

Your First dbt Project

Step 1: Setup

# Install dbt with DuckDB adapter
pip install dbt-core dbt-duckdb

# Initialize project
dbt init my_project

# Navigate to project
cd my_project

Step 2: Configure DuckDB

~/.dbt/profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "analytics.duckdb"

Step 3: Create Sources

models/sources.yml:

version: 2

sources:
  - name: jaffle_shop
    schema: main
    tables:
      - name: orders
      - name: customers

Step 4: Create Staging Models

models/staging/stg_orders.sql:

SELECT
    id as order_id,
    customer_id,
    order_date,
    status
FROM {{ source('jaffle_shop', 'orders') }}

Step 5: Create Business Logic

models/marts/customer_orders.sql:

SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    MIN(o.order_date) as first_order,
    MAX(o.order_date) as most_recent_order,
    COUNT(o.order_id) as number_of_orders
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name

Step 6: Run and Test

dbt run
dbt test
dbt docs generate && dbt docs serve

Part 2: Why Switch to Ibis?

Now that you understand dbt fundamentals, let’s see why Ibis is a game-changer.

The Problem with Pure SQL

SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    MIN(o.order_date) as first_order,
    MAX(o.order_date) as most_recent_order,
    COUNT(o.order_id) as number_of_orders,
    SUM(o.amount) as total_spent
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(o.order_id) >= 2
ORDER BY total_spent DESC

Issues with SQL:

  • No type checking
  • Hard to reuse logic
  • Difficult to unit test
  • Limited IDE autocomplete
  • Database-specific syntax variations
  • Hard to debug complex queries

The Ibis Solution

Ibis Model (models/marts/customer_orders.ibis):

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_customers"), ref("stg_orders"))
def model(stg_customers, stg_orders):
    """Calculate customer order metrics with Python"""
    # Filter completed orders
    completed_orders = stg_orders.filter(stg_orders["status"] == "completed")

    # Aggregate by customer
    customer_orders = completed_orders.group_by("customer_id").aggregate(
        first_order=completed_orders["order_date"].min(),
        most_recent_order=completed_orders["order_date"].max(),
        number_of_orders=completed_orders.count(),
        total_spent=completed_orders["amount"].sum()
    )

    # Join with customers
    result = stg_customers.join(customer_orders, "customer_id", how="left")

    # Filter and order
    return result.filter(
        result["number_of_orders"] >= 2
    ).order_by(result["total_spent"].desc())

Benefits of Ibis:

  • Type safety – Catch errors before execution
  • Reusable functions – Build complex transformations
  • Unit testing – Test with pytest
  • IDE support – Autocomplete and refactoring
  • Backend agnostic – Works with DuckDB, Snowflake, BigQuery, etc.
  • Python ecosystem – Use any Python library for logic

Comparison: SQL vs Ibis

Feature        | SQL                 | Ibis
Type Safety    | None                | Full Python typing
Testing        | Limited             | pytest integration
Reusability    | Copy-paste or Jinja | Functions & classes
Debugging      | Hard                | Python debugger
IDE Support    | Basic               | Full autocomplete
Multi-backend  | Different syntax    | Same code

Part 3: Implementing dbt with Ibis

Installation

# Install dbt-ibis with DuckDB support
pip install dbt-ibis dbt-duckdb "ibis-framework[duckdb]"

# Verify installation
dbt --version
dbt-ibis --version

Project Structure with Ibis

my_dbt_project/
├── dbt_project.yml
├── models/
│   ├── sources.yml
│   ├── schema.yml
│   ├── staging/
│   │   ├── stg_orders.ibis        # Note: .ibis extension
│   │   ├── stg_customers.ibis
│   │   └── __ibis_sql/            # Auto-generated SQL (don't edit)
│   │       ├── stg_orders.sql
│   │       └── stg_customers.sql
│   └── marts/
│       ├── customer_orders.ibis
│       └── __ibis_sql/
│           └── customer_orders.sql
├── seeds/
│   └── customer_segments.csv
├── snapshots/
│   └── customer_snapshot.sql      # Snapshots remain in SQL
└── tests/
    └── test_customer_orders.ibis

Important: Ibis models use the .ibis file extension, NOT .py. The dbt-ibis CLI automatically compiles these to SQL files in __ibis_sql/ subfolders.

Configuration

~/.dbt/profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "analytics.duckdb"

dbt_project.yml:

name: "my_project"
version: "1.0.0"
config-version: 2

profile: "my_project"

model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table

Building Models with Ibis

1. Define Sources with Column Data Types

Critical: for sources, seeds, snapshots, and any non-Ibis (SQL) models, you must specify column data types – this is how dbt-ibis knows each schema.

models/sources.yml:

version: 2

sources:
  - name: jaffle_shop
    schema: main
    tables:
      - name: orders
        columns:
          - name: id
            data_type: integer
          - name: customer_id
            data_type: integer
          - name: order_date
            data_type: date
          - name: status
            data_type: varchar
          - name: amount
            data_type: decimal
      - name: customers
        columns:
          - name: id
            data_type: integer
          - name: first_name
            data_type: varchar
          - name: last_name
            data_type: varchar
          - name: email
            data_type: varchar

Tip: Use the dbt-codegen package to auto-generate these column definitions:

dbt run-operation generate_source --args '{"database_name": "main", "schema_name": "main", "generate_columns": true, "table_names": ["orders", "customers"]}'

2. Create Staging Models

models/staging/stg_orders.ibis:

from dbt_ibis import depends_on, source

@depends_on(source("jaffle_shop", "orders"))
def model(orders):
    """Clean and standardize orders data"""
    return orders.select(
        order_id=orders["id"],
        customer_id=orders["customer_id"],
        order_date=orders["order_date"],
        status=orders["status"],
        amount=orders["amount"]
    )

models/staging/stg_customers.ibis:

from dbt_ibis import depends_on, source

@depends_on(source("jaffle_shop", "customers"))
def model(customers):
    """Standardize customer data"""
    return customers.select(
        customer_id=customers["id"],
        first_name=customers["first_name"],
        last_name=customers["last_name"],
        email=customers["email"].lower()
    )

3. Create Business Logic Models

models/marts/customer_orders.ibis:

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_customers"), ref("stg_orders"))
def model(stg_customers, stg_orders):
    """
    Calculate customer order metrics:
    - First order date
    - Most recent order
    - Total number of orders
    - Total spent
    """
    # Filter for completed orders only
    completed_orders = stg_orders.filter(stg_orders["status"] == "completed")

    # Aggregate by customer
    customer_orders = completed_orders.group_by("customer_id").aggregate(
        first_order=completed_orders["order_date"].min(),
        most_recent_order=completed_orders["order_date"].max(),
        number_of_orders=completed_orders.count(),
        total_spent=completed_orders["amount"].sum()
    )

    # Join with customer details
    result = stg_customers.join(customer_orders, "customer_id", how="left")

    # Select final columns
    return result.select(
        "customer_id",
        "first_name",
        "last_name",
        "first_order",
        "most_recent_order",
        "number_of_orders",
        "total_spent"
    )

Note: Parameter names in the model() function must match the names used in ref() and source(). For example, ref("stg_customers") means the parameter should be stg_customers.

Running Your Ibis Project

The key command is dbt-ibis instead of dbt:

# Run all models (compiles .ibis to .sql, then runs dbt)
dbt-ibis run

# Run specific model
dbt-ibis run --select customer_orders

# Run with upstream dependencies
dbt-ibis run --select +customer_orders

# Run tests
dbt-ibis test

# Generate documentation
dbt-ibis docs generate
dbt-ibis docs serve

# Load seeds (seeds are still CSV, no Ibis)
dbt-ibis seed

# Run snapshots (snapshots are still SQL)
dbt-ibis snapshot

Understanding the Compilation Process

# This single command:
dbt-ibis run

# Is equivalent to:
dbt-ibis precompile    # Convert .ibis files to .sql
dbt run                # Run the generated SQL with dbt

The precompile step generates SQL files in __ibis_sql/ subfolders. You can inspect these to debug your Ibis expressions.

Shell Alias (Optional)

To keep typing the dbt command instead of dbt-ibis, add this alias to your ~/.bashrc or ~/.zshrc:

alias dbt="dbt-ibis"

Part 4: Working with Seeds and Snapshots

Seeds with Ibis

Seeds remain CSV files, but you can reference them in Ibis models.

seeds/customer_segments.csv:

customer_id,segment_name,segment_priority
1,Enterprise,High
2,SMB,Medium
3,Startup,Low

seeds/schema.yml (required for Ibis to know the types):

version: 2

seeds:
  - name: customer_segments
    columns:
      - name: customer_id
        data_type: integer
      - name: segment_name
        data_type: varchar
      - name: segment_priority
        data_type: varchar
        tests:
          - accepted_values:
              values: ['High', 'Medium', 'Low']

models/marts/enriched_orders.ibis:

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_orders"), ref("customer_segments"))
def model(stg_orders, customer_segments):
    """Enrich orders with customer segments"""
    return stg_orders.join(
        customer_segments,
        "customer_id",
        how="left"
    ).select(
        "order_id",
        "customer_id",
        "order_date",
        "amount",
        "segment_name",
        "segment_priority"
    )

Snapshots

Snapshots currently remain in SQL (dbt-ibis limitation). You can still reference them in Ibis models.

snapshots/customer_snapshot.sql:

{% snapshot customer_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

SELECT * FROM {{ source('jaffle_shop', 'customers') }}

{% endsnapshot %}

models/schema.yml (define snapshot columns for Ibis):

version: 2

snapshots:
  - name: customer_snapshot
    columns:
      - name: customer_id
        data_type: integer
      - name: first_name
        data_type: varchar
      - name: last_name
        data_type: varchar
      - name: email
        data_type: varchar
      - name: dbt_valid_from
        data_type: timestamp
      - name: dbt_valid_to
        data_type: timestamp

models/marts/customer_history.ibis:

from dbt_ibis import depends_on, ref

@depends_on(ref("customer_snapshot"))
def model(customer_snapshot):
    """Analyze customer changes over time"""
    # Get current customers (dbt_valid_to is null)
    current = customer_snapshot.filter(
        customer_snapshot["dbt_valid_to"].isnull()
    )

    return current.select(
        "customer_id",
        "first_name",
        "last_name",
        "dbt_valid_from"
    )

Part 5: Writing Tests with Ibis

Singular Tests

Create .ibis files in your tests/ folder with a test function (not model).

tests/test_no_duplicate_names.ibis:

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_customers"))
def test(stg_customers):
    """Test that no customer has first_name equal to last_name"""
    return stg_customers.filter(
        stg_customers["first_name"] == stg_customers["last_name"]
    )

The test passes if the query returns zero rows.

Unit Testing with pytest

You can also write unit tests outside of dbt using pytest and Ibis memtables:

tests/unit/test_transformations.py:

import pytest
import ibis

def test_customer_orders_aggregation():
    """Test that customer orders aggregation works correctly"""

    # Create test data using memtables
    test_orders = ibis.memtable({
        "order_id": [1, 2, 3, 4],
        "customer_id": [1, 1, 2, 3],
        "order_date": ["2024-01-01", "2024-01-15", "2024-01-20", "2024-02-01"],
        "status": ["completed", "completed", "completed", "pending"],
        "amount": [100.0, 150.0, 200.0, 50.0]
    })

    # Apply the same logic as your model
    completed_orders = test_orders.filter(test_orders["status"] == "completed")

    result = completed_orders.group_by("customer_id").aggregate(
        order_count=completed_orders.count(),
        total_spent=completed_orders["amount"].sum()
    ).execute()

    # Assert expectations
    assert len(result) == 2  # Only customers 1 and 2 have completed orders

    customer_1 = result[result["customer_id"] == 1].iloc[0]
    assert customer_1["order_count"] == 2
    assert customer_1["total_spent"] == 250.0

Run with:

pytest tests/unit/

Part 6: Advanced Ibis Patterns

Complex Aggregations

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_orders"))
def model(stg_orders):
    """Customer lifetime value analysis"""
    return stg_orders.group_by("customer_id").aggregate(
        first_order=stg_orders["order_date"].min(),
        last_order=stg_orders["order_date"].max(),
        total_spent=stg_orders["amount"].sum(),
        avg_order_value=stg_orders["amount"].mean(),
        order_count=stg_orders.count()
    )

Window Functions

from dbt_ibis import depends_on, ref

@depends_on(ref("stg_orders"))
def model(stg_orders):
    """Rank orders by customer"""
    return stg_orders.mutate(
        order_rank=stg_orders["order_id"].rank().over(
            group_by="customer_id",
            order_by="order_date"
        ),
        running_total=stg_orders["amount"].sum().over(
            group_by="customer_id",
            order_by="order_date"
        )
    )

Conditional Logic with Case Statements

from dbt_ibis import depends_on, ref
import ibis

@depends_on(ref("stg_orders"))
def model(stg_orders):
    """Categorize order values"""
    return stg_orders.mutate(
        order_category=ibis.case()  # note: newer Ibis releases replace ibis.case() with ibis.cases(...)
            .when(stg_orders["amount"] < 50, "Small")
            .when(stg_orders["amount"] < 200, "Medium")
            .else_("Large")
            .end()
    )

Reusable Functions

Create reusable transformation functions:

# models/utils/transformations.py (regular Python file)
def add_fiscal_quarter(table, date_column):
    """Add fiscal quarter based on date column"""
    import ibis
    month = table[date_column].month()
    return table.mutate(
        fiscal_quarter=ibis.case()
            .when(month.isin([1, 2, 3]), "Q1")
            .when(month.isin([4, 5, 6]), "Q2")
            .when(month.isin([7, 8, 9]), "Q3")
            .else_("Q4")
            .end()
    )

# models/marts/orders_with_quarter.ibis
from dbt_ibis import depends_on, ref
from utils.transformations import add_fiscal_quarter

@depends_on(ref("stg_orders"))
def model(stg_orders):
    """Add fiscal quarter to orders"""
    return add_fiscal_quarter(stg_orders, "order_date")

Part 7: Configuration Tips

Column Name Casing (Snowflake, etc.)

For databases like Snowflake that store identifiers in uppercase:

dbt_project.yml:

vars:
  # Format: dbt_ibis_letter_case_in_db_{profile}_{target}
  dbt_ibis_letter_case_in_db_my_project_prod: upper
  dbt_ibis_letter_case_in_expr: lower

This lets you write lowercase column names in your Ibis code while Snowflake stores them in uppercase.

VS Code Configuration

Add to your VS Code settings.json:

{
    "files.associations": {
        "*.ibis": "python"
    }
}

Part 8: Limitations and Workarounds

Current dbt-ibis Limitations

  1. No database connection in models – Ibis models generate SQL; they cannot query the database directly
  2. Column types required – Non-Ibis sources, seeds, snapshots, and SQL models need explicit column type definitions
  3. Snapshots remain SQL – Currently no Ibis support for snapshot definitions

Workarounds

For complex queries requiring database connection: Use a SQL model for that specific transformation, then reference it in your Ibis models.

For auto-generating column types:

# Install dbt-codegen
dbt deps  # After adding to packages.yml

# Generate source definitions
dbt run-operation generate_source \
  --args '{"database_name": "main", "schema_name": "main", "generate_columns": true, "table_names": ["orders"]}'

# Generate model column definitions
dbt run-operation generate_model_yaml \
  --args '{"model_names": ["stg_orders"]}'

Part 9: CI/CD Integration

GitHub Actions Example

.github/workflows/dbt.yml:

name: dbt CI

on: [push, pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install dbt-ibis dbt-duckdb "ibis-framework[duckdb]"
          pip install pytest

      - name: Verify Ibis compilation is up to date
        run: |
          dbt-ibis precompile
          # Check if any files changed
          if [[ -n $(git status --porcelain) ]]; then
            echo "Error: dbt-ibis precompile generated changes. Please commit them."
            git status
            exit 1
          fi

      - name: Run dbt
        run: |
          dbt-ibis run
          dbt-ibis test

      - name: Run pytest
        run: pytest tests/unit/

Pre-commit Hook

.pre-commit-config.yaml:

repos:
  - repo: local
    hooks:
      - id: dbt-ibis-precompile
        name: dbt-ibis precompile
        entry: dbt-ibis precompile
        language: system
        pass_filenames: false
        files: '\.ibis$'

Part 10: Switching Backends

One of Ibis’s greatest strengths is backend portability. Your models work unchanged across databases.

Development with DuckDB, Production with Snowflake

~/.dbt/profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "analytics.duckdb"

    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: analyst
      database: analytics
      warehouse: compute_wh
      schema: dbt_analytics

Run against different targets:

# Development
dbt-ibis run

# Production
dbt-ibis run --target prod

Your Ibis models remain unchanged – they automatically compile to the appropriate SQL dialect!


Conclusion

This guide took you from zero to hero – starting with dbt fundamentals and ending with Python-powered transformations using Ibis.

What You Now Know

  • dbt fundamentals: Models, sources, refs, tests, seeds, snapshots
  • dbt commands: run, test, docs, seed, snapshot
  • dbt-ibis syntax: .ibis files, @depends_on, ref(), source()
  • Column type requirements: Why and how to specify them
  • Testing: Singular tests and pytest integration
  • Advanced patterns: Window functions, aggregations, reusable code
  • CI/CD: GitHub Actions integration
  • Multi-backend: Same code for DuckDB and Snowflake

Quick Reference

Task               | Command
Run all models     | dbt-ibis run
Run specific model | dbt-ibis run --select model_name
Run tests          | dbt-ibis test
Load seeds         | dbt-ibis seed
Run snapshots      | dbt-ibis snapshot
Generate docs      | dbt-ibis docs generate
Precompile only    | dbt-ibis precompile

Supported Backends

dbt-ibis supports: DuckDB, Snowflake, BigQuery, Postgres, Redshift, RisingWave, Databricks, Trino, MySQL, SQLite, Oracle


Ready to transform your data pipeline? Start with the examples above, experiment locally with DuckDB, and watch your productivity soar.
