Building a Complete DuckLake Solution: From Local Development to Cloud Production

Introduction

DuckLake is revolutionizing the lakehouse architecture by combining the simplicity of DuckDB with the power of modern data lake formats. In this comprehensive guide, I’ll walk you through building a complete DuckLake solution in two parts: first creating a local development environment, then scaling it to a cloud-based production setup using MotherDuck, Cloudflare R2, and Supabase.

This tutorial demonstrates a real-world data pipeline that transforms transactional data through bronze, silver, and gold layers, showcasing the power and flexibility of the DuckLake ecosystem.

Table of Contents

  • Architecture Overview
  • Technology Stack Comparison
  • Part 1: Local DuckLake Implementation
  • Part 2: Cloud-Based DuckLake with MotherDuck, R2, and Supabase
  • Lessons Learned
  • Key Benefits and Learnings
  • Conclusion
  • Next Steps

Architecture Overview

Our solution implements a modern lakehouse architecture with:

  • Bronze Layer: Raw CSV data (local files and Cloudflare R2)
  • Silver Layer: Transformed Parquet files with business logic
  • Gold Layer: DuckLake tables optimized for analytics
  • Data Warehouse: SQLite (local) and MotherDuck (cloud)

Technology Stack Comparison

| Component | Local Setup | Cloud Setup |
| --- | --- | --- |
| Data Warehouse | SQLite (dwh.db) | MotherDuck |
| Compute Engine | DuckDB | MotherDuck + DuckDB |
| DuckLake Metadata | DuckDB | Supabase PostgreSQL |
| Bronze Storage | Local CSV files | Cloudflare R2 |
| Silver Storage | Local Parquet files | Cloudflare R2 |
| Gold Storage | Local Parquet files | Cloudflare R2 |
| Data Transformation | DuckDB SQL | Ibis Framework + DuckDB |
| Secrets Management | Not required | DuckDB Persistent Secrets |
| Authentication | Local files | MotherDuck token, R2 keys, Supabase credentials |

Part 1: Local DuckLake Implementation

Step 1: Setting Up the Data Warehouse with SQLite

We start by creating a simple transactional database using SQLite to simulate a traditional data warehouse:

-- 1_sqlite_commands.sql
sqlite3 dwh.db  

CREATE TABLE transactions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    transaction_date DATE,
    customer_id INTEGER,
    product_name TEXT,
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2)
);

INSERT INTO transactions (transaction_date, customer_id, product_name, quantity, unit_price, total_amount) VALUES
('2024-01-15', 1001, 'Laptop Pro', 1, 1299.99, 1299.99),
('2024-01-15', 1002, 'Wireless Mouse', 2, 29.99, 59.98),
('2024-01-16', 1003, 'USB Cable', 3, 9.99, 29.97),
('2024-01-16', 1004, 'Monitor 24"', 1, 299.99, 299.99),
('2024-01-17', 1005, 'Keyboard Mechanical', 1, 89.99, 89.99),
('2024-01-17', 1006, 'Headphones', 1, 149.99, 149.99),
('2024-01-18', 1007, 'Webcam HD', 1, 79.99, 79.99),
('2024-01-18', 1008, 'Phone Case', 2, 19.99, 39.98),
('2024-01-19', 1009, 'Tablet 10"', 1, 399.99, 399.99),
('2024-01-19', 1010, 'Charger USB-C', 1, 24.99, 24.99),
('2024-01-20', 1011, 'Speaker Bluetooth', 1, 59.99, 59.99),
('2024-01-20', 1012, 'Memory Card 64GB', 1, 39.99, 39.99),
('2024-01-21', 1013, 'Power Bank', 1, 49.99, 49.99),
('2024-01-21', 1014, 'Screen Protector', 3, 12.99, 38.97);

SELECT * FROM transactions;
.exit 

This creates our foundational transactional data that represents typical e-commerce sales data.
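
As a quick sanity check, you can summarize the table before moving on; run in the same sqlite3 session, this minimal verification query should report 14 rows:

SELECT COUNT(*) AS order_count,
       SUM(total_amount) AS total_revenue
FROM transactions;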

Step 2: Creating the Silver Layer with DuckDB

Next, we use DuckDB to connect to our SQLite database and perform transformations:

-- 2_duckdb_silver_commands.sql
duckdb

ATTACH 'dwh.db' AS sqlite_db (TYPE sqlite);
USE sqlite_db;

-- Export the product-level aggregation to the silver layer (the silver/ directory must exist beforehand)
COPY (
SELECT 
    product_name,
    SUM(total_amount) AS Sales
FROM transactions
GROUP BY product_name
ORDER BY Sales DESC
) TO 'silver/sales.parquet' (FORMAT PARQUET);

.exit 

This transformation:

  • Connects DuckDB to our SQLite database
  • Aggregates sales data by product
  • Exports the results as a Parquet file in the silver layer
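
Because DuckDB can query Parquet files directly, you can verify the silver output without loading it anywhere. A quick check from a DuckDB session started in the project root might look like this (assuming the COPY above succeeded):

SELECT product_name, Sales
FROM 'silver/sales.parquet'
ORDER BY Sales DESC;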

Step 3: Creating the DuckLake

Finally, we create our DuckLake using the DuckLake extension:

-- 3_duckdb_ducklake_commands.sql
duckdb 

INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:duckdb:ducklake.db' AS ducklake_db (DATA_PATH 'gold/'); 
USE ducklake_db;

-- The bronze/ folder contains the raw CSV extracts (customers, inventory, vendors) used as source data
CREATE TABLE customers AS 
FROM 'bronze/customers.csv';

CREATE TABLE inventory AS 
FROM 'bronze/inventory.csv';

CREATE TABLE vendors AS 
FROM 'bronze/vendors.csv';

CREATE TABLE sales AS 
FROM 'silver/sales.parquet';

SHOW TABLES;
SELECT * FROM sales;

.exit 

This creates our local DuckLake with:

  • Customer data from CSV files
  • Inventory and vendor information
  • Transformed sales data from the silver layer
  • All data automatically stored as optimized Parquet files in the gold folder

Querying the Local DuckLake

Despite the data being stored as Parquet files in the gold folder, we can query the DuckLake tables as if they were traditional database tables:

SELECT * FROM sales;

This demonstrates one of DuckLake’s key advantages: the ability to query distributed Parquet files through a unified table interface, providing the performance benefits of columnar storage with the familiar SQL experience of traditional databases.
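
It is also easy to peek at what DuckLake wrote to disk. The table data lands as Parquet files under the gold/ folder, which you can list from the same DuckDB session; the exact directory layout and file names are generated by the extension, so treat this as an inspection aid rather than part of the pipeline:

SELECT file
FROM glob('gold/**/*.parquet');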

Part 2: Cloud-Based DuckLake with MotherDuck, R2, and Supabase

Prerequisites: Setting Up DuckDB Secrets

Before working with cloud services, we need to configure our credentials:

-- duckdb_settings.sql
duckdb

.timer on

CREATE PERSISTENT SECRET r2_secret (
  TYPE R2,
  KEY_ID 'your_r2_key_id',
  SECRET 'your_r2_secret',
  ACCOUNT_ID 'your_r2_account_id'
);

CREATE PERSISTENT SECRET supabase_secret (
  TYPE postgres,
  HOST 'your_supabase_host',
  PORT 6543,
  DATABASE 'postgres',
  USER 'your_supabase_user',
  PASSWORD 'your_supabase_password'
);

SELECT secret_string FROM duckdb_secrets();
.exit 

Important: Make sure your MotherDuck access token is set in the motherduck_token environment variable before proceeding, so that duckdb.connect("md:") and ibis.duckdb.connect("md:") can authenticate without prompting.

Step 1: Creating the Cloud Data Warehouse in MotherDuck

We start by replicating our transactional data in MotherDuck:

import duckdb 
import ibis
from ibis import _

# Create connection to MotherDuck
dw = duckdb.connect("md:")
print("Connected successfully to MotherDuck!")

# Create a MotherDuck database if it doesn't exist
schema_name = 'micho_db'

dw.sql(f"""
    CREATE DATABASE IF NOT EXISTS {schema_name};
""")
print(f"Database {schema_name} created successfully.")

Step 2: Creating the Transactions Table

dw.sql("""
    USE micho_db;

    DROP TABLE IF EXISTS transactions;
    
    -- Create a sequence for auto-incrementing IDs
    CREATE SEQUENCE IF NOT EXISTS transaction_id_seq;
       
    CREATE TABLE IF NOT EXISTS transactions (
        id INTEGER DEFAULT nextval('transaction_id_seq') PRIMARY KEY,
        transaction_date TIMESTAMP,
        customer_id STRING,
        product_name STRING,
        quantity INTEGER,
        unit_price DOUBLE,
        total_amount DOUBLE
    );
""")
print("Table transactions created successfully with sequence-based auto-increment")

Step 3: Inserting Sample Data

dw.sql("""
    INSERT INTO transactions (transaction_date, customer_id, product_name, quantity, unit_price, total_amount)
    VALUES
    ('2024-01-15', '1001', 'Laptop Pro', 1, 1299.99, 1299.99),
    ('2024-01-15', '1002', 'Wireless Mouse', 2, 29.99, 59.98),
    ('2024-01-16', '1003', 'USB Cable', 3, 9.99, 29.97),
    ('2024-01-16', '1004', 'Monitor 24"', 1, 299.99, 299.99),
    ('2024-01-17', '1005', 'Keyboard Mechanical', 1, 89.99, 89.99),
    ('2024-01-17', '1006', 'Headphones', 1, 149.99, 149.99),
    ('2024-01-18', '1007', 'Webcam HD', 1, 79.99, 79.99),
    ('2024-01-18', '1008', 'Phone Case', 2, 19.99, 39.98),
    ('2024-01-19', '1009', 'Tablet 10"', 1, 399.99, 399.99),
    ('2024-01-19', '1010', 'Charger USB-C', 1, 24.99, 24.99),
    ('2024-01-20', '1011', 'Speaker Bluetooth', 1, 59.99, 59.99),
    ('2024-01-20', '1012', 'Memory Card 64GB', 1, 39.99, 39.99),
    ('2024-01-21', '1013', 'Power Bank', 1, 49.99, 49.99),
    ('2024-01-21', '1014', 'Screen Protector', 3, 12.99, 38.97);
""")

print("Successfully inserted records with auto-generated IDs!")

# Verify the data
t = dw.sql("SELECT * FROM micho_db.main.transactions;")
print(t)

# Close MotherDuck connection  
dw.close()

Step 4: Data Transformation with Ibis Framework

For data transformations, we use the Ibis framework which provides a more Pythonic approach:

# Create connection to MotherDuck database
con = ibis.duckdb.connect("md:")
print("Connected to MotherDuck successfully!")

# Execute the query on MotherDuck and pull the result into a Pandas DataFrame
silver = con.table('transactions', database='micho_db.main')
silver_transformed = silver.aggregate(
    total_quantity=_.quantity.sum(),
    average_unit_price=_.unit_price.mean(),
    total_amount=_.total_amount.sum(),
    by=['product_name']
).order_by(_.total_amount.desc())

silver_transformed_df = silver_transformed.to_pandas()
print(silver_transformed_df)

# Close the Ibis connection to MotherDuck
con.disconnect()
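
For readers who think in SQL, the Ibis expression above corresponds roughly to the following aggregation (a sketch of the equivalent query, not the literal SQL that Ibis compiles):

SELECT
    product_name,
    SUM(quantity)     AS total_quantity,
    AVG(unit_price)   AS average_unit_price,
    SUM(total_amount) AS total_amount
FROM micho_db.main.transactions
GROUP BY product_name
ORDER BY total_amount DESC;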

Step 5: Exporting to Cloudflare R2

We export our transformed data to Cloudflare R2 object storage:

# Export the Pandas DataFrame to a Parquet file in the R2 bucket.
# duckdb.sql() runs on the default in-memory connection and can reference
# the silver_transformed_df DataFrame directly by its variable name.
duckdb.sql("""
INSTALL httpfs;
LOAD httpfs;
""")

duckdb.sql("COPY (SELECT * FROM silver_transformed_df) TO 'r2://ducklake/silver/silver_transformed.parquet' (FORMAT 'parquet', OVERWRITE);")
print("Exported Pandas DataFrame to R2 bucket successfully!")
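
If you want to confirm the upload before building the DuckLake, you can read the file straight back from R2 through the same duckdb.sql() calls; this sketch relies on the persistent r2_secret created earlier and the httpfs extension loaded above:

SELECT COUNT(*) AS row_count
FROM 'r2://ducklake/silver/silver_transformed.parquet';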

Step 6: Creating the Cloud DuckLake in Supabase

Now we create our production DuckLake using Supabase as the metadata store:

# Connect to MotherDuck with DuckDB
con_lh = duckdb.connect('md:')
print("Connected successfully to MotherDuck!")

# Install necessary extensions
con_lh.install_extension('ducklake')
con_lh.load_extension('ducklake')

con_lh.install_extension('postgres')
con_lh.load_extension('postgres')

con_lh.install_extension('httpfs')
con_lh.load_extension('httpfs')

Step 7: Attaching to the DuckLake

# Attach to the DuckLake catalog stored in Supabase PostgreSQL
try:
    con_lh.sql("""
        ATTACH 'ducklake:postgres:host=your_supabase_host port=6543 user=your_supabase_user password=your_supabase_password dbname=postgres'
        AS ducklake1 (DATA_PATH 'r2://ducklake/gold/');
    """)
    print("Attached to PostgreSQL DuckLake database successfully!")
except Exception as e:
    print(f"Error attaching to DuckLake: {e}")

con_lh.sql("USE ducklake1;")

Step 8: Loading Data into the Cloud DuckLake

Finally, we create our DuckLake tables from the data stored in R2:

# Create customers table from bronze data
try:
    con_lh.sql("""
        CREATE TABLE customers AS 
        SELECT * FROM 'r2://ducklake/bronze/customers.csv';
    """)
    print("Table customers created successfully from bronze folder!")
except Exception as e:
    print(f"Error creating table customers: {e}")

# Create vendors table
try:
    con_lh.sql("""
        CREATE TABLE vendors AS 
        SELECT * FROM 'r2://ducklake/bronze/vendors.csv';
    """)
    print("Table vendors created successfully from bronze folder!")
except Exception as e:
    print(f"Error creating table vendors: {e}")

# Create inventory table
try:
    con_lh.sql("""
        CREATE TABLE inventory AS 
        SELECT * FROM 'r2://ducklake/bronze/inventory.csv';
    """)
    print("Table inventory created successfully from bronze folder!")
except Exception as e:
    print(f"Error creating table inventory: {e}")

# Create sales table from silver data
try:
    con_lh.sql("""
        CREATE TABLE sales AS 
        SELECT * FROM 'r2://ducklake/silver/silver_transformed.parquet';
    """)
    print("Table sales created successfully from silver folder!")
except Exception as e:
    print(f"Error creating table sales: {e}")

# Show all tables registered in the DuckLake catalog
con_lh.sql("SHOW TABLES;").show()

# Query the sales data
con_lh.sql("SELECT * FROM ducklake1.sales;").show()

# Close connection
con_lh.close()

Querying the Cloud DuckLake

Just like in the local setup, despite our data being distributed across Cloudflare R2 object storage as Parquet files, we can query the DuckLake tables seamlessly:

con_lh.sql("SELECT * FROM ducklake1.sales;").show()

This query demonstrates the power of DuckLake’s architecture: while the actual data resides in cost-effective object storage (R2), the metadata is managed in Supabase PostgreSQL, and we can query everything through a unified interface. The distributed storage is completely transparent to the user, providing both performance and cost optimization without sacrificing ease of use.
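
Because the catalog lives in PostgreSQL, DuckLake also keeps a snapshot history you can inspect from any attached session. Assuming the release of the ducklake extension you are running exposes the ducklake_snapshots table function (current documentation describes it), a query like this lists every committed change to the lake:

SELECT *
FROM ducklake_snapshots('ducklake1');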

Lessons Learned

During the development of this DuckLake solution, I encountered several important technical challenges that provided valuable insights for future projects.

Initially, I assumed that since DuckLake worked seamlessly with DuckDB in my local setup, I could simply extend this to MotherDuck for the cloud implementation. However, I discovered that MotherDuck doesn’t support DuckLake metadata management in the same way. This forced me to pivot to using Supabase’s PostgreSQL database as the metadata store, which ultimately proved to be a more robust solution for production environments. This experience taught me that while MotherDuck excels as a compute engine for data transformations, transactional databases like PostgreSQL are better suited for DuckLake’s metadata management requirements.

Another significant learning came when working with data exports to Cloudflare R2. I initially attempted to use the COPY command through the Ibis framework to export pandas DataFrames directly to R2 object storage. However, I learned that Ibis doesn’t support the COPY command for external storage operations. This limitation required me to switch back to using DuckDB directly for the export operation. This experience reinforced an important architectural principle: use DuckDB directly for structural database operations (CREATE, DROP, COPY, etc.) and reserve Ibis for data transformations and analytical queries. The clear separation of concerns between these tools leads to more reliable and maintainable data pipelines.

A third important discovery involved the Supabase connection configuration. Following the MotherDuck tutorial, I initially tried to attach to the Supabase DuckLake using just the simple reference dbname=postgres, expecting it to work with the DuckDB persistent secrets I had configured. However, this approach failed to establish the connection. I learned that despite having the Supabase secret properly configured, the ATTACH command requires the full connection string with all parameters explicitly specified (host, port, user, password, and dbname). This might be a bug in the current implementation, as theoretically the secret reference should be sufficient, but in practice, the complete connection string is necessary for successful connectivity to Supabase PostgreSQL databases.
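
For context, this is roughly the short form I expected to work on the strength of the stored secret (reconstructed from memory, so treat it as illustrative); it failed to connect for me, whereas the fully spelled-out connection string shown in Step 7 worked:

-- Attempted shorthand that did not work in my testing
ATTACH 'ducklake:postgres:dbname=postgres' AS ducklake1 (DATA_PATH 'r2://ducklake/gold/');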

Key Benefits and Learnings

1. Seamless Integration

DuckLake provides seamless integration between different storage systems and compute engines. We can easily move between local development and cloud production environments.

2. Cost-Effective Storage

Using Cloudflare R2 for object storage provides significant cost savings compared to traditional cloud storage solutions, while maintaining high performance.

3. Flexible Metadata Management

Supabase provides a reliable and cost-effective PostgreSQL database for DuckLake metadata, enabling easy scaling and management.

4. Framework Flexibility

The combination of DuckDB SQL and Ibis Framework Python provides the best of both worlds – SQL for structural operations and Python for complex transformations.

5. Performance at Scale

MotherDuck allows us to leverage cloud compute power for data transformations while maintaining the simplicity of DuckDB.

Conclusion

This comprehensive tutorial demonstrates how to build a complete DuckLake solution from local development to cloud production. The architecture we’ve built provides:

  • Scalability: Easy to scale from local development to cloud production
  • Cost-Effectiveness: Using open-source tools and cost-effective cloud services
  • Flexibility: Supporting both SQL and Python workflows
  • Performance: Leveraging the power of DuckDB and cloud compute
  • Reliability: Using proven cloud services for storage and metadata management

The DuckLake ecosystem is rapidly evolving and provides an excellent foundation for modern data architectures. This setup gives you a solid foundation for building sophisticated data pipelines that can grow with your organization’s needs.

Next Steps

Consider exploring:

  • Advanced DuckLake features like time travel and versioning
  • Integration with other cloud providers
  • Implementing CI/CD pipelines for your data workflows
  • Adding data quality and monitoring tools
  • Exploring Delta Lake compatibility features