Skip to content
CFOCoder

How I Understood Cube Core: Semantic Layer, Semantic Model, and Deterministic LLM Results

A didactic guide to remember how to build a semantic layer with Cube Core, DuckDB, and Contoso: YAML semantics, Python helper, notebook implementation, and LLM usage for deterministic analytics.

AI , Data 5 min read
Conceptual diagram of a semantic layer with Cube Core connecting an LLM, a Python helper, a YAML model, and a DuckDB database with Contoso data
Conceptual diagram of a semantic layer with Cube Core connecting an LLM, a Python helper, a YAML model, and a DuckDB database with Contoso data

This post is a note for my future self.

After working through a small example with Cube Core, DuckDB, and the Contoso dataset, I finally understood the difference between a semantic layer, a semantic model, a helper, and the actual role of an LLM when we want reliable numerical results.

The central idea is this:

The LLM must not be the source of numerical truth. The LLM interprets and explains. The semantic layer computes.

In this example I used Cube Core because I wanted to understand the architecture of a semantic layer locally, simply, and in a controlled way before taking the idea to larger scenarios like Snowflake, Power BI, or a formal thesis evaluation.

Cube Core is an open-source semantic layer that lets you define business metrics, dimensions, and relationships over a physical data source.

Instead of asking every user, notebook, or LLM to write raw SQL from scratch, Cube exposes a governed model:

Business question
-> governed metric
-> governed dimension
-> joins defined once
-> SQL generated by Cube
-> consistent result

In my lab I used this architecture:

flowchart LR
    A[Contoso Parquet files in MinIO] --> B[Local DuckDB]
    B --> C[Cube Core]
    D[Semantic model YAML] --> C
    C --> E[Cube REST API]
    E --> F[Python helper]
    F --> G[Jupyter Notebook]
    G --> H[Didactic explanation / LLM]
  
The lab architecture: Contoso data flows into DuckDB, Cube Core applies the semantic model, and the notebook or LLM consumes governed results.

The physical source ended up being a local DuckDB database:

/home/hectorsa/cube-core/data/tesis_duckdb_local.db

The Cube model lives at:

/home/hectorsa/cube-core/conf/model/ContosoDemo.yml

The notebook project is at:

/home/hectorsa/Documents/Cube_Semantic_Layer_Test

The clearest way I can remember it is this:

semantic layer = infrastructure that serves meaning
semantic model = star schema enriched with business rules

More concretely:

Semantic layer
= Cube Core running as a service
+ connection to DuckDB
+ query API
+ authentication
+ metadata endpoint
+ query execution
Semantic model
= facts
+ dimensions
+ measures
+ joins
+ business names
+ reusable formulas

In other words:

Cube Core is the platform.
The YAML is the model.
DuckDB is the data source.
The helper is the client.
The notebook is the lab.
The LLM is the linguistic interface.

Without a semantic layer, a seemingly simple question like:

What are the sales and margin by category and year?

forces you to write something like this:

SELECT
d.CalendarYear,
c.ProductCategoryName,
SUM(f.SalesAmount) AS total_sales_amount,
SUM(f.TotalCost) AS total_cost,
SUM(f.SalesAmount) - SUM(f.TotalCost) AS gross_margin
FROM contoso_FactSales f
JOIN contoso_DimDate d ON f.DateKey = d.DateKey
JOIN contoso_DimProduct p ON f.ProductKey = p.ProductKey
JOIN contoso_DimProductSubcategory sc ON p.ProductSubcategoryKey = sc.ProductSubcategoryKey
JOIN contoso_DimProductCategory c ON sc.ProductCategoryKey = c.ProductCategoryKey
GROUP BY 1, 2
ORDER BY total_sales_amount DESC
LIMIT 8;

That SQL works, but it has several fragilities:

  1. You must remember all the joins.
  2. You must remember the physical column names.
  3. You must know what SalesAmount means.
  4. You must repeat the gross margin formula every time.
  5. An LLM can make a mistake anywhere in the join or metric logic.

The semantic layer solves this by moving repeatable logic into a governed model.

The YAML file defines the meaning of the physical tables.

A simplified excerpt from the model looks like this:

cubes:
- name: FactSales
sql_table: main.contoso_FactSales
measures:
- name: totalSalesAmount
sql: SalesAmount
type: sum
title: Ventas
format: currency
- name: totalCost
sql: TotalCost
type: sum
title: Costo total
format: currency
- name: grossMargin
sql: '{totalSalesAmount} - {totalCost}'
type: number
title: Margen bruto
format: currency
- name: grossMarginPct
sql: '100.0 * ({totalSalesAmount} - {totalCost}) / NULLIF({totalSalesAmount}, 0)'
type: number
title: '% margen bruto'
dimensions:
- name: salesKey
sql: SalesKey
type: number
primary_key: true
- name: dateKey
sql: DateKey
type: time
- name: productKey
sql: ProductKey
type: number

Several important things happen here.

sql_table: main.contoso_FactSales

This tells Cube:

This cube is based on this physical DuckDB table.

Measures are governed metrics.

- name: totalSalesAmount
sql: SalesAmount
type: sum
title: Ventas

That means:

When someone asks for FactSales.totalSalesAmount,
Cube must compute SUM(SalesAmount).

The powerful part is that nobody has to reinvent that metric again.

Gross margin is not written as free SQL in the notebook. It is defined in the model:

- name: grossMargin
sql: '{totalSalesAmount} - {totalCost}'
type: number
title: Margen bruto

This is important for an LLM: the model already knows what gross margin means.

The LLM does not have to deduce it. It only has to ask for the correct metric.

Dimensions are analysis attributes.

For example, date:

- name: DimDate
sql_table: main.contoso_DimDate
dimensions:
- name: dateKey
sql: DateKey
type: time
primary_key: true
- name: calendarYear
sql: CalendarYear
type: number
- name: calendarMonthLabel
sql: CalendarMonthLabel
type: string

And product category:

- name: DimProductCategory
sql_table: main.contoso_DimProductCategory
dimensions:
- name: productCategoryKey
sql: ProductCategoryKey
type: number
primary_key: true
- name: productCategoryName
sql: ProductCategoryName
type: string

In simple language:

Measures answer: how much?
Dimensions answer: by what attribute do I group or filter?

Examples:

How much did we sell? -> measure
By year? -> dimension
By product category? -> dimension
What was the margin? -> measure

One of the most important parts of the YAML is the joins.

In the model, FactSales connects to date and product:

joins:
- name: DimDate
relationship: many_to_one
sql: '{CUBE}.DateKey = {DimDate.dateKey}'
- name: DimProduct
relationship: many_to_one
sql: '{CUBE}.ProductKey = {DimProduct.productKey}'

And product connects to subcategory:

joins:
- name: DimProductSubcategory
relationship: many_to_one
sql: '{CUBE}.ProductSubcategoryKey = {DimProductSubcategory.productSubcategoryKey}'

And subcategory connects to category:

joins:
- name: DimProductCategory
relationship: many_to_one
sql: '{CUBE}.ProductCategoryKey = {DimProductCategory.productCategoryKey}'

Visually, the model looks like this:

erDiagram
    FactSales }o--|| DimDate : DateKey
    FactSales }o--|| DimProduct : ProductKey
    DimProduct }o--|| DimProductSubcategory : ProductSubcategoryKey
    DimProductSubcategory }o--|| DimProductCategory : ProductCategoryKey
  
The semantic graph: the fact table connects to date and product dimensions, and Cube can traverse from product to category.

This was one of the most revealing parts: the YAML does not only define metrics, it also defines the path to reach dimensions.

That is why I can ask for:

FactSales.totalSalesAmount by DimProductCategory.productCategoryName

even though FactSales does not directly have the category name.

Cube knows how to traverse the graph:

FactSales
-> DimProduct
-> DimProductSubcategory
-> DimProductCategory

Without Cube, I write physical SQL.

With Cube, I write a semantic query:

{
"measures": ["FactSales.totalSalesAmount", "FactSales.grossMargin"],
"dimensions": ["DimProductCategory.productCategoryName", "DimDate.calendarYear"],
"order": {
"FactSales.totalSalesAmount": "desc"
},
"limit": 8
}

The difference is huge:

I no longer say how to do all the joins.
I say which metric I want and by which dimensions I want to see it.

Cube generates the correct SQL using the semantic model.

The helper is not the semantic layer.

The helper is just a local utility for calling Cube from Python without repeating HTTP, JWT token, and endpoint code.

In the project it lives here:

/home/hectorsa/Documents/Cube_Semantic_Layer_Test/scripts/cube_query.py

The minimal idea of the helper is this:

import base64, hashlib, hmac, http.client, json, subprocess, time
from typing import Any
CONTAINER = "cube-refresh-worker-yhqr23dw24o0ljf83ggecmge"
def _cube_secret() -> str:
# Reads CUBEJS_API_SECRET from the Cube container environment.
# The secret is not hardcoded in the notebook.
prefix = "CUBEJS_API_SECRET="
env_json = subprocess.check_output([
"docker", "inspect", CONTAINER, "--format", "{{json .Config.Env}}"
], text=True)
for item in json.loads(env_json):
if item.startswith(prefix):
return item.split("=", 1)[1]
raise RuntimeError("Could not find CUBEJS_API_SECRET in the Cube container")
def _cube_host_port() -> tuple[str, int]:
ip = subprocess.check_output([
"docker", "inspect", CONTAINER,
"--format", "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}"
], text=True).strip()
if not ip:
raise RuntimeError("Could not resolve Docker IP for the Cube container")
return ip, 4000

Then it generates a JWT token to talk to Cube:

def cube_token(ttl_seconds: int = 3600) -> str:
secret = _cube_secret()
def enc(obj: dict[str, Any]) -> bytes:
return base64.urlsafe_b64encode(
json.dumps(obj, separators=(",", ":")).encode()
).rstrip(b"=")
header = enc({"alg": "HS256", "typ": "JWT"})
payload = enc({"iat": int(time.time()), "exp": int(time.time()) + ttl_seconds})
sig = base64.urlsafe_b64encode(
hmac.new(secret.encode(), header + b"." + payload, hashlib.sha256).digest()
).rstrip(b"=")
return (header + b"." + payload + b"." + sig).decode()

And exposes the two functions I actually use in the notebook:

def cube_load(query: dict[str, Any]) -> dict[str, Any]:
return cube_request("/cubejs-api/v1/load", "POST", {"query": query})
def cube_meta() -> dict[str, Any]:
return cube_request("/cubejs-api/v1/meta")

The analogy:

Cube Core = restaurant
Semantic model = menu with official recipes
Helper Python = waiter who knows how to ask the kitchen
Notebook / LLM = person asking questions
DuckDB = pantry / ingredients

The waiter does not cook. They only know how to order correctly.

In the notebook I did two things.

First I validated that the data existed:

con.sql("""
SELECT
(SELECT COUNT(*) FROM contoso_FactSales) AS fact_sales_rows,
(SELECT COUNT(*) FROM contoso_DimProduct) AS products,
(SELECT COUNT(*) FROM contoso_DimProductCategory) AS categories
""").df()

Result:

fact_sales_rows: 3,406,089
products: 2,233
categories: 8

Then I ran direct SQL to understand the physical logic:

sql = """
SELECT
d.CalendarYear,
c.ProductCategoryName,
SUM(f.SalesAmount) AS total_sales_amount,
SUM(f.TotalCost) AS total_cost,
SUM(f.SalesAmount) - SUM(f.TotalCost) AS gross_margin
FROM contoso_FactSales f
JOIN contoso_DimDate d ON f.DateKey = d.DateKey
JOIN contoso_DimProduct p ON f.ProductKey = p.ProductKey
JOIN contoso_DimProductSubcategory sc ON p.ProductSubcategoryKey = sc.ProductSubcategoryKey
JOIN contoso_DimProductCategory c ON sc.ProductCategoryKey = c.ProductCategoryKey
GROUP BY 1, 2
ORDER BY total_sales_amount DESC
LIMIT 8
"""
duckdb_sql_result = con.sql(sql).df()
duckdb_sql_result

After that I made the same query using Cube:

import sys
sys.path.append(str(Path('../scripts').resolve()))
from cube_query import cube_meta, cube_load
meta = cube_meta()

The metadata showed me the available metrics and dimensions:

FactSales measures:
- FactSales.count
- FactSales.totalSalesAmount
- FactSales.totalSalesQuantity
- FactSales.totalCost
- FactSales.grossMargin
- FactSales.grossMarginPct
DimProductCategory dimensions:
- DimProductCategory.productCategoryName
DimDate dimensions:
- DimDate.calendarYear
- DimDate.calendarMonthLabel
- DimDate.calendarQuarterLabel

Then the semantic query:

cube_query = {
'measures': ['FactSales.totalSalesAmount', 'FactSales.grossMargin'],
'dimensions': ['DimProductCategory.productCategoryName', 'DimDate.calendarYear'],
'order': {'FactSales.totalSalesAmount': 'desc'},
'limit': 8,
}
result = cube_load(cube_query)
cube_df = pd.DataFrame(result['data'])
cube_df

Cube result:

Home Appliances, 2008 totalSalesAmount = 1426891288.453333 grossMargin = 791977822.0032781
Home Appliances, 2007 totalSalesAmount = 1375117996.8142357 grossMargin = 759744791.8541427
Home Appliances, 2009 totalSalesAmount = 1120727501.9230988 grossMargin = 627761991.1030741
Cameras and camcorders, 2007 totalSalesAmount = 1098555433.3783784 grossMargin = 664774269.2083907
Computers, 2009 totalSalesAmount = 938926781.7259947 grossMargin = 537651024.2660043
Computers, 2007 totalSalesAmount = 894222614.5380026 grossMargin = 524712835.68800735
Cameras and camcorders, 2008 totalSalesAmount = 813888194.612105 grossMargin = 490078318.9321116
Computers, 2008 totalSalesAmount = 788289774.7310024 grossMargin = 459459357.281

The important thing is not only that the results match. The important thing is that the second approach no longer depends on the user remembering the physical SQL.

This is how the pieces connect:

flowchart TD
    A[Physical data: Contoso in DuckDB] --> B[Tables: FactSales, DimDate, DimProduct]
    B --> C[Cube YAML semantic model]
    C --> C1[Measures: totalSalesAmount, grossMargin]
    C --> C2[Dimensions: calendarYear, productCategoryName]
    C --> C3[Joins: FactSales -> Product -> Category]
    C1 --> D[Cube Core]
    C2 --> D
    C3 --> D
    D --> E[Cube REST API]
    E --> F[Python helper: cube_meta, cube_load]
    F --> G[Jupyter Notebook]
    G --> H[LLM explains the result]
  
The complete puzzle: physical data, YAML semantics, Cube Core, helper functions, notebook, and LLM explanation.

Another way to see it:

sequenceDiagram
    participant U as User
    participant L as LLM
    participant S as Skill / Guide
    participant H as Python helper
    participant C as Cube Core
    participant Y as YAML semantic model
    participant D as DuckDB

    U->>L: How much sales and margin by category and year?
    L->>S: How should I query Contoso?
    S-->>L: Use cube_meta and cube_load. Do not invent numbers.
    L->>H: cube_load(semantic query)
    H->>C: POST /cubejs-api/v1/load
    C->>Y: Reads measures, dimensions, and joins
    C->>D: Generates and executes SQL
    D-->>C: Numerical result
    C-->>H: JSON with data
    H-->>L: DataFrame / dict
    L-->>U: Explanation in natural language
  
Runtime flow: the LLM asks through a helper, Cube applies the YAML model, DuckDB computes, and the LLM explains the returned result.

Here it helped me to separate the concepts very clearly:

Semantic model / layer
= business rules + metrics + dimensions + joins + deterministic computation
Helper
= technical client for querying Cube without repeating HTTP/JWT/query boilerplate
Skill
= operational guide so the LLM knows how to use the helper and does not invent numbers
LLM
= interprets the question, calls the correct tool, and explains the result

The skill is not the semantic model.

The skill only tells the LLM how to behave.

For example, a skill for this lab should say something like:

# Contoso Cube Semantic Layer
## When to use
Use when the question involves sales, margin, costs, units, dates,
products, categories, or subcategories from the Contoso dataset.
## Main rule
Numerical results must come from Cube Core via `cube_load`.
Do not compute metrics manually unless the user explicitly asks for validation.
## Helper available
Path:
`/home/hectorsa/Documents/Cube_Semantic_Layer_Test/scripts/cube_query.py`
Functions:
- `cube_meta()` to inspect available metrics and dimensions.
- `cube_load(query)` to execute a semantic query against Cube.
## Mandatory flow
1. Consult metadata if metrics/dimensions are unknown.
2. Build a Cube JSON query.
3. Execute `cube_load(query)`.
4. Review the returned result.
5. Explain the result in natural language.
6. Do not invent values if Cube does not return them.

That disciplines the LLM.

Without a skill, the LLM might improvise SQL, assume columns, or invent a metric.

With a skill, the LLM knows it must use Cube.

It does not mean the LLM becomes deterministic in everything.

It means the numbers do not come from the LLM’s probabilistic reasoning.

They come from a controlled path:

Governed metric -> Semantic query -> Cube -> DuckDB -> Result

The LLM may vary in explanation, but it must not vary in computation.

This phrase summarizes the point:

The LLM does not act as an analytics engine nor as a source of numerical truth; it acts as a linguistic interface over a deterministic semantic layer.

For my thesis project, the idea is to prove that an LLM can reduce numerical errors when using a semantic layer/model instead of improvising.

The experimental design can be compared like this:

flowchart LR
    Q[Analytical question] --> A[LLM with no tools]
    Q --> B[LLM with free SQL]
    Q --> C[LLM with Cube semantic layer]

    A --> A1[Probable but unverifiable answer]
    B --> B1[May be correct, but may fail joins/rules]
    C --> C1[Result governed by model metrics]
  
Experimental framing for the thesis: compare an LLM without tools, an LLM with free SQL, and an LLM constrained to the semantic layer.

Three useful scenarios:

A. LLM without tools
Responds from reasoning or general knowledge.
High risk of numerical hallucination.
B. LLM with raw data access
Can generate SQL or pandas directly.
May be correct, but depends on understanding columns, joins, and rules.
C. LLM with skill + Cube semantic model
Uses governed metrics and dimensions.
Should produce higher accuracy and consistency.

Before, I mentally mixed these pieces:

semantic layer
semantic model
YAML
helper
LLM tool
notebook

Now I see them separated:

DuckDB
= physical base
Cube Core
= semantic layer
ContosoDemo.yml
= semantic model
cube_query.py
= technical helper
Jupyter Notebook
= didactic laboratory
Skill
= instructions so the LLM uses the helper correctly
LLM
= translator between natural language and semantic query

And the final flow is:

User asks in natural language
LLM interprets intent
Skill reminds the LLM how to use the helper
Helper queries Cube Core
Cube applies the semantic model
DuckDB/Contoso delivers data
Cube returns governed result
LLM explains the result

This small example with Contoso helped me because it does not try to solve the whole thesis at once.

It only shows the minimal puzzle:

1. Reduced Contoso dataset
2. DuckDB as local base
3. Cube Core as semantic layer
4. YAML as semantic model
5. Python helper as programmatic interface
6. Notebook as didactic demonstration
7. Skill as guide for the LLM

The most important idea I want to remember is:

Cube computes. The model governs. The helper queries. The skill guides. The LLM explains.

If I keep that separation, the thesis architecture becomes much clearer.