Docling Chunking Tutorial: Preparing Documents for RAG

Docling Chunkers: Overview and Comparison

Docling provides several chunkers for splitting documents into semantically meaningful pieces, each with different strategies and sophistication:

1. BaseChunker

  • Description: The abstract base class that all Docling chunkers implement. It defines the chunking interface (a chunk() method over a converted document) rather than a splitting strategy of its own.
  • Use case: Subclass it to implement a custom chunking strategy; it is not normally instantiated directly.

2. HierarchicalChunker

  • Description: The default structure-based chunker. It leverages the document’s hierarchical structure (sections, headings, paragraphs), emitting one chunk per document element (with list items merged) and attaching heading metadata, so context and meaning are preserved.
  • Use case: Legal, academic, or structured documents where context and section boundaries matter.
  • Sophistication: More advanced than BaseChunker; maintains semantic coherence.

3. HybridChunker

  • Description: Builds on the hierarchical strategy and adds tokenization awareness. It starts from the structure-based chunks, splits any chunk that exceeds a configured token limit, and can merge undersized neighbouring chunks that share the same headings.
  • Use case: Very large or unevenly structured documents, or when you want both structure and size control.
  • Sophistication: The most advanced and flexible; adapts to both structure and token limits.

Summary Table:

| Chunker             | Structure-Aware | Token-Aware     | Semantic Coherence | Use Case                         | Sophistication |
|---------------------|-----------------|-----------------|--------------------|----------------------------------|----------------|
| BaseChunker         | — (abstract)    | — (abstract)    | —                  | Custom chunking strategies       | Interface only |
| HierarchicalChunker | Yes             | No              | High               | Legal, academic, structured docs | Advanced       |
| HybridChunker       | Yes             | Yes             | Very High          | Large/complex docs, RAG          | Most advanced  |

Recommendation:
For most legal/fiscal documents, use HierarchicalChunker for semantic coherence. For very large or complex documents, or when you need strict chunk size control, use HybridChunker—it is the most sophisticated and advanced option.

For more details, see the Docling Chunking Documentation.
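
As a quick orientation before the tutorial proper, here is a minimal sketch of how the two concrete chunkers are typically instantiated. The exact constructor arguments can vary slightly between Docling versions, and the HuggingFace model id passed as the tokenizer is only an example:

# Minimal sketch: instantiating Docling's two concrete chunkers
from docling.chunking import HierarchicalChunker, HybridChunker

# Structure-only chunking: follows headings/sections, no token limit
structural_chunker = HierarchicalChunker()

# Structure-aware chunking with a token budget, for embedding models
token_aware_chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # example model id
    max_tokens=512,  # chunks exceeding this token count are split further
)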

This tutorial focuses specifically on document chunking with Docling’s chunkers, primarily HierarchicalChunker, with HybridChunker used where token-aware size control is needed. We’ll use the Mexican Constitution (CPEUM) as our case study to demonstrate how to intelligently split documents for Retrieval-Augmented Generation (RAG) applications.

What is Document Chunking?

Document chunking is the process of breaking large documents into smaller, semantically meaningful pieces. This is essential for:

  • RAG Systems: Creating chunks for vector embeddings and retrieval
  • LLM Context Windows: Managing token limits when feeding text to language models
  • Semantic Search: Enabling more precise information retrieval
  • Performance: Balancing between context preservation and retrieval accuracy
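
To make the contrast concrete, here is a deliberately naive baseline: a fixed-size character splitter with no notion of sentences, headings, or articles. It is a sketch for comparison only, not something you would ship:

# Naive baseline: fixed-size character splitting, ignoring document structure
def naive_split(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text every `size` characters with a fixed overlap."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

sample = "Artículo 1o. En los Estados Unidos Mexicanos todas las personas gozarán de los derechos humanos... " * 50
print(f"Naive chunks: {len(naive_split(sample))}")  # boundaries fall mid-sentence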

Why Docling’s HierarchicalChunker?

Unlike simple text splitting (such as the naive fixed-size splitter sketched above), Docling’s chunker:

  • Respects Document Structure: Preserves sections, paragraphs, and logical boundaries
  • Semantic Coherence: Doesn’t split mid-sentence or mid-thought
  • Configurable: Token-aware size control is available via HybridChunker, which builds on the same structure-based chunking
  • Metadata Rich: Each chunk includes structural information
  • Table-Aware: Keeps tables intact when possible

Tutorial Structure

  1. Setup and document conversion
  2. Basic chunking with default settings
  3. Advanced chunking with custom parameters
  4. Chunk metadata and structure analysis
  5. Exporting chunks for RAG pipelines
  6. Best practices for legal documents
  7. Chunking strategy comparison
  8. Real-world RAG integration patterns

Section 1: Setup and Prerequisites

Before we start chunking, we need to:

  1. Install Docling and its dependencies
  2. Convert our source document (CPEUM) to a Docling document
  3. Verify the document structure

Prerequisites:

pip install docling docling-core

import os
import json
from pathlib import Path
from pprint import pprint

# Import Docling components
from docling.document_converter import DocumentConverter
from docling.chunking import HierarchicalChunker

print("✓ Dependencies imported successfully!")
print(f"✓ Ready to convert and chunk documents")
### 1.1 Convert the CPEUM PDF to Docling Document

# Define paths
pdf_path = Path("CPEUM.pdf")
output_dir = Path("chunks")
output_dir.mkdir(exist_ok=True)

# Verify file exists
if pdf_path.exists():
    print(f"✓ PDF file found: {pdf_path.name}")
    print(f"  File size: {pdf_path.stat().st_size / (1024*1024):.2f} MB")
else:
    print(f"✗ PDF file not found at {pdf_path}")

# Initialize converter and convert
print("\n" + "=" * 70)
print("CONVERTING CPEUM PDF")
print("=" * 70)

converter = DocumentConverter()
result = converter.convert(str(pdf_path))
doc = result.document

print(f"\n✓ Conversion Status: {result.status}")
print(f"✓ Number of pages: {len(result.pages)}")
print(f"✓ Document ready for chunking!")

Section 2: Basic Chunking with Default Settings

Let’s start with the simplest approach – using HierarchicalChunker with its default parameters.

### 2.1 Create Chunks with Default Parameters

print("=" * 70)
print("BASIC CHUNKING - DEFAULT PARAMETERS")
print("=" * 70)

# Initialize the chunker with default settings
chunker = HierarchicalChunker()

# Chunk the CPEUM document
chunks = list(chunker.chunk(doc))

print(f"\n✓ Total chunks created: {len(chunks)}")
print(f"✓ Chunker initialized with default parameters")

# Analyze the chunks
if chunks:
    chunk_lengths = [len(chunk.text) for chunk in chunks]
    avg_length = sum(chunk_lengths) / len(chunk_lengths)
    
    print(f"\n📊 Chunk Statistics:")
    print(f"  - Average chunk length: {avg_length:.0f} characters")
    print(f"  - Shortest chunk: {min(chunk_lengths)} characters")
    print(f"  - Longest chunk: {max(chunk_lengths)} characters")
    
    # Show the first chunk
    print(f"\n📄 First Chunk Preview:")
    first_chunk = chunks[0]
    print(f"  - Text length: {len(first_chunk.text)} characters")
    print(f"  - Text preview: {first_chunk.text[:200]}...")
    
    # Show chunk metadata
    if hasattr(first_chunk, 'meta'):
        print(f"\n🏷️  First Chunk Metadata:")
        print(f"  - Metadata: {first_chunk.meta}")

Section 3: Advanced Chunking with Custom Parameters

Now let’s fine-tune the chunking process. HierarchicalChunker itself takes no size parameters (it simply follows the document’s structure), so for token-aware control we switch to HybridChunker, which wraps the same hierarchical strategy and adds a token budget.

### 3.1 Custom Chunker Configuration

print("=" * 70)
print("ADVANCED CHUNKING - CUSTOM PARAMETERS")
print("=" * 70)

# Create a token-aware chunker with custom parameters.
# Note: HierarchicalChunker accepts no size parameters; token budgets are
# handled by HybridChunker, which wraps the hierarchical strategy.
from docling.chunking import HybridChunker

custom_chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # example HF tokenizer
    max_tokens=512,      # maximum tokens per chunk
    merge_peers=True,    # merge undersized neighbours that share the same headings
)

# Chunk the document with custom settings
custom_chunks = list(custom_chunker.chunk(doc))

print(f"\n✓ Chunks created with custom parameters: {len(custom_chunks)}")
print("\n⚙️  Custom Chunker Configuration:")
print("  - Max tokens: 512")
print("  - Tokenizer: sentence-transformers/all-MiniLM-L6-v2")
print("  - Merge peers: True")

# Compare with default chunking
print(f"\n📊 Comparison:")
print(f"  - Default chunker: {len(chunks)} chunks")
print(f"  - Custom chunker: {len(custom_chunks)} chunks")
print(f"  - Difference: {abs(len(chunks) - len(custom_chunks))} chunks")

# Analyze custom chunks
if custom_chunks:
    custom_lengths = [len(chunk.text) for chunk in custom_chunks]
    avg_custom = sum(custom_lengths) / len(custom_lengths)
    
    print(f"\n📐 Custom Chunk Statistics:")
    print(f"  - Average length: {avg_custom:.0f} characters")
    print(f"  - Shortest: {min(custom_lengths)} characters")
    print(f"  - Longest: {max(custom_lengths)} characters")
    
    # Show a sample chunk
    print(f"\n📄 Sample Custom Chunk (chunk #5):")
    if len(custom_chunks) > 4:
        sample_chunk = custom_chunks[4]
        print(f"  - Length: {len(sample_chunk.text)} characters")
        print(f"  - Preview: {sample_chunk.text[:300]}...")

Section 4: Understanding Chunk Metadata and Structure

Each chunk created by Docling contains rich metadata about its position and structure in the original document. This is crucial for maintaining context in RAG applications.

### 4.1 Detailed Chunk Analysis

print("=" * 70)
print("CHUNK METADATA ANALYSIS")
print("=" * 70)

# Let's examine the structure and metadata of chunks in detail
print("\n🔍 Detailed Analysis of First 3 Chunks:")

for i, chunk in enumerate(custom_chunks[:3], 1):
    print(f"\n--- Chunk {i} ---")
    print(f"Text length: {len(chunk.text)} characters")
    print(f"Word count: {len(chunk.text.split())} words")
    
    # Show metadata (Docling chunk metadata lives on chunk.meta)
    if hasattr(chunk, 'meta') and chunk.meta is not None:
        meta = chunk.meta
        
        # Heading context of this chunk (hierarchical structure)
        headings = getattr(meta, 'headings', None)
        if headings:
            print(f"Headings: {' > '.join(headings)}")
        
        # Originating document items carry page/provenance information
        doc_items = getattr(meta, 'doc_items', None) or []
        pages = sorted({prov.page_no for item in doc_items for prov in getattr(item, 'prov', [])})
        if pages:
            print(f"Page(s): {pages}")
    
    # Show text preview
    preview_length = min(150, len(chunk.text))
    print(f"Preview: {chunk.text[:preview_length]}...")

print("\n" + "=" * 70)

Section 5: Exporting Chunks for RAG Applications

Now let’s prepare our chunks in a format suitable for vector databases and RAG pipelines. This includes packaging the text with metadata in a structured JSON format.

### 5.1 Prepare Chunks for Vector Database Ingestion

print("=" * 70)
print("EXPORTING CHUNKS FOR RAG")
print("=" * 70)

# Prepare chunks for embedding and vector database storage
# This is the format typically used for RAG pipelines

chunks_for_rag = []

for i, chunk in enumerate(custom_chunks):
    chunk_data = {
        "chunk_id": i,
        "text": chunk.text,
        "char_count": len(chunk.text),
        "word_count": len(chunk.text.split()),
        "metadata": {
            "source": "CPEUM",
            "document_name": pdf_path.name,
        }
    }
    
    # Add optional metadata if available (heading context and page numbers
    # live under chunk.meta in current Docling versions)
    if hasattr(chunk, 'meta') and chunk.meta is not None:
        headings = getattr(chunk.meta, 'headings', None)
        if headings:
            chunk_data["metadata"]["headings"] = " > ".join(headings)
        
        doc_items = getattr(chunk.meta, 'doc_items', None) or []
        pages = sorted({prov.page_no for item in doc_items for prov in getattr(item, 'prov', [])})
        if pages:
            # Stored as a string so vector stores with scalar-only metadata accept it
            chunk_data["metadata"]["pages"] = ",".join(str(p) for p in pages)
    
    chunks_for_rag.append(chunk_data)

print(f"\n✓ Prepared {len(chunks_for_rag)} chunks for RAG pipeline")

# Save chunks to JSON file (ready for vector database ingestion)
chunks_output_path = output_dir / "CPEUM_chunks_for_rag.json"
with open(chunks_output_path, "w", encoding="utf-8") as f:
    json.dump(chunks_for_rag, f, indent=2, ensure_ascii=False)

print(f"✓ Chunks saved to: {chunks_output_path}")

# Show statistics
total_chars = sum(c["char_count"] for c in chunks_for_rag)
total_words = sum(c["word_count"] for c in chunks_for_rag)

print(f"\n📊 RAG Chunk Statistics:")
print(f"  - Total chunks: {len(chunks_for_rag)}")
print(f"  - Total characters: {total_chars:,}")
print(f"  - Total words: {total_words:,}")
print(f"  - Average words per chunk: {total_words/len(chunks_for_rag):.0f}")

# Show sample chunk in RAG format
print(f"\n📄 Sample Chunk in RAG Format:")
print(json.dumps(chunks_for_rag[0], indent=2, ensure_ascii=False)[:500] + "...")

Section 6: Best Practices for Legal Documents

When working with legal documents like the Mexican Constitution, specific considerations apply to ensure optimal chunking for retrieval and comprehension.

### 6.1 Chunking Strategy Guidelines

Chunk Size Selection:

  • Small chunks (128-256 tokens): Better for precise retrieval, but may lose context
  • Medium chunks (256-512 tokens): Good balance for most RAG applications ✓
  • Large chunks (512-1024 tokens): More context, but less precise retrieval

Overlap Considerations:

  • No overlap: Clean boundaries, no duplication
  • Small overlap (32-64 tokens): Helps with context continuity ✓
  • Large overlap (128+ tokens): Better for cross-boundary concepts, but more storage

Note that Docling’s built-in chunkers do not expose an overlap parameter; if you want overlapping chunks, add the overlap as a post-processing step after chunking.

Legal Document Considerations:

  1. Article Boundaries: Try to keep legal articles intact when possible
  2. Hierarchical Structure: Preserve section/subsection relationships in metadata
  3. Cross-References: Consider overlap to maintain references to other articles
  4. Tables and Lists: Keep structured content together in single chunks
RAG Pipeline Integration:

# Typical RAG workflow with Docling chunks:
# 1. Convert document
doc = converter.convert("legal_doc.pdf").document

# 2. Chunk with token-aware settings (HybridChunker for size control)
chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(doc))

# 3. Generate embeddings (with your embedding model)
# embeddings = embedding_model.embed([chunk.text for chunk in chunks])

# 4. Store in vector database
# vector_db.insert(chunks, embeddings, metadata)

# 5. Query and retrieve relevant chunks
# results = vector_db.search(query_embedding, top_k=5)

Performance Tips:

  • Use tokenizer="text" for faster processing with simple documents
  • Adjust min_tokens to avoid creating tiny, unhelpful chunks
  • Test different chunk sizes with your specific retrieval task
  • Monitor chunk distribution to ensure even coverage
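
A simple way to monitor that distribution is a coarse text histogram of chunk lengths, bucketed every 250 characters (a sketch over the chunks produced earlier):

# Coarse histogram of chunk lengths to spot skew in the distribution
from collections import Counter

bucket_size = 250
buckets = Counter((len(chunk.text) // bucket_size) * bucket_size for chunk in custom_chunks)

for start in sorted(buckets):
    print(f"{start:5d}-{start + bucket_size - 1:<5d} {'#' * buckets[start]} ({buckets[start]})")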

Section 7: Chunking Strategy Comparison

Let’s compare different chunking strategies side-by-side to understand their impact on the final output.

### 7.1 Compare Multiple Chunking Strategies

print("=" * 70)
print("CHUNKING STRATEGY COMPARISON")
print("=" * 70)

# Let's compare different chunking strategies for the CPEUM.
# Note: max_tokens and merge_peers are the real HybridChunker knobs; there is
# no min/overlap parameter, so the strategies below vary the token budget instead.

strategies = [
    {"name": "Small Chunks", "max_tokens": 256, "merge_peers": True},
    {"name": "Medium Chunks", "max_tokens": 512, "merge_peers": True},
    {"name": "Large Chunks", "max_tokens": 1024, "merge_peers": True},
    {"name": "No Peer Merging", "max_tokens": 512, "merge_peers": False},
]

comparison_results = []

for strategy in strategies:
    chunker = HybridChunker(
        max_tokens=strategy["max_tokens"],
        merge_peers=strategy["merge_peers"],
    )
    
    strategy_chunks = list(chunker.chunk(doc))
    
    # Calculate statistics
    chunk_lengths = [len(chunk.text) for chunk in strategy_chunks]
    avg_length = sum(chunk_lengths) / len(chunk_lengths) if chunk_lengths else 0
    
    result = {
        "strategy": strategy["name"],
        "config": f"max:{strategy['max_tokens']}, min:{strategy['min_tokens']}, overlap:{strategy['overlap']}",
        "total_chunks": len(strategy_chunks),
        "avg_chars": int(avg_length),
        "min_chars": min(chunk_lengths) if chunk_lengths else 0,
        "max_chars": max(chunk_lengths) if chunk_lengths else 0,
    }
    
    comparison_results.append(result)

# Display comparison table
print("\n📊 Chunking Strategy Comparison Results:\n")
print(f"{'Strategy':<20} {'Config':<40} {'Chunks':<10} {'Avg Chars':<12} {'Min':<8} {'Max'}")
print("-" * 110)

for result in comparison_results:
    print(f"{result['strategy']:<20} {result['config']:<40} {result['total_chunks']:<10} "
          f"{result['avg_chars']:<12} {result['min_chars']:<8} {result['max_chars']}")

print("\n💡 Recommendations:")
print("  - Small Chunks: Use for precise retrieval in Q&A systems")
print("  - Medium Chunks: Best general-purpose setting for RAG")
print("  - Large Chunks: Use when context is critical (legal analysis)")
print("  - No Overlap: Use when storage efficiency is important")

Section 8: Real-World RAG Integration Patterns

Let’s explore practical examples of how to integrate Docling chunks into real RAG pipelines with popular vector databases and embedding models.

### 8.1 Integration Example: ChromaDB

# Example: Using Docling chunks with ChromaDB
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB client (in-memory; use chromadb.PersistentClient for disk storage)
client = chromadb.Client()

# Create or get collection (get_or_create avoids errors on re-runs)
collection = client.get_or_create_collection(
    name="cpeum_legal_docs",
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction()
)

# Add chunks to ChromaDB in a single batched call
# (ChromaDB metadata values must be scalars: str, int, float, or bool)
collection.add(
    documents=[c["text"] for c in chunks_for_rag],
    metadatas=[c["metadata"] for c in chunks_for_rag],
    ids=[f"chunk_{c['chunk_id']}" for c in chunks_for_rag],
)

# Query the collection
results = collection.query(
    query_texts=["¿Cuáles son los derechos humanos en México?"],
    n_results=5
)

### 8.2 Integration Example: Pinecone

# Example: Using Docling chunks with Pinecone (v3+ client API;
# older pinecone-client releases used pinecone.init instead)
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Initialize Pinecone (assumes the index exists and matches the embedding dimension, 384 here)
pc = Pinecone(api_key="your-api-key")
index = pc.Index("cpeum-index")

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed and upsert chunks in one batched call
vectors = [
    (
        f"chunk_{chunk_data['chunk_id']}",
        model.encode(chunk_data["text"]).tolist(),
        chunk_data["metadata"],
    )
    for chunk_data in chunks_for_rag
]
index.upsert(vectors=vectors)

# Query
query_embedding = model.encode("derechos humanos").tolist()
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

### 8.3 Integration Example: LangChain

# Example: Using Docling chunks with LangChain
# (classic langchain import paths; newer releases move these to
#  langchain_community and langchain_openai)
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document

# Convert chunks to LangChain Documents
langchain_docs = [
    Document(
        page_content=chunk["text"],
        metadata=chunk["metadata"]
    )
    for chunk in chunks_for_rag
]

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=langchain_docs,
    embedding=embeddings,
    collection_name="cpeum_collection"
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Use in RAG chain
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=retriever,
    return_source_documents=True
)

# Query
result = qa_chain({"query": "¿Qué dice la constitución sobre los derechos humanos?"})

### 8.4 Integration Example: LlamaIndex

# Example: Using Docling chunks with LlamaIndex
# (pre-0.10 import paths; newer releases use llama_index.core and
#  llama_index.embeddings.openai)
from llama_index import VectorStoreIndex, Document
from llama_index.embeddings import OpenAIEmbedding

# Convert chunks to LlamaIndex Documents
llama_docs = [
    Document(
        text=chunk["text"],
        metadata=chunk["metadata"],
        doc_id=f"chunk_{chunk['chunk_id']}"
    )
    for chunk in chunks_for_rag
]

# Create index
index = VectorStoreIndex.from_documents(
    llama_docs,
    embed_model=OpenAIEmbedding()
)

# Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("¿Cuáles son las garantías individuales?")

Section 9: Summary and Key Takeaways

What We’ve Covered:

  1. Setup and Conversion – Converting documents to Docling format
  2. Basic Chunking – Using HierarchicalChunker with default settings
  3. Advanced Configuration – Token-aware chunking with HybridChunker (max_tokens, tokenizer, merge_peers)
  4. Metadata Analysis – Understanding chunk structure and metadata
  5. RAG Export – Preparing chunks in JSON format for vector databases
  6. Best Practices – Legal document-specific recommendations
  7. Strategy Comparison – Side-by-side analysis of different approaches
  8. Real-World Integration – Examples with popular RAG frameworks

Key Parameters:

| Parameter        | Chunker             | Purpose                                    | Recommended Values                           |
|------------------|---------------------|--------------------------------------------|----------------------------------------------|
| max_tokens       | HybridChunker       | Maximum chunk size in tokens               | 256-512 for general use, 512-1024 for legal  |
| tokenizer        | HybridChunker       | Tokenizer used to count tokens             | Match your embedding model                   |
| merge_peers      | HybridChunker       | Merge undersized chunks with same headings | True (default) for fewer, fuller chunks      |
| merge_list_items | HierarchicalChunker | Keep list items together in one chunk      | True (default)                               |

Chunking Strategy Decision Tree:

Is precision more important than context?
├─ YES → Use small chunks (max_tokens=256)
└─ NO → Use medium/large chunks (max_tokens=512-1024)

Is storage/cost a concern?
├─ YES → Skip overlap entirely (Docling chunks don’t overlap by default)
└─ NO → Add a small post-processing overlap (32-64 tokens) for context continuity

Do you have structured legal content?
├─ YES → Larger chunks to preserve article boundaries
└─ NO → Standard medium chunks

Next Steps:

  1. Experiment: Test different chunk sizes with your specific use case
  2. Measure: Track retrieval accuracy and relevance with different strategies
  3. Integrate: Connect chunks to your vector database of choice
  4. Optimize: Monitor performance and adjust parameters accordingly
  5. Scale: Batch-process multiple documents (see the sketch below)
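
A minimal sketch of that batch step, assuming a local directory of PDFs; the legal_docs directory name and output layout are illustrative:

# Batch-convert and chunk every PDF in a directory (illustrative paths)
import json
from pathlib import Path

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
chunker = HybridChunker(max_tokens=512)
out_dir = Path("chunks")
out_dir.mkdir(exist_ok=True)

for pdf in Path("legal_docs").glob("*.pdf"):
    document = converter.convert(str(pdf)).document
    records = [
        {"chunk_id": i, "text": chunk.text, "source": pdf.name}
        for i, chunk in enumerate(chunker.chunk(document))
    ]
    out_file = out_dir / f"{pdf.stem}_chunks.json"
    out_file.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"✓ {pdf.name}: {len(records)} chunks")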

Additional Resources:

  • Docling Documentation: https://docling-project.github.io/docling/
  • Chunking Guide: See examples in the Docling repository
  • RAG Best Practices: LangChain and LlamaIndex documentation
  • Vector Databases: ChromaDB, Pinecone, Weaviate, Qdrant documentation
  • This Project’s Github Repo: https://github.com/cfocoder/Docling-Mexican-Constitution

Congratulations! You now have a comprehensive understanding of document chunking with Docling and are ready to build production-grade RAG applications for legal documents and beyond.
