This document outlines the recommended strategy for chunking documents and tagging metadata to support the retrieval engine defined in retrieval_pipeline_template.py.
Before any document can be used in the RAG pipeline, it must be broken into coherent, retrievable "chunks" and tagged with relevant metadata (e.g., source, topic). These chunks are then embedded and stored in ChromaDB via LlamaIndex's VectorStoreIndex.
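Before diving into the strategies, the chunk-and-tag flow can be sketched in plain Python. The record shape below is an assumption for illustration only, not the schema retrieval_pipeline_template.py actually uses:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TaggedChunk:
    # Hypothetical record shape: one retrievable unit plus its metadata.
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)

chunk = TaggedChunk(
    content="RAG systems retrieve relevant chunks before generation.",
    metadata={"source": "intro.md", "topic": "retrieval"},
)
print(chunk.metadata["source"])  # → intro.md
```

In the real pipeline, records like this would be wrapped in LlamaIndex nodes/documents before embedding and insertion into ChromaDB.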
Strategy 1: Recursive Character Chunking

Description:
A rule-based strategy that splits text based on character limits and logical separators such as paragraphs or sentences (the stub below uses character limits and overlap only).
Pros:
- Simple, fast, and deterministic; no model calls required.
- Chunk size and overlap are easy to tune.
Cons:
- Ignores semantic boundaries, so it may split mid-sentence or mid-thought.
- Fixed-size windows can separate related content across chunks.
Code Stub:
from typing import List

def recursive_chunking(text: str, chunk_size=300, chunk_overlap=50) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        # Advance by (size - overlap); assumes chunk_overlap < chunk_size.
        start += chunk_size - chunk_overlap
    return chunks
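A quick check of the windowing logic (the function is repeated here so the snippet runs on its own):

```python
from typing import List

def recursive_chunking(text: str, chunk_size=300, chunk_overlap=50) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        start += chunk_size - chunk_overlap  # assumes overlap < size
    return chunks

text = "abcdefghij" * 3  # 30 characters
chunks = recursive_chunking(text, chunk_size=10, chunk_overlap=2)
# Windows start at 0, 8, 16, 24 — two characters shared between neighbors.
print(chunks)  # → ['abcdefghij', 'ijabcdefgh', 'ghijabcdef', 'efghij']
```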
Strategy 2: Semantic Chunking

Description:
Uses embedding similarity to find topic boundaries, attempting to split where the content meaning shifts.
Pros:
- Splits at topic shifts, so chunks tend to stay semantically coherent.
- Adapts to the content rather than to a fixed character budget.
Cons:
- Requires sentence embeddings, adding cost and latency.
- Chunk sizes are unpredictable, and the threshold needs tuning per corpus.
Code Stub:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def semantic_chunking(sentences: List[str], embeddings: List[np.ndarray], threshold=0.7) -> List[str]:
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop below the threshold marks a topic boundary.
        sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
        if sim < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
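The boundary logic can be demonstrated with toy 2-d "embeddings" (pure Python here so the demo needs no sklearn; real use would embed sentences with an actual model):

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunking_demo(sentences, embeddings, threshold=0.7):
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats nap.", "GPUs are fast."]
# First two vectors are nearly parallel; the third is orthogonal.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(semantic_chunking_demo(sentences, embeddings))
# → ['Cats purr. Cats nap.', 'GPUs are fast.']
```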
Strategy 3: Agentic Chunking

Description:
Uses a language model to split the document into semantically meaningful sections with human-readable titles.
Pros:
- Produces semantically meaningful sections with human-readable titles.
- Section titles double as metadata (topic/section labels).
Cons:
- LLM calls add cost, latency, and nondeterminism.
- The current stub falls back to explicit headings, so it does nothing for unstructured text.
Code Stub:
import re
from typing import Dict, List

def agentic_chunking(text: str) -> List[Dict[str, str]]:
    # Placeholder: splits on markdown "# " headings; in a later version
    # an LLM call would propose section boundaries and titles instead.
    sections = re.split(r"#\s", text)[1:]
    results = []
    for section in sections:
        lines = section.splitlines()
        title = lines[0].strip()
        content = "\n".join(lines[1:]).strip()
        results.append({"title": title, "content": content})
    return results
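Running the stub on a small markdown document (sample text is hypothetical; the function is repeated so the snippet runs standalone):

```python
import re
from typing import Dict, List

def agentic_chunking(text: str) -> List[Dict[str, str]]:
    # Heading-based placeholder for the LLM-driven splitter.
    sections = re.split(r"#\s", text)[1:]
    results = []
    for section in sections:
        lines = section.splitlines()
        results.append({
            "title": lines[0].strip(),
            "content": "\n".join(lines[1:]).strip(),
        })
    return results

doc = "# Intro\nWhat RAG is.\n# Setup\nInstall deps."
print(agentic_chunking(doc))
# → [{'title': 'Intro', 'content': 'What RAG is.'},
#    {'title': 'Setup', 'content': 'Install deps.'}]
```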
Each chunk should include metadata fields such as:
- source: filename or document origin
- topic: derived from headers or chunk titles
- section: optional, based on heading or LLM label

Implementation Note:
While LlamaIndex offers a MetadataExtractor, for v0.1 we may write our own lightweight tagging. A simple function can attach metadata during chunk generation, like so:
from typing import Any, Dict, List

def tag_metadata(chunks: List[str], source: str) -> List[Dict[str, Any]]:
    return [
        {
            "content": chunk,
            "metadata": {
                "source": source,
                "topic": infer_topic(chunk),  # optionally use a heuristic or LLM
            },
        }
        for chunk in chunks
    ]
Output: List[Dict[str, Any]] with content and metadata fields.
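With a trivial stand-in for infer_topic (first word of the chunk, purely illustrative), tagging looks like this:

```python
from typing import Any, Dict, List

def infer_topic(chunk: str) -> str:
    # Stand-in heuristic: use the chunk's first word; a real version
    # might match against known headers or call an LLM.
    words = chunk.split()
    return words[0].lower() if words else "unknown"

def tag_metadata(chunks: List[str], source: str) -> List[Dict[str, Any]]:
    return [
        {"content": chunk,
         "metadata": {"source": source, "topic": infer_topic(chunk)}}
        for chunk in chunks
    ]

tagged = tag_metadata(["Embeddings map text to vectors."], source="notes.md")
print(tagged[0]["metadata"])  # → {'source': 'notes.md', 'topic': 'embeddings'}
```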