This document outlines the recommended strategy for chunking documents and tagging metadata to support the retrieval engine defined in retrieval_pipeline_template.py.
Before any document can be used in the RAG pipeline, it must be broken into coherent, retrievable "chunks" and tagged with relevant metadata (e.g., source, topic). These chunks are then embedded and stored in ChromaDB via LlamaIndex's VectorStoreIndex.
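Before diving into the strategies, the chunk-and-tag flow can be sketched in plain Python. The record shape below is an assumption for illustration only, not the schema retrieval_pipeline_template.py actually uses:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TaggedChunk:
    # Hypothetical record shape: one retrievable unit plus its metadata.
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)

chunk = TaggedChunk(
    content="RAG systems retrieve relevant chunks before generation.",
    metadata={"source": "intro.md", "topic": "retrieval"},
)
print(chunk.metadata["source"])  # → intro.md
```

In the real pipeline, records like this would be wrapped in LlamaIndex nodes/documents before embedding and insertion into ChromaDB.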
Strategy 1: Recursive Character Chunking

Description:
A rule-based strategy that splits text based on character limits and logical separators such as paragraphs or sentences (the stub below uses character limits and overlap only).
Pros:
- Simple, fast, and deterministic; no model calls required.
- Chunk size and overlap are easy to tune.
Cons:
- Ignores semantic boundaries, so it may split mid-sentence or mid-thought.
- Fixed-size windows can separate related content across chunks.
Code Stub:
from typing import List

def recursive_chunking(text: str, chunk_size=300, chunk_overlap=50) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        # Advance by (size - overlap); assumes chunk_overlap < chunk_size.
        start += chunk_size - chunk_overlap
    return chunks
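A quick check of the windowing logic (the function is repeated here so the snippet runs on its own):

```python
from typing import List

def recursive_chunking(text: str, chunk_size=300, chunk_overlap=50) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        start += chunk_size - chunk_overlap  # assumes overlap < size
    return chunks

text = "abcdefghij" * 3  # 30 characters
chunks = recursive_chunking(text, chunk_size=10, chunk_overlap=2)
# Windows start at 0, 8, 16, 24 — two characters shared between neighbors.
print(chunks)  # → ['abcdefghij', 'ijabcdefgh', 'ghijabcdef', 'efghij']
```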
Strategy 2: Semantic Chunking

Description:
Uses embedding similarity to find topic boundaries, attempting to split where the content meaning shifts.
Pros:
- Splits at topic shifts, so chunks tend to stay semantically coherent.
- Adapts to the content rather than to a fixed character budget.
Cons:
- Requires sentence embeddings, adding cost and latency.
- Chunk sizes are unpredictable, and the threshold needs tuning per corpus.
Code Stub:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List

def semantic_chunking(sentences: List[str], embeddings: List[np.ndarray], threshold=0.7) -> List[str]:
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop below the threshold marks a topic boundary.
        sim = cosine_similarity([embeddings[i-1]], [embeddings[i]])[0][0]
        if sim < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
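The boundary logic can be demonstrated with toy 2-d "embeddings" (pure Python here so the demo needs no sklearn; real use would embed sentences with an actual model):

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunking_demo(sentences, embeddings, threshold=0.7):
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats nap.", "GPUs are fast."]
# First two vectors are nearly parallel; the third is orthogonal.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(semantic_chunking_demo(sentences, embeddings))
# → ['Cats purr. Cats nap.', 'GPUs are fast.']
```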
Strategy 3: Agentic Chunking

Description:
Uses a language model to split the document into semantically meaningful sections with human-readable titles.
Pros:
- Produces semantically meaningful sections with human-readable titles.
- Section titles double as metadata (topic/section labels).
Cons:
- LLM calls add cost, latency, and nondeterminism.
- The current stub falls back to explicit headings, so it does nothing for unstructured text.
Code Stub:
import re
from typing import Dict, List

def agentic_chunking(text: str) -> List[Dict[str, str]]:
    # Placeholder: splits on markdown "# " headings; in a later version
    # an LLM call would propose section boundaries and titles instead.
    sections = re.split(r"#\s", text)[1:]
    results = []
    for section in sections:
        lines = section.splitlines()
        title = lines[0].strip()
        content = "\n".join(lines[1:]).strip()
        results.append({"title": title, "content": content})
    return results
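Running the stub on a small markdown document (sample text is hypothetical; the function is repeated so the snippet runs standalone):

```python
import re
from typing import Dict, List

def agentic_chunking(text: str) -> List[Dict[str, str]]:
    # Heading-based placeholder for the LLM-driven splitter.
    sections = re.split(r"#\s", text)[1:]
    results = []
    for section in sections:
        lines = section.splitlines()
        results.append({
            "title": lines[0].strip(),
            "content": "\n".join(lines[1:]).strip(),
        })
    return results

doc = "# Intro\nWhat RAG is.\n# Setup\nInstall deps."
print(agentic_chunking(doc))
# → [{'title': 'Intro', 'content': 'What RAG is.'},
#    {'title': 'Setup', 'content': 'Install deps.'}]
```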
Each chunk should include metadata fields such as:
- source: filename or document origin
- topic: derived from headers or chunk titles
- section: optional, based on heading or LLM label

Implementation Note:
While LlamaIndex offers a MetadataExtractor, for v0.1 we may write our own lightweight tagging. A simple function can attach metadata during chunk generation, like so:
from typing import Any, Dict, List

def tag_metadata(chunks: List[str], source: str) -> List[Dict[str, Any]]:
    return [
        {
            "content": chunk,
            "metadata": {
                "source": source,
                "topic": infer_topic(chunk),  # optionally use a heuristic or LLM
            },
        }
        for chunk in chunks
    ]
Output: List[Dict[str, Any]] with content and metadata fields.
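With a trivial stand-in for infer_topic (first word of the chunk, purely illustrative), tagging looks like this:

```python
from typing import Any, Dict, List

def infer_topic(chunk: str) -> str:
    # Stand-in heuristic: use the chunk's first word; a real version
    # might match against known headers or call an LLM.
    words = chunk.split()
    return words[0].lower() if words else "unknown"

def tag_metadata(chunks: List[str], source: str) -> List[Dict[str, Any]]:
    return [
        {"content": chunk,
         "metadata": {"source": source, "topic": infer_topic(chunk)}}
        for chunk in chunks
    ]

tagged = tag_metadata(["Embeddings map text to vectors."], source="notes.md")
print(tagged[0]["metadata"])  # → {'source': 'notes.md', 'topic': 'embeddings'}
```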