# Chunking Strategy for Vector Store Indexing

This document outlines the recommended strategy for chunking documents and tagging metadata to support the retrieval engine defined in `retrieval_pipeline_template.py`.

## Overview

Before any document can be used in the RAG pipeline, it must be broken into coherent, retrievable "chunks" and tagged with relevant metadata (e.g., source, topic). These chunks are then embedded and stored in ChromaDB via LlamaIndex's `VectorStoreIndex` (a minimal indexing sketch appears at the end of this document).

## Chunking Strategies

### 1. Recursive Chunking

**Description:** A rule-based strategy that splits text based on character limits and logical separators such as paragraphs or sentences.

**Pros:**
- Fast and deterministic
- Easy to implement and configure
- Works well for structured documents

**Cons:**
- Can break in the middle of concepts or sections
- No understanding of topic shifts

**Code Stub:**
```python
from typing import List


def recursive_chunking(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks that overlap their neighbors."""
    chunks = []
    start = 0
    step = chunk_size - chunk_overlap  # advance less than chunk_size so chunks overlap
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        start += step
    return chunks
```

---

### 2. Semantic Chunking

**Description:** Uses embedding similarity to find topic boundaries, attempting to split where the meaning of the content shifts.

**Pros:**
- More coherent chunks based on meaning
- Useful for mixed-topic documents or transcripts

**Cons:**
- Requires embeddings, so it is slower than recursive chunking
- Needs tuning and threshold logic

**Code Stub:**
```python
from typing import List

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def semantic_chunking(sentences: List[str], embeddings: List[np.ndarray], threshold: float = 0.7) -> List[str]:
    """Group consecutive sentences, starting a new chunk when similarity drops."""
    if not sentences:
        return []
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence's embedding to its predecessor's.
        sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if sim < threshold:
            # Low similarity suggests a topic boundary: close the current chunk.
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```

---

### 3. Agentic Chunking

**Description:** Uses a language model to split the document into semantically meaningful sections with human-readable titles.

**Pros:**
- Human-like segmentation
- Generates section titles that can be reused as metadata

**Cons:**
- Requires LLM calls (slow, potentially costly)
- May be inconsistent without careful prompting

**Code Stub:**
```python
import re
from typing import Dict, List


def agentic_chunking(text: str) -> List[Dict[str, str]]:
    """Placeholder: splits on top-level Markdown headings in lieu of actual LLM calls."""
    sections = re.split(r"^#\s+", text, flags=re.MULTILINE)[1:]
    results = []
    for section in sections:
        lines = section.splitlines()
        if not lines:
            continue  # skip empty sections (e.g., a trailing bare heading)
        title = lines[0].strip()
        content = "\n".join(lines[1:]).strip()
        results.append({"title": title, "content": content})
    return results
```

---

## Metadata Tagging Strategy

Each chunk should include metadata fields such as:

- `source`: filename or document origin
- `topic`: derived from headers or chunk title
- `section`: optional, based on heading or LLM label

**Implementation Note:** While LlamaIndex offers a `MetadataExtractor`, for v0.1 we may write our own lightweight tagging instead.
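The `topic` field can come from a heuristic or an LLM call. The `infer_topic` helper referenced by `tag_metadata` below is not yet implemented; a minimal heuristic sketch (assuming a Markdown-style heading on the chunk's first line, falling back to its opening words) might look like:

```python
def infer_topic(chunk: str) -> str:
    """Hypothetical heuristic: use a leading Markdown heading as the topic,
    otherwise fall back to the chunk's first few words."""
    first_line = chunk.strip().splitlines()[0] if chunk.strip() else ""
    if first_line.startswith("#"):
        return first_line.lstrip("#").strip()
    return " ".join(first_line.split()[:5])
```

An LLM-based version could replace this later without changing the `tag_metadata` interface.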
A simple function can then attach metadata during chunk generation, like so:

```python
from typing import Any, Dict, List


def tag_metadata(chunks: List[str], source: str) -> List[Dict[str, Any]]:
    return [
        {
            "content": chunk,
            "metadata": {
                "source": source,
                "topic": infer_topic(chunk),  # heuristic sketch above; could also be an LLM call
            },
        }
        for chunk in chunks
    ]
```

---

## Recommendation

- Use **recursive chunking** as the default v0.1 strategy
- Structure all chunkers to be modular and switchable via config (see the sketch below)
- Align output format with Eddie's ingestion pipeline: `List[Dict[str, Any]]` with `content` and `metadata` fields
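To make the strategies switchable via config, a small dispatch table can map a config key to a chunker function. A minimal sketch, assuming a hypothetical config dict with a `"chunking_strategy"` key and the functions defined above:

```python
from typing import Callable, Dict, List

# Hypothetical registry; the semantic and agentic chunkers would need thin
# adapters so every entry shares the (text: str) -> List[str] signature.
CHUNKERS: Dict[str, Callable[[str], List[str]]] = {
    "recursive": recursive_chunking,
}


def chunk_document(text: str, config: Dict[str, str]) -> List[str]:
    """Pick a chunker by config key, defaulting to the v0.1 recursive strategy."""
    strategy = config.get("chunking_strategy", "recursive")
    return CHUNKERS[strategy](text)
```

Swapping strategies then becomes a one-line config change rather than a code change.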
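Finally, the Overview's embed-and-store step might look like the following. This is a sketch, assuming the post-0.10 `llama-index` package layout plus the `llama-index-vector-stores-chroma` and `chromadb` packages; the collection name and on-disk path are placeholders:

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.vector_stores.chroma import ChromaVectorStore


def build_index(tagged_chunks):
    # Persistent local Chroma collection; path and name are placeholders.
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection("rag_chunks")

    # Wrap the collection so LlamaIndex can write embeddings into it.
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Convert tagged chunks (the tag_metadata output format) into nodes.
    nodes = [
        TextNode(text=c["content"], metadata=c["metadata"])
        for c in tagged_chunks
    ]
    # Embeds with the globally configured embedding model (OpenAI by default).
    return VectorStoreIndex(nodes, storage_context=storage_context)
```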