# The Hypocrisy Gap

This repository contains notebooks to reproduce the experiments and results for "The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders."

## Contents

The repository is notebook-driven:

- `hypocrisy_gap.ipynb`: Core notebook for computing truth alignment, explanation alignment, and the Hypocrisy Gap, and for reproducing the main evaluation results.
- `sae_hypocrisy_universal.ipynb`: Experiments using custom training of task-specific SAEs.
- `Task_specific_SAE_fine_tuning.ipynb`: Experiments using fine-tuned SAEs.

Each notebook is self-contained and can be run independently to reproduce the corresponding results reported in the paper.

## Requirements

- Python 3.12
- PyTorch
- Hugging Face `transformers` and `datasets`
- SAELens
- TransformerLens
- scikit-learn

Exact versions and hardware details are described in the paper.

## Usage

Open and run the notebooks directly. All required dependency installation and dataset loading are handled within the notebooks.

## Notes

- Experiments require access to internal model activations and pretrained or fine-tuned SAEs.
- GPU access is strongly recommended for full reproduction.
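
For orientation, here is a minimal sketch of the kind of quantity the notebooks compute: if the model's internal belief and its chain-of-thought explanation are each scored for alignment with the ground truth, the Hypocrisy Gap is their difference. The function names, toy feature vectors, and the use of cosine similarity below are illustrative assumptions, not the paper's exact procedure; see the notebooks for the actual implementation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hypocrisy_gap(belief_feats, explanation_feats, truth_feats):
    # Truth alignment: how well internal-belief features match the truth.
    truth_alignment = cosine(belief_feats, truth_feats)
    # Explanation alignment: how well the CoT explanation matches the truth.
    explanation_alignment = cosine(explanation_feats, truth_feats)
    # A positive gap means the internal belief tracks the truth
    # while the stated explanation diverges from it.
    return truth_alignment - explanation_alignment

# Toy SAE-style feature vectors (illustrative only).
truth = np.array([1.0, 0.0, 1.0])
belief = np.array([0.9, 0.1, 1.1])       # close to the truth
explanation = np.array([0.1, 1.0, 0.0])  # diverges from the truth

gap = hypocrisy_gap(belief, explanation, truth)  # positive for this example
```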