This repository contains notebooks to reproduce the experiments and results for "The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders."
The repository is notebook-driven:
hypocrisy_gap.ipynb: Core notebook for computing truth alignment, explanation alignment, and the Hypocrisy Gap, and for reproducing the main evaluation results.
sae_hypocrisy_universal.ipynb: Experiments using custom training of task-specific SAEs.
Task_specific_SAE_fine_tuning.ipynb: Experiments using fine-tuned SAEs.
Each notebook is self-contained and can be run independently to reproduce the corresponding results reported in the paper.
Dependencies include transformers and datasets. Exact versions and hardware details are described in the paper.
Open and run the notebooks directly. All required dependencies and dataset loading are handled within the notebooks.
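Although the notebooks handle their own setup, a quick pre-flight check can confirm that the two packages named above are importable in the current environment. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED` tuple are hypothetical, and the notebooks may rely on packages beyond the two listed here.

```python
from importlib.util import find_spec

# Packages named in this README; the notebooks may install or import others.
REQUIRED = ("transformers", "datasets")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("transformers and datasets are importable.")
```

If the check reports a missing package, running the setup cells at the top of the relevant notebook should take care of it.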