The Hypocrisy Gap

This repository contains notebooks to reproduce the experiments and results for "The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders."

The repository is notebook-driven:

hypocrisy_gap.ipynb: Core notebook for computing truth alignment, explanation alignment, and the Hypocrisy Gap, and for reproducing the main evaluation results.
sae_hypocrisy_universal.ipynb: Experiments using custom training of task-specific SAEs.
Task_specific_SAE_fine_tuning.ipynb: Experiments using fine-tuned SAEs.

Each notebook is self-contained and can be run independently to reproduce the corresponding results reported in the paper.

Requirements

Python 3.12
PyTorch
Hugging Face transformers and datasets
SAELens
TransformerLens
scikit-learn

Exact versions and hardware details are described in the paper.
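
Before launching the notebooks in a local environment, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the repository. The import names (`torch`, `transformers`, `datasets`, `sae_lens`, `transformer_lens`, `sklearn`) are the usual ones for these packages, and the helper name `find_missing` is hypothetical.

```python
import importlib.util
import sys

# Display name -> import name. These import names are assumptions based on
# how each package is normally imported, not taken from the notebooks.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "transformers": "transformers",
    "datasets": "datasets",
    "SAELens": "sae_lens",
    "TransformerLens": "transformer_lens",
    "scikit-learn": "sklearn",
}


def find_missing(modules):
    """Return display names of packages whose top-level module is not found."""
    return [
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version} (the README lists Python 3.12)")
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

This only checks that each package can be found. It does not verify versions, which are specified in the paper.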

Usage

Open and run the notebooks directly. All required dependencies and dataset loading are handled within the notebooks.

Notes

Experiments require access to internal model activations and pretrained or fine-tuned SAEs.
GPU access is strongly recommended for full reproduction.
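
To check whether a GPU is visible before starting a full run, a short sketch like the one below may be useful. It is an illustration only: the function name `pick_device` is hypothetical, and the notebooks may select their device differently.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    if device == "cpu":
        print("No CUDA GPU detected; full reproduction may be slow.")
```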
