The Hypocrisy Gap

This repository contains notebooks to reproduce the experiments and results for "The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders."

Contents

The repository is notebook-driven:

  • hypocrisy_gap.ipynb: Core notebook for computing truth alignment, explanation alignment, and the Hypocrisy Gap, and for reproducing the main evaluation results.

  • sae_hypocrisy_universal.ipynb: Experiments using custom training of task-specific SAEs.

  • Task_specific_SAE_fine_tuning.ipynb: Experiments using fine-tuned SAEs.

Each notebook is self-contained and can be run independently to reproduce the corresponding results reported in the paper.

Requirements

  • Python 3.12
  • PyTorch
  • Hugging Face transformers and datasets
  • SAELens
  • TransformerLens
  • scikit-learn

Exact versions and hardware details are described in the paper.
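
Before opening a notebook, it can help to confirm that the packages listed above are importable in the active environment. The snippet below is a minimal sketch and is not part of the repository: the helper name `missing_packages` is hypothetical, and the module names (`torch`, `transformers`, `datasets`, `sae_lens`, `transformer_lens`, `sklearn`) are assumed to be the usual import names for these packages.

```python
import importlib.util
import sys

# Assumed mapping from the packages listed under Requirements
# to the module names they are normally imported as.
REQUIRED = {
    "PyTorch": "torch",
    "transformers": "transformers",
    "datasets": "datasets",
    "SAELens": "sae_lens",
    "TransformerLens": "transformer_lens",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the display names of packages whose module cannot be found."""
    return [
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    # Report the interpreter version and any packages that are not installed.
    print("Python", sys.version.split()[0])
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be located; it does not verify versions, which are given in the paper.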

Usage

Open and run the notebooks directly. All required dependencies and dataset loading are handled within the notebooks.

Notes

  • Experiments require access to internal model activations and pretrained or fine-tuned SAEs.
  • GPU access is strongly recommended for full reproduction.
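
To see whether a GPU is visible before starting a long run, a check along the following lines may be useful. This is a sketch under assumptions, not code from the notebooks: `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch is not installed or reports no CUDA device.

```python
import importlib.util


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so there is nothing to query.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The returned string can be passed wherever a notebook expects a device name; how each notebook selects its device is not specified here.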