# The Hypocrisy Gap

This repository contains notebooks to reproduce the experiments and results for "The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders."

## Contents

The repository is notebook-driven:

- `hypocrisy_gap.ipynb`: Core notebook for computing truth alignment, explanation alignment, and the Hypocrisy Gap, and for reproducing the main evaluation results.
- `sae_hypocrisy_universal.ipynb`: Experiments using custom training of task-specific SAEs.
- `Task_specific_SAE_fine_tuning.ipynb`: Experiments using fine-tuned SAEs.

Each notebook is self-contained and can be run independently to reproduce the corresponding results reported in the paper.

## Requirements

- Python 3.12
- PyTorch
- Hugging Face `transformers` and `datasets`
- SAELens
- TransformerLens
- scikit-learn

Exact versions and hardware details are described in the paper.

## Usage

Open and run the notebooks directly. All required dependency installation and dataset loading are handled within the notebooks.

## Notes

- Experiments require access to internal model activations and pretrained or fine-tuned SAEs.
- GPU access is strongly recommended for full reproduction.
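
For orientation, here is a minimal sketch of the kind of quantity the notebooks compute: if the model's internal belief and its chain-of-thought explanation are each scored for alignment with the ground truth, the Hypocrisy Gap is their difference. The function names, toy feature vectors, and the use of cosine similarity below are illustrative assumptions, not the paper's exact procedure; see the notebooks for the actual implementation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hypocrisy_gap(belief_feats, explanation_feats, truth_feats):
    # Truth alignment: how well internal-belief features match the truth.
    truth_alignment = cosine(belief_feats, truth_feats)
    # Explanation alignment: how well the CoT explanation matches the truth.
    explanation_alignment = cosine(explanation_feats, truth_feats)
    # A positive gap means the internal belief tracks the truth
    # while the stated explanation diverges from it.
    return truth_alignment - explanation_alignment

# Toy SAE-style feature vectors (illustrative only).
truth = np.array([1.0, 0.0, 1.0])
belief = np.array([0.9, 0.1, 1.1])       # close to the truth
explanation = np.array([0.1, 1.0, 0.0])  # diverges from the truth

gap = hypocrisy_gap(belief, explanation, truth)  # positive for this example
```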