SCARLET

The Role of Stem Cell Dynamics in Epigenetic Aging

DNA methylation changes are reliable biomarkers of aging, but the mechanisms driving these changes remain poorly understood. Here we present SCARLET (Stem Cells and Age-ReLated Epigenetic Trajectories), a parsimonious mathematical model that explains how methylation changes arise and propagate through hematopoietic stem cell divisions. Using a large human cohort, we demonstrate that seemingly distinct temporal patterns of age-related methylation changes can be explained by a single general mechanistic model of stem cell dynamics. We show that SCARLET captures known drivers of biological aging, with individuals with accelerated epigenetic aging showing significantly reduced ratios of stem cell pool size to symmetric division rate (N/s). Applying SCARLET to methylation data from 11 mammalian species reveals that N/s scales with maximum lifespan, suggesting that evolutionary adjustments to stem cell dynamics, rather than epigenetic maintenance efficiency, drive the previously observed relationship between methylation rates and lifespan. Our findings provide a quantitative framework for understanding epigenetic aging and suggest that stem cell dynamics may be a key driver of aging across mammals.

Overview

This repository implements SCARLET, a mechanistic model of DNA methylation dynamics based on stem cell division processes. The model captures how methylation patterns change with age across different mammalian species and human cohorts, using PyMC for Bayesian inference. The code accompanies our paper "The Role of Stem Cell Dynamics in Epigenetic Aging".

Mathematical Model

The model describes methylation level Z(t) as a function of:

  • N: Number of stem cells
  • s: Division rate per stem cell per year
  • Pm (PM->U): Probability (per stem cell division) of a CpG site changing from methylated to unmethylated
  • Pu (PU->M): Probability (per stem cell division) of a CpG site changing from unmethylated to methylated
  • n (η): Theoretical equilibrium methylation level (Pu/(Pm + Pu))
  • w (ω): Combined methylation/demethylation probability (Pm + Pu)
  • p: Initial methylation level at t=0

The mean methylation evolves as:

Z(t) = η + exp(-2sωt)(p - η)

See src/general_imports.py for complete mathematical derivations including variance terms.
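As a quick illustration of the mean trajectory above, the following is a minimal sketch in NumPy. It is not the code in src/general_imports.py: the function names `equilibrium_and_rate` and `mean_methylation` are hypothetical, and the parameter values are arbitrary placeholders chosen only to show the shape of the curve, not estimates from the paper.

```python
import numpy as np


def equilibrium_and_rate(p_m, p_u):
    """Return (eta, omega) from the per-division switching probabilities.

    eta   = Pu / (Pm + Pu)  -> theoretical equilibrium methylation level
    omega = Pm + Pu         -> combined methylation/demethylation probability
    """
    omega = p_m + p_u
    eta = p_u / omega
    return eta, omega


def mean_methylation(t, s, eta, omega, p):
    """Mean methylation Z(t) = eta + exp(-2 * s * omega * t) * (p - eta)."""
    t = np.asarray(t, dtype=float)
    return eta + np.exp(-2.0 * s * omega * t) * (p - eta)


# Arbitrary placeholder values (hypothetical, for illustration only).
eta, omega = equilibrium_and_rate(p_m=0.001, p_u=0.003)
ages = np.array([0.0, 20.0, 40.0, 80.0])
z = mean_methylation(ages, s=1.0, eta=eta, omega=omega, p=0.1)
```

With these placeholder values, the site starts at p = 0.1 and relaxes monotonically toward η as age increases. A site with p above η would instead lose methylation over time.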

Project layout

The main project scripts are split into 3 categories:

1) Preprocessing scripts (prefix: "preprocessing")

These are the scripts used to preprocess the AnnData objects (see Data Format below for details) to prepare them for analysis. Generally speaking, this means adding or calculating key variables for either the CpGs (e.g. mean methylation of a site) or the organism itself (e.g. maximum lifespan).

2) Running scripts (prefix: "run")

These are the scripts which run the various models.

3) Analysis scripts (prefix: "analysis")

These are the scripts which analyse the model runs. Generally speaking, these are the final scripts used to make the figures.

Other files and folders

General package imports and re-used functions are stored in src/general_imports.py. Exports (e.g. model outputs, figures) are saved in exports/. Data (e.g. the methylation AnnData objects) are stored in data/.

See below for the full repository structure:

├── data/                           # Data files
│   └── example_anndata.h5ad       # Example methylation data
├── env/                           # Environment configuration
│   └── prolif_clock.yml          # Conda environment specification
├── exports/                       # Output directory
│   ├── figures/                  # Generated plots
│   └── model_outputs/            # Model results and fits
├── notebooks/                     # Analysis workflows
│   ├── 0_data_preprocessing/     # Data preparation scripts
│   ├── 1_model_runs/             # Model fitting scripts
│   └── 2_post_run_analyses/      # Post-processing and visualization
└── src/                          # Source code
    └── general_imports.py        # Core functions and model definitions

Installation

To install and activate the conda environment (to run all code using CPUs), run:

conda env create -f env/prol_env.yml
conda activate prol_env

Running code on GPUs requires a more involved setup, because package compatibility (e.g. with CUDA) depends on the system and the GPU software available. The required packages are the same as for the CPU setup, with the addition of "jax". Any code that runs on GPUs should, in principle, also run on CPUs, albeit much more slowly.

Key Dependencies

  • PyMC 5.5.0 - Probabilistic programming
  • PyTensor - Backend for automatic differentiation
  • NumPyro/JAX - Alternative MCMC sampling
  • AnnData - Methylation data storage
  • ArviZ - Bayesian model diagnostics
  • Pandas, NumPy - Data manipulation
  • Matplotlib, Seaborn, Plotly - Visualization

Description of main scripts:

0. Data Preprocessing

preprocessing_human_data.py
Preprocesses GenScot methylation data. Calculates CpG-level statistics including Spearman correlations, variance metrics, and regression coefficients. Adds computed statistics to AnnData object.

preprocessing_mammal_data.py
Preprocesses mammalian comparative methylation data across multiple species. Calculates CpG-level statistics and prepares data for cross-species modeling.
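
To make the idea of CpG-level statistics concrete, here is a minimal NumPy sketch that computes a per-CpG mean, variance, linear regression slope against age, and r². It is not the repository's preprocessing code: the function name `cpg_statistics` and the toy values are hypothetical, it assumes a matrix with CpGs as rows and samples as columns, and it leaves out the Spearman correlations that the actual scripts calculate.

```python
import numpy as np


def cpg_statistics(beta, age):
    """Per-CpG summary statistics for a (n_cpgs, n_samples) beta matrix.

    Assumes every CpG varies across samples and ages are not all equal;
    otherwise the slope or r2 would involve a division by zero.
    """
    beta = np.asarray(beta, dtype=float)
    age = np.asarray(age, dtype=float)

    mean = beta.mean(axis=1)
    variance = beta.var(axis=1, ddof=1)  # sample variance across samples

    # Ordinary least-squares slope of methylation on age, one fit per CpG.
    age_c = age - age.mean()
    beta_c = beta - mean[:, None]
    slope = beta_c @ age_c / (age_c @ age_c)
    intercept = mean - slope * age.mean()

    # Coefficient of determination of the linear fit.
    residuals = beta_c - slope[:, None] * age_c[None, :]
    r2 = 1.0 - (residuals ** 2).sum(axis=1) / (beta_c ** 2).sum(axis=1)

    return {"mean": mean, "variance": variance, "slope": slope,
            "intercept": intercept, "r2": r2}


# Toy values for illustration only (not real methylation data).
toy_age = np.array([20.0, 40.0, 60.0, 80.0])
toy_beta = np.array([
    [0.3, 0.4, 0.5, 0.6],   # gains methylation with age
    [0.8, 0.7, 0.7, 0.6],   # loses methylation with age
])
stats = cpg_statistics(toy_beta, toy_age)
```

Each entry of `stats` is an array with one value per CpG, which is the shape needed to store it as a column of per-CpG metadata.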

1. Model Runs

run_humans_fixed_n_s.py
Runs conditional SCARLET model on human data with fixed N (stem cells) and s (division rate) parameters. Relevant figures: Fig. 2a, Fig. 3a.

run_humans_cohorts_unconditional.py
Fits unconditional models allowing cohort-specific parameters. Relevant figures: Fig. 2c, Supp. Fig. 2a.

run_humans_trajectory_cats_fixed_n_s.py
Runs conditional SCARLET model on different categories of CpGs (by trajectory patterns). Includes comparisons with linear and null models. Relevant figures: Fig. 2b, Supp. Figs 1a-c.

run_humans_sensitivity_n_sites.py
Sensitivity analysis varying the number of CpG sites used in model fitting to assess robustness. Relevant figures: Supp. Fig. 2c.

run_humans_sensitivity_sample_size.py
Sensitivity analysis varying sample sizes to evaluate model stability and parameter estimation accuracy. Relevant figures: Supp. Fig. 2c.

run_humans_sensitivity_timespans.py
Sensitivity analysis examining model performance across different age ranges. Relevant figures: Supp. Figs 3a-b.

run_mammals_separate_models.py
Fits independent SCARLET models for each mammalian species to obtain species-specific parameter estimates. Relevant figures: Fig. 3b, Fig. 3c, Supp. Fig. 3c.

run_mammals_joint_models.py
Fits hierarchical SCARLET model with all mammals in a single joint model, sharing information across species. Relevant figures: Fig. 3d, Supp. Figs. 3d-i.

run_mouse_dog_fixed_n_s.py
Runs SCARLET model on mouse and dog data with fixed N and s parameters. Relevant figures: Fig. 3a.

2. Post-Run Analyses

analysis_humans.py
Comprehensive analysis of human GenScot data results. Generates heatmaps of log likelihoods across N and s, plots parameter distributions by group, analyzes site fits across CpG categories, and creates summary statistics tables. Relevant figures: Fig. 2a, Fig. 2b, Fig. 2c, Supp. Figs. 1a-c, Supp. Fig. 2a, Supp. Table 1

analysis_scaling.py
Cross-species scaling analysis. Plots N/s ratios vs. lifespan, examines methylation/demethylation probabilities across species, compares joint vs. separate models, and generates example site fits. Relevant figures: Fig. 3b, Fig. 3c, Fig. 3d, Supp. Figs. 3c-i

analysis_sensitivity.py
Analyzes and visualizes results from all sensitivity analyses (sample size, time spans, number of sites). Evaluates model robustness and parameter stability. Relevant figures: Supp. Figs 2b-c, 3a-b

analysis_mouse_human_heatmap_lineplot.py
Generates comparative visualizations between mouse and human methylation patterns, including heatmaps and trajectory line plots. Relevant figures: Fig. 3a

Data Format

AnnData Structure:

AnnData object
  .X          # Methylation beta values (n_cpgs × n_samples)
  .obs        # CpG metadata (r², mean, variance, etc.)
  .var        # Sample metadata (age, cohort, species, etc.)

Contact

Please contact Sam Crofts (sam.crofts@ed.ac.uk) for further details.