DNA methylation changes are reliable biomarkers of aging, but the mechanisms driving these changes remain poorly understood. Here we present SCARLET (Stem Cells and Age-ReLated Epigenetic Trajectories), a parsimonious mathematical model that explains how methylation changes arise and propagate through hematopoietic stem cell divisions. Using a large human cohort, we demonstrate that seemingly distinct temporal patterns of age-related methylation changes can be explained by a single general mechanistic model of stem cell dynamics. We show that SCARLET captures known drivers of biological aging, with individuals with accelerated epigenetic aging showing significantly reduced ratios of stem cell pool size to symmetric division rate (N/s). Applying SCARLET to methylation data from 11 mammalian species reveals that N/s scales with maximum lifespan, suggesting that evolutionary adjustments to stem cell dynamics, rather than epigenetic maintenance efficiency, drive the previously observed relationship between methylation rates and lifespan. Our findings provide a quantitative framework for understanding epigenetic aging and suggest that stem cell dynamics may be a key driver of aging across mammals.
This repository implements a mechanistic model of DNA methylation dynamics based on stem cell division processes (SCARLET). The model captures how methylation patterns change with age across different mammalian species and human cohorts, using PyMC for Bayesian inference. This code is an accompaniment to our paper "The Role of Stem Cell Dynamics in Epigenetic Aging".
The model describes methylation level Z(t) as a function of:
The mean methylation evolves as:
Z(t) = n + exp(-2stω)(p - n)
See src/general_imports.py for complete mathematical derivations including variance terms.
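As a rough illustration of the mean trajectory above, the formula can be evaluated directly. This is a minimal sketch, not the implementation in src/general_imports.py: the function name `mean_methylation` and all parameter values are hypothetical. The symbols are not defined in this section, so the only properties used here are those that follow from the formula itself: Z(0) = p, and Z(t) approaches n as t grows when s·ω > 0.

```python
import numpy as np


def mean_methylation(t, n, p, s, omega):
    """Evaluate Z(t) = n + exp(-2 * s * t * omega) * (p - n).

    Hypothetical helper for illustration only; the argument names mirror the
    symbols in the formula and the values used below are not fitted estimates.
    """
    t = np.asarray(t, dtype=float)
    return n + np.exp(-2.0 * s * t * omega) * (p - n)


# Illustrative (made-up) parameter values.
ages = np.array([0.0, 20.0, 40.0, 80.0])
z = mean_methylation(ages, n=0.8, p=0.2, s=0.5, omega=0.02)

# The trajectory starts at p and relaxes exponentially towards n.
print(z)
```

With p < n, as in this example, the curve rises towards n; with p > n it decays towards n instead.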
The main project scripts are split into 3 categories:
These are the scripts used to preprocess the AnnData objects (see AnnData Structure below) to prepare them for analysis. Generally speaking, this means adding or calculating key variables for either the CpGs (e.g. mean methylation of a site) or the organism itself (e.g. maximum lifespan).
These are the scripts which run the various models.
These are the scripts which analyse the model runs. Generally speaking, these are the final scripts used to make the figures.
General package imports and re-used functions are stored within src/general_imports.py. Exports (e.g. model outputs, figures) are saved in exports. Data (e.g. the methylation AnnData objects) are stored within data.
See below for the full repository structure:
├── data/ # Data files
│ └── example_anndata.h5ad # Example methylation data
├── env/ # Environment configuration
│ └── prolif_clock.yml # Conda environment specification
├── exports/ # Output directory
│ ├── figures/ # Generated plots
│ └── model_outputs/ # Model results and fits
├── notebooks/ # Analysis workflows
│ ├── 0_data_preprocessing/ # Data preparation scripts
│ ├── 1_model_runs/ # Model fitting scripts
│ └── 2_post_run_analyses/ # Post-processing and visualization
└── src/ # Source code
└── general_imports.py # Core functions and model definitions
To install and activate the conda environment (to run all code using CPUs), run:
conda env create -f env/prol_env.yml
conda activate prol_env
Running the code on GPUs takes more setup because of package compatibility issues (e.g. with CUDA), and the details depend on the system and the GPU software available. The packages are the same as in the CPU setup, with the addition of "jax". In principle, any code written to run on GPUs should also run on CPUs, albeit much more slowly.
preprocessing_human_data.py
Preprocesses GenScot methylation data. Calculates CpG-level statistics including Spearman correlations, variance metrics, and regression coefficients. Adds computed statistics to AnnData object.
preprocessing_mammal_data.py
Preprocesses mammalian comparative methylation data across multiple species. Calculates CpG-level statistics and prepares data for cross-species modeling.
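The kind of CpG-level statistics described for the preprocessing scripts can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the code in the preprocessing scripts: the function names `rank` and `cpg_age_statistics` are hypothetical, the toy data are made up, and the rank step assumes no tied values (a tie-aware ranking would be needed for real ages).

```python
import numpy as np


def rank(values):
    """Ranks 0..k-1 along the last axis. Assumes no tied values."""
    return np.argsort(np.argsort(values, axis=-1), axis=-1).astype(float)


def cpg_age_statistics(beta, age):
    """Per-CpG statistics for beta (n_cpgs x n_samples) against age (n_samples)."""
    beta = np.asarray(beta, dtype=float)
    age = np.asarray(age, dtype=float)

    # Least-squares slope of methylation on age, one value per CpG.
    age_c = age - age.mean()
    beta_c = beta - beta.mean(axis=1, keepdims=True)
    slope = beta_c @ age_c / (age_c @ age_c)

    # Spearman correlation computed as the Pearson correlation of the ranks.
    r_age = rank(age)
    r_age_c = r_age - r_age.mean()
    r_beta = rank(beta)
    r_beta_c = r_beta - r_beta.mean(axis=1, keepdims=True)
    spearman = (r_beta_c @ r_age_c) / np.sqrt(
        (r_beta_c ** 2).sum(axis=1) * (r_age_c @ r_age_c)
    )

    return {
        "mean": beta.mean(axis=1),
        "variance": beta.var(axis=1, ddof=1),
        "slope": slope,
        "spearman": spearman,
    }


# Toy data: one CpG gaining methylation with age, one losing it.
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
beta = np.vstack([0.1 + 0.005 * age, 0.9 - 0.004 * age])
stats = cpg_age_statistics(beta, age)
```

Each entry of `stats` is an array with one value per CpG, which is the shape needed to store the results as columns of CpG-level metadata.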
run_humans_fixed_n_s.py
Runs conditional SCARLET model on human data with fixed N (stem cell pool size) and s (symmetric division rate) parameters. Relevant figures: Fig. 2a, Fig. 3a.
run_humans_cohorts_unconditional.py
Fits unconditional models allowing cohort-specific parameters. Relevant figures: Fig. 2c, Supp. Fig. 2a.
run_humans_trajectory_cats_fixed_n_s.py
Runs conditional SCARLET model on different categories of CpGs (by trajectory patterns). Includes comparisons with linear and null models. Relevant figures: Fig. 2b, Supp. Figs 1a-c.
run_humans_sensitivity_n_sites.py
Sensitivity analysis varying the number of CpG sites used in model fitting to assess robustness. Relevant figures: Supp. Fig. 2c.
run_humans_sensitivity_sample_size.py
Sensitivity analysis varying sample sizes to evaluate model stability and parameter estimation accuracy. Relevant figures: Supp. Fig. 2c.
run_humans_sensitivity_timespans.py
Sensitivity analysis examining model performance across different age ranges. Relevant figures: Supp. Figs 3a-b.
run_mammals_separate_models.py
Fits independent SCARLET models for each mammalian species to obtain species-specific parameter estimates. Relevant figures: Fig. 3b, Fig. 3c, Supp. Fig. 3c.
run_mammals_joint_models.py
Fits hierarchical SCARLET model with all mammals in a single joint model, sharing information across species. Relevant figures: Fig. 3d, Supp. Figs. 3d-i.
run_mouse_dog_fixed_n_s.py
Runs SCARLET model on mouse and dog data with fixed N and s parameters. Relevant figures: Fig. 3a.
analysis_humans.py
Comprehensive analysis of human GenScot data results. Generates heatmaps of log likelihoods across N and s, plots parameter distributions by group, analyzes site fits across CpG categories, and creates summary statistics tables. Relevant figures: Fig. 2a, Fig. 2b, Fig. 2c, Supp. Figs. 1a-c, Supp. Fig. 2a, Supp. Table 1
analysis_scaling.py
Cross-species scaling analysis. Plots N/s ratios vs. lifespan, examines methylation/demethylation probabilities across species, compares joint vs. separate models, and generates example site fits. Relevant figures: Fig. 3b, Fig. 3c, Fig. 3d, Supp. Figs. 3c-i.
analysis_sensitivity.py
Analyzes and visualizes results from all sensitivity analyses (sample size, time spans, number of sites). Evaluates model robustness and parameter stability. Relevant figures: Supp. Figs 2b-c, 3a-b
analysis_mouse_human_heatmap_lineplot.py
Generates comparative visualizations between mouse and human methylation patterns, including heatmaps and trajectory line plots. Relevant figures: Fig. 3a
AnnData Structure:
AnnData object
.X # Methylation beta values (n_cpgs × n_samples)
.obs # CpG metadata (r², mean, variance, etc.)
.var # Sample metadata (age, cohort, species, etc.)
Please contact Sam Crofts (sam.crofts@ed.ac.uk) for further details.