Running the AML Demo

We recommend the Acute Myeloid Leukemia (AML) dataset as a demo for running our model because it is the smallest dataset used in our study. To change the number of training epochs, edit the model_config.py file. In our experiments we set a maximum of 500 epochs with early stopping, so training may halt before reaching that limit.
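For intuition about why a 500-epoch budget often ends sooner, the sketch below implements patience-based early stopping in plain Python. It is an illustration only, not the repository's training loop: the function name `train_with_early_stopping`, the `patience` value, and the simulated validation losses are all assumptions. The settings actually used are defined in model_config.py.

```python
def train_with_early_stopping(val_losses, max_epochs=500, patience=3):
    """Return (epochs_run, best_epoch, best_loss) for per-epoch validation losses.

    Hypothetical helper: val_losses stands in for the validation loss that a
    real training loop would compute after each epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch, best_loss


# Simulated losses (made up): improve for 4 epochs, then plateau.
simulated = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.68, 0.60]
epochs_run, best_epoch, best_loss = train_with_early_stopping(
    simulated, max_epochs=500, patience=3
)
print(epochs_run, best_epoch, best_loss)
```

With these simulated values the loop stops well before the epoch limit, which is why the runtime of a 500-epoch configuration depends on how quickly the validation loss plateaus.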

AML Model Configuration Files:

Follow the instructions to download the AML dataset, set up the experiment paths, preprocess the data, and create the splits.
Once the 5-fold splits are set up, the estimated runtime for all folds is:

scMEDAL-FE: ~8 minutes (500 epochs with early stopping)
scMEDAL-RE: ~24 minutes (500 epochs with early stopping)

These runtimes were measured on an NVIDIA Tesla P4 GPU (8 GB memory).

Example: Running scMEDAL-FE

To run scMEDAL-FE, execute the following commands:

# Navigate to the model directory
cd /Experiments/AML/run_models/scMEDAL-FE

# Activate the environment
source activate /path/to/run_models_env

# Run the model for all folds
python run_scMEDAL-FE_allfolds.py

Additional Documentation

For more detailed instructions, refer to:

Experiment Reproducibility Guide

This guide provides instructions for reproducing the experiments described in our paper. It maps each section of the paper to the corresponding code and datasets, ensuring that you can run the models and scripts as described.

Experiments overview

We conducted experiments on three datasets: Healthy Heart, Autism Spectrum Disorder (ASD), and Acute Myeloid Leukemia (AML). For each dataset, we provide:

Preprocessing scripts and 5-fold cross-validation scripts that split the data into training, validation, and test sets.
Directories and scripts to run various models (e.g., Autoencoder, scMEDAL variants, Mixed Effects Classifier).
Scripts to generate and compare results (e.g., clustering scores, Genomaps, and UMAP plots).
Example configuration files (model_config.py) for each model, along with a list of variables and hyperparameters needed to reproduce our experiments. You can review these details here.
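As a rough picture of what a 5-fold split into training, validation, and test sets can look like, here is a minimal standard-library sketch. The rotation scheme (fold k as test, the next fold as validation, the remaining folds as training), the function name `five_fold_splits`, and the placeholder cell IDs are assumptions for illustration. The cross-validation scripts in each dataset directory define the splits actually used and may differ, for example by stratifying.

```python
import random


def five_fold_splits(cell_ids, n_folds=5, seed=0):
    """Yield (train, val, test) lists of cell IDs, one triple per fold.

    Assumed scheme (not confirmed by the source): fold k is the test set,
    fold (k + 1) % n_folds is the validation set, the rest is training.
    """
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle for repeatable splits
    # Deal the shuffled IDs round-robin into n_folds nearly equal chunks.
    chunks = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds
        test = chunks[k]
        val = chunks[val_k]
        train = [cid for j, chunk in enumerate(chunks)
                 if j not in (k, val_k) for cid in chunk]
        yield train, val, test


cells = [f"cell_{i}" for i in range(20)]  # placeholder cell IDs
splits = list(five_fold_splits(cells))
for fold, (train, val, test) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train)} val={len(val)} test={len(test)}")
```

Each cell appears in exactly one test set across the five folds, so every cell is evaluated once as held-out data.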

Note: Due to variability in TensorFlow, model outputs may differ slightly across runs. To account for this, we report 95% confidence intervals (CI) as an estimate of variability.

Healthy Heart dataset

Data source:
The Healthy Heart dataset is available from Yu et al. (2023) at figshare.

Preprocessing

Models and scripts to reproduce results sections (RS)

RS 2.2: scMEDAL subnetworks create complementary batch-invariant and batch-specific latent spaces in the Healthy Heart dataset

RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL

Mixed Effects Classifier (MEC)

RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation

Comparison scripts

RS 2.2 and 2.7 Clustering scores
RS 2.5 Generate genomaps
RS 2.2 and 2.7 Generate UMAPs

For details on setting input and output paths for the Healthy Heart dataset, please refer to the [path setup instructions].

Autism Spectrum Disorder (ASD) dataset

Data source:
The ASD dataset can be accessed via the UCSC Cell Browser: https://autism.cells.ucsc.edu
(Speir et al., 2021; Velmeshev et al., 2019)

Preprocessing

Models and scripts to reproduce results sections (RS)

RS 2.3: scMEDAL's components reflect disease-associated neuronal patterns in ASD

RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL

Mixed Effects Classifier (MEC)

RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation

Autoencoder Classifier (AEC)
(Note: This link points to the Healthy Heart directory. Please ensure the correct path for ASD models.)
Fixed Effects Subnetwork with Cell Type Classifier (scMEDAL-FEC)
(Note: Similarly, ensure the correct ASD directory is used.)

Comparison scripts

RS 2.3 and 2.7 Clustering scores
RS 2.5 Generate genomaps and compute genomap statistics
RS 2.3 and 2.7 Generate UMAPs

Acute Myeloid Leukemia (AML) dataset

Data source:
The AML dataset is available at the Gene Expression Omnibus (GEO) under accession number GSE116256 (van Galen et al., 2019).

Data splits are available in AML_data.zip, which includes the metadata for the 5 cross-validation splits and the highly variable genes (HVGs) selected for this experiment.

You can either run the 5-fold cross-validation scripts or use the provided cell IDs to generate the splits.

Preprocessing

Models and scripts to reproduce results sections (RS)

RS 2.4: scMEDAL balances the trade-off between batch correction and cell type information preservation in leukemia

RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL

Mixed Effects Classifier (MEC)
- Cell Type Target
- Patient Group Target

RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation

Comparison scripts

RS 2.4 and 2.6 Clustering scores
RS 2.5 Generate genomaps and compute genomap statistics
RS 2.4 and 2.6 Generate UMAPs

Note on variability

Due to the inherent variability in TensorFlow, results may differ slightly each time you train the models. To account for this, we have computed 95% confidence intervals to provide an estimate of the variability in model performance.
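One common way to summarize run-to-run variability is a t-based 95% confidence interval for the mean of per-fold scores. The sketch below shows that calculation. It assumes the interval is taken over the five cross-validation folds and uses Student's t critical values. The source does not state the exact procedure, and the scores are made up, so treat this as an illustration rather than the method used in the paper.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci95(scores):
    """Return (mean, lower, upper) of a t-based 95% CI for the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width


fold_scores = [0.80, 0.82, 0.79, 0.81, 0.83]  # made-up per-fold scores
mean, lower, upper = mean_ci95(fold_scores)
print(f"{mean:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

With only five folds the t critical value (2.776) is noticeably larger than the normal approximation (1.96), so the interval is wider than a z-based one would be.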

References

Litvinukova, M. et al. Cells of the adult human heart. Nature 588, 466–472 (2020).
van Galen, P. et al. Single-Cell RNA-Seq Reveals AML Hierarchies Relevant to Disease Progression and Immunity. Cell 176, 1265–1281.e24 (2019).
Velmeshev, D. et al. Single-cell genomics identifies cell type-specific molecular changes in autism. Science 364, 685–689 (2019).
Speir, M. L. et al. UCSC Cell Browser: visualize your single-cell data. Bioinformatics 37, 4578–4580 (2021).
Yu, X., Xu, X., Zhang, J., & Li, X. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun 14, 960 (2023).
Yu, X., Xu, X., Zhang, J., & Li, X. Batch alignment of single-cell transcriptomics data using deep metric learning. figshare https://doi.org/10.6084/m9.figshare.20499630.v2 (2023).