We recommend using the Acute Myeloid Leukemia (AML) dataset as a demo for running our model, as it is the smallest dataset used in our study. To adjust the number of training epochs, modify the model_config.py file. In our experiments, we used a maximum of 500 epochs with early stopping, so training may halt before reaching 500 epochs.
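To illustrate how an epoch cap and early stopping interact, here is a minimal, framework-free sketch. The names `train_with_early_stopping`, `max_epochs`, and `patience` are hypothetical and are not taken from model_config.py; the actual settings and stopping criterion are the ones defined in the repository.

```python
def train_with_early_stopping(val_losses, max_epochs=500, patience=3):
    """Simulate a training loop and return (epochs_run, best_loss).

    val_losses: per-epoch validation losses (simulated values here).
    max_epochs: hard cap on the number of epochs (hypothetical name).
    patience:   epochs without improvement tolerated before stopping.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break  # epoch cap reached
        epochs_run = epoch
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping triggered
    return epochs_run, best


# Simulated validation losses: the loss stops improving after epoch 3.
simulated = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
epochs_run, best_loss = train_with_early_stopping(simulated, max_epochs=500, patience=3)
```

With a small patience, the loop ends long before the 500-epoch cap, which is why reported runtimes can be shorter than the configured number of epochs suggests.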
Follow the instructions to download the AML dataset, set up the paths of the experiment, preprocess the data and set up the splits.
Once the 5-fold splits are set up, the estimated runtime for all folds is:
These benchmarks were obtained using an Nvidia Tesla P4 GPU (8 GB memory).
To run scMEDAL-FE, execute the following commands:
```bash
# Navigate to the model directory
cd /Experiments/AML/run_models/scMEDAL-FE
# Activate the environment
source activate /path/to/run_models_env
# Run the model for all folds
python run_scMEDAL-FE_allfolds.py
```
For more detailed instructions, refer to:
This guide provides instructions for reproducing the experiments described in our paper. It maps each section of the paper to the corresponding code and datasets, ensuring that you can run the models and scripts as described.
We conducted experiments on three datasets: Healthy Heart, Autism Spectrum Disorder (ASD), and Acute Myeloid Leukemia (AML). For each dataset, we provide:
A model_config.py file for each model, along with a list of variables and hyperparameters needed to reproduce our experiments. You can review these details here.

Note: Due to variability in TensorFlow, model outputs may differ slightly across runs. To account for this, we report 95% confidence intervals (CI) as an estimate of variability.
Data source:
The Healthy Heart dataset is available from Yu et al. (2023) at figshare.
RS 2.2: scMEDAL subnetworks create complementary batch-invariant and batch-specific latent spaces in the Healthy Heart dataset
RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL
RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation
For details on setting input and output paths for the Healthy Heart dataset, please refer to the [path setup instructions].
Data source:
The ASD dataset can be accessed via the UCSC Cell Browser: https://autism.cells.ucsc.edu (Speir et al., 2021; Velmeshev et al., 2019).
RS 2.3: scMEDAL's components reflect disease-associated neuronal patterns in ASD
RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL
RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation
Data source:
The AML dataset is available at the Gene Expression Omnibus (GEO) under accession number GSE116256 (van Galen et al., 2019).
Data splits are available in AML_data.zip, which includes the metadata for the 5 cross-validation splits and the highly variable genes (HVGs) selected for this experiment.
You can either run the 5-fold cross-validation scripts to create the splits yourself or use the provided cell IDs to regenerate the same splits.
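As a rough sketch of the second option, the snippet below maps cell IDs assigned to each split onto row positions in an expression matrix. The layout of the metadata in AML_data.zip is not described here, so the `fold_1` dictionary, the example cell IDs, and the helper `indices_for_split` are all hypothetical placeholders.

```python
def indices_for_split(all_cell_ids, split_cell_ids):
    """Return row positions of split_cell_ids within all_cell_ids.

    all_cell_ids:   cell IDs in the same order as the rows of the data matrix.
    split_cell_ids: cell IDs assigned to one split (train, val, or test).
    """
    position = {cell_id: i for i, cell_id in enumerate(all_cell_ids)}
    missing = [c for c in split_cell_ids if c not in position]
    if missing:
        raise KeyError(f"{len(missing)} cell IDs not found, e.g. {missing[0]!r}")
    return [position[c] for c in split_cell_ids]


# Hypothetical example: five cells and one fold's train/val/test assignment.
all_cell_ids = ["cell_a", "cell_b", "cell_c", "cell_d", "cell_e"]
fold_1 = {
    "train": ["cell_a", "cell_c", "cell_e"],
    "val": ["cell_b"],
    "test": ["cell_d"],
}
fold_1_indices = {
    name: indices_for_split(all_cell_ids, ids) for name, ids in fold_1.items()
}
```

Raising an error on unknown IDs helps catch a mismatch between the downloaded data and the provided split metadata early, before any model is trained.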
RS 2.4: scMEDAL balances the trade-off between batch correction and cell type information preservation in leukemia
RS 2.6: Improved cell classification accuracy using complementary latent spaces of scMEDAL
RS 2.7: The AE classifier, scMEDAL-FEC, enhances cell type preservation
Due to the inherent variability in TensorFlow, results may differ slightly each time you train the models. To account for this, we have computed 95% confidence intervals to provide an estimate of the variability in model performance.
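For readers who want to compute a comparable interval on their own runs, here is a minimal sketch of a 95% confidence interval for the mean of a metric across folds. It assumes a Student's t interval over per-fold scores; this guide does not state which CI method was used in the paper, and the scores below are made-up values for illustration only.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci95(values):
    """Return (mean, lower, upper) of a t-based 95% CI for the mean."""
    n = len(values)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no tabulated critical value for df={df}")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[df] * sem
    return mean, mean - half_width, mean + half_width


# Made-up per-fold scores for a 5-fold run (not results from the paper).
fold_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
mean, lower, upper = mean_ci95(fold_scores)
```

If your own interval overlaps the one reported for the same model and dataset, the difference is plausibly explained by run-to-run variability.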