# **scMEDAL: Mixed Effects Deep Autoencoder Learning Framework**

### Reproducing Our Experiments

For details on reproducing our experiments, see **[Experiment Reproducibility Guide](./docs/ExperimentsReproducibility.md#experiment-reproducibility-guide)**.

We recommend using the **Acute Myeloid Leukemia** dataset as a **DEMO** to run our model, as it is the smallest dataset used in our paper. See **[Running the AML Demo](./docs/ExperimentsReproducibility.md#running-the-aml-demo)**.

Our documentation includes:

- **Instructions on setting up experiments:** See **[How to Set Up Your Experiment](./docs/How2SetupYourExpt.md)**.
- **Guidance on analyzing and interpreting model outputs:** See **[How to Analyze Your Model Outputs](./docs/How2AnalyzeYourModelOutputs.md)**.
- **Step-by-step instructions for running experiments:** See **[How to Run Your Experiment](./docs/How2RunYourExpt.md)**.
- **Details on experiment outputs:** See **[Experiment Outputs](./docs/ExperimentOutputs.md)**.

---

## **Overview**

The single-cell Mixed Effects Deep Autoencoder Learning (**scMEDAL**) framework provides a robust approach for analyzing single-cell RNA sequencing (scRNA-seq) data. By disentangling batch-invariant from batch-specific signals, scMEDAL offers a more interpretable representation of complex datasets.

![scMEDAL Diagram](./docs/images/scMEDAL.png)

---

## **1. Framework Overview**

### **Fixed Effects Subnetwork (scMEDAL-FE)**

- Captures features that remain consistent across batches.
- Uses adversarial learning to minimize batch label predictability, ensuring batch-invariant latent representations.

### **Random Effects Subnetwork (scMEDAL-RE)**

- Models batch-specific variability using variational inference.
- Regularizes the latent space to accurately represent batch-specific patterns without overfitting.

---

## **2. scMEDAL Setup and Installation**

General structure of the repository:

```markdown
scMEDAL_for_scRNAseq/
|-- Experiments/            # Scripts and notebooks for experiments
|-- scMEDAL/                # Main package
|   |-- __init__.py
|   |-- models/             # Model definitions
|   |   |-- __init__.py
|   |   |-- scMEDAL/
|   |   |-- models/
|   |-- utils/              # Utilities for preprocessing, training, etc.
|   |   |-- __init__.py
|   |-- scMEDAL_env/        # Environment YAML files
|-- setup.py                # Package setup
```

### **Installing `scMEDAL`**

1. **Clone the repository**

2. **Set up and activate your environment**

   ```bash
   conda activate your_env_name
   ```

3. **Install in editable mode**

   Navigate to the `scMEDAL_for_scRNAseq` directory and install:

   ```bash
   cd /path/to/scMEDAL_for_scRNAseq
   pip install -e .
   ```

4. **Verify installation**

   ```python
   # Raises ImportError if the package is not installed in the active environment.
   # You can also import a specific helper: from scMEDAL.utils import <your_function>
   import scMEDAL
   print("scMEDAL is ready to use!")
   ```

Installation takes roughly 30 minutes.

---

## **3. Execution Environments**

To handle dependency conflicts, `scMEDAL` uses three separate Conda environments:

1. **`genomaps_env`**: For generating Genomaps.
2. **`preprocess_and_plot_umaps_env`**: For data preprocessing and UMAP visualization.
3. **`run_models_env`**: For data splitting and running models.

### **Setting Up the Environments**

1. Navigate to the `scMEDAL_env` directory:

   ```bash
   cd /path/to/scMEDAL_for_scRNAseq/scMEDAL_env
   ```

2. Create each environment:

   ```bash
   conda env create -f genomaps_env.yaml
   conda env create -f preprocess_and_plot_umaps_env.yaml
   conda env create -f run_models_env.yaml
   ```

3. Activate the desired environment:

   ```bash
   conda activate genomaps_env
   ```

   or

   ```bash
   conda activate preprocess_and_plot_umaps_env
   ```

   or

   ```bash
   conda activate run_models_env
   ```

### **Switching Environments**

- **Match the Environment to the Task**
  Use the Conda environment that corresponds to the specific script or task you need to run.
- **Install Required Packages**
  Make sure that every environment you use has the `scMEDAL` package installed (see **Installing `scMEDAL`** in Section 2 for instructions).
- **Configure Your Slurm Scripts**
  When submitting jobs via Slurm, load the appropriate Conda environment before executing the script. For example:

  ```bash
  # For running models
  source activate /path/to/run_models_env

  # For preprocessing and plotting UMAPs
  source activate /path/to/preprocess_and_plot_umaps_env

  # For generating genomaps
  source activate /path/to/genomaps_env
  ```

Following these steps ensures that each script runs in the correct environment, with the necessary dependencies in place.

## **4. scMEDAL Utilities and Modules**

### **Utilities**

- **[utils.py](./scMEDAL/utils/utils.py):** Provides data I/O, plotting, and clustering score functions.
- **[model_train_utils.py](./scMEDAL/utils/model_train_utils.py):** Functions for training and loading models.
- **[splitter.py](./scMEDAL/utils/splitter.py):** Utility for k-fold cross-validation splitting.
- **[callbacks.py](./scMEDAL/utils/callbacks.py):** Tracks clustering metrics during training.
- **[compare_results_utils.py](./scMEDAL/utils/compare_results_utils.py):** Combines clustering results from multiple models.
- **[genomaps_utils.py](./scMEDAL/utils/genomaps_utils.py):** Custom Genomap generation functions.
- **[preprocessing_utils.py](./scMEDAL/utils/preprocessing_utils.py):** Preprocessing routines for datasets.
- **[utils_load_model.py](./scMEDAL/utils/utils_load_model.py):** Utilities for loading trained models.

### **Models**

- **[scMEDAL.py](./scMEDAL/models/scMEDAL.py):** Implements AEC, DA_AE, and DomainEnhancingAutoencoderClassifier models.
- **[random_effects.py](./scMEDAL/models/random_effects.py):** Bayesian layers and utilities for random effects modeling.

---

## **5. Experiment Setup**

This setup allows you to run our models on the Healthy Heart, ASD, and AML datasets.

**Experiment Folder Structure**: Each dataset-specific experiment follows a standard directory layout:

```markdown
scMEDAL_for_scRNAseq/
|-- Experiments/
    |-- data/                   # Download and set up your data folders
    |-- outputs
    |-- datasetname/
        |-- preprocessing/
        |   |-- 5fold_cross_val/
        |   |   |-- create_splits.ipynb
        |   |   |-- check_splits.ipynb
        |   |   |-- config_split_paths.py
        |   |-- preprocess_datasetname.py
        |   |-- batch_preprocess_dataset.sh
        |   |-- preprocess_datasetname.ipynb
        |-- run_models/
        |   |-- AE/
        |   |-- AEC/
        |   |-- scMEDAL-FEC/
        |   |-- scMEDAL-FE/
        |   |-- scMEDAL-RE/
        |   |-- compare_results/
        |   |   |-- clustering_scores/
        |   |   |-- genomaps/
        |   |   |-- umap_plots/
        |   |-- MEC/
        |   |   |-- target/
        |   |   |-- scMEDAL-FEandscMEDAL-RE_latent/
        |   |   |-- scMEDAL-FE/
        |   |   |-- PCA_latent/
        |-- paths_config.py
```

- **`data/`**
  - *(Download and set up your data folders here.)*
- **`outputs/`**
  - *(This folder is created automatically when `outputs_path` is imported from `paths_config.py`.)*
- **`datasetname/`**
  - Folder with scripts to preprocess the data and run the models.

For instructions on setting up experiments, see **[How2SetupYourExpt](./docs/How2SetupYourExpt.md)**.

### **Model Configuration**

Each model directory contains a `model_config.py` file that specifies settings and paths. For example:

- [Healthy Heart AE Model Configuration](./Experiments/HealthyHeart/run_models/AE/model_config.py)

**Note:** You can change the number of training epochs by modifying the `epochs` parameter in the dictionary:

```python
train_model_dict = {
    "epochs": 2,  # For testing; for full experiments, use a larger value (e.g., 500)
    # "epochs": 500,  # Number of training epochs used in our experiments
}
```

---

## **6. Dataset-Specific Instructions**

To set up the datasets for your experiments, follow these steps:

1. **Download the datasets** from the provided sources.
2. **Save them in the appropriate directories** under the main folder: **`/Experiments/data`**.
   - If the required subfolders do not exist, create them before saving the datasets.

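
The folder-creation step can be scripted. The following is a minimal sketch, not part of the repository: `ensure_data_folders` is a hypothetical helper, and it assumes that `/Experiments/data` is relative to the root of your `scMEDAL_for_scRNAseq` checkout. The folder names are the ones given for the three datasets in this section.

```python
from pathlib import Path

# Folder names as listed for each dataset in this README.
DATASET_FOLDERS = ["HealthyHeart_data", "ASD_data", "AML_data"]


def ensure_data_folders(repo_root):
    """Create Experiments/data/<folder> under repo_root if missing.

    Hypothetical helper (not shipped with scMEDAL). Returns the list of
    dataset folder paths, whether they were just created or already existed.
    """
    data_dir = Path(repo_root) / "Experiments" / "data"
    folders = []
    for name in DATASET_FOLDERS:
        folder = data_dir / name
        # parents=True also creates Experiments/data itself when needed;
        # exist_ok=True makes repeated calls harmless.
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders
```

For example, calling `ensure_data_folders("/path/to/scMEDAL_for_scRNAseq")` before downloading leaves any existing folders and their contents untouched.
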
### **Datasets and Sources**

- **Healthy Human Heart**
  - Source: [Figshare from Yu et al. (2023)](https://figshare.com/articles/dataset/Batch_Alignment_of_single-cell_transcriptomics_data_using_Deep_Metric_Learning/20499630/2)
  - Save the dataset in: `/Experiments/data/HealthyHeart_data`
  - *Note: Create the folder `HealthyHeart_data` if it does not already exist.*
- **Autism Spectrum Disorder (ASD)**
  - Source: [Autism Cell Atlas (Speir et al., 2021; Velmeshev et al., 2019)](https://autism.cells.ucsc.edu)
  - Save the dataset in: `/Experiments/data/ASD_data`
  - *Note: Create the folder `ASD_data` if it does not already exist.*
- **Acute Myeloid Leukemia (AML)**
  - Source: [GEO: GSE116256](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256)
  - Save the dataset in: `/Experiments/data/AML_data`
  - *Note: Create the folder `AML_data` if it does not already exist.*

---

## **7. Running Models and Experiments**

You can run AE, AEC, scMEDAL-FE, scMEDAL-FEC, or scMEDAL-RE independently. PCA can be generated simultaneously by setting `"get_pca": True` in `config.py`. The MEC model requires latent outputs from one of the above models; it cannot run independently.

### **Steps to Run Models**

1. **Run All Folds Locally:**

   ```bash
   python run_modelname_allfolds.py
   ```

2. **Submit Jobs via Slurm:**

   ```bash
   sbatch sbatch_run_modelname.sh
   ```

For detailed instructions, see **[How2RunYourExpt](./docs/How2RunYourExpt.md)**.

### **Important Notes**

- Always activate the correct Conda environment before running scripts.

---

## **8. Experiment Outputs**

For more information about output files and their contents, refer to [ExperimentOutputs](./docs/ExperimentOutputs.md).

---

## **9. Analyzing Your Model Outputs**

For guidance on analyzing and interpreting model outputs, see [How2AnalyzeYourModelOutputs](./docs/How2AnalyzeYourModelOutputs.md).

---

## **10. References**

- Litvinukova, M. et al. *Cells of the adult human heart.* Nature 588, 466–472 (2020).
- van Galen, P. et al. *Single-Cell RNA-Seq Reveals AML Hierarchies Relevant to Disease Progression and Immunity.* Cell 176, 1265–1281.e24 (2019).
- Velmeshev, D. et al. *Single-cell genomics identifies cell type-specific molecular changes in autism.* Science 364, 685–689 (2019).
- Speir, M. L. et al. *UCSC Cell Browser: visualize your single-cell data.* Bioinformatics 37, 4578–4580 (2021).
- Yu, X., Xu, X., Zhang, J., & Li, X. *Batch alignment of single-cell transcriptomics data using deep metric learning.* Nat Commun 14, 960 (2023).
- Yu, X., Xu, X., Zhang, J., & Li, X. *Batch alignment of single-cell transcriptomics data using deep metric learning.* figshare [https://doi.org/10.6084/m9.figshare.20499630.v2](https://doi.org/10.6084/m9.figshare.20499630.v2) (2023).