This guide provides an overview of the folder structure, data organization, and the steps to run your experiments, including data preprocessing, creating splits for cross-validation, and training models.
Experiments/
|-- data/
|   |-- <datasetname>_data/
|-- <datasetname>/                              # e.g., HealthyHeart, AML, ASD
|   |-- paths_config.py
|   |-- preprocessing/
|   |   |-- 5fold_cross_val/
|   |   |   |-- config_split_paths.py           # Configures input/output paths for splits
|   |   |   |-- create_splits.ipynb             # Splits the data into train/val/test
|   |   |   |-- check_splits.ipynb              # Checks for data leakage in splits
|   |   |-- preprocess_<datasetname>.py         # Main preprocessing script (Python)
|   |   |-- batch_preprocess_<datasetname>.sh   # Optional SLURM script for batch preprocessing
|   |   |-- preprocess_<datasetname>.ipynb      # Optional preprocessing notebook (Jupyter)
|-- run_models/
|   |-- <modelname>/                            # Model-specific scripts (e.g., AE, AEC, etc.)
|   |   |-- model_config.py                     # Model hyperparameters & output settings
|   |   |-- run_<modelname>_allfolds.py         # Runs the model pipeline for all folds
|   |   |-- sbatch_run_<modelname>.py           # SLURM script to run the model
|-- outputs/
|   |-- <datasetname>_outputs/
Experiments/
|-- run_models/
|   |-- <MEC>/                                  # Mixed Effects Classifier
|   |   |-- <target_type>_target/               # Target type for the model
|   |   |   |-- <latent_space_combo>_latent/    # Combination of latent spaces
|   |   |   |   |-- model_config.py             # Model hyperparameters & output settings
|   |   |   |   |-- run_<modelname>_allfolds.py # Runs the model pipeline for all folds
|   |   |   |   |-- sbatch_run_<modelname>.py   # SLURM script to run the model
where <target_type> is one of celltype or dx, and <latent_space_combo> is one of scMEDAL-FE, scMEDAL-FEandscMEDAL-RE, or PCA.
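The layout above implies six MEC run directories (two target types times three latent-space combinations). As a minimal sketch, assuming the directory names follow the pattern shown in this guide (the real repository may lay them out differently), they can be enumerated like this:

```python
from itertools import product
from pathlib import PurePosixPath

# Names taken from the tree above; the base path is an assumption.
target_types = ["celltype", "dx"]
latent_combos = ["scMEDAL-FE", "scMEDAL-FEandscMEDAL-RE", "PCA"]

mec_dirs = [
    PurePosixPath("Experiments/run_models/MEC") / f"{t}_target" / f"{c}_latent"
    for t, c in product(target_types, latent_combos)
]
```

Each of these directories holds its own `model_config.py`, so every target/latent combination is configured and run independently.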
The count matrices for your datasets are stored as three files: exprMatrix.npy, geneids.csv, and meta.csv. Within data/<datasetname>/, your data might be structured like this:
data/
|-- <datasetname>/
|   |-- <countmatrixname>/            # Count matrix before preprocessing
|   |   |-- exprMatrix.npy or exprMatrix.tsv
|   |   |-- geneids.csv or geneids.tsv
|   |   |-- meta.csv or meta.tsv
|   |-- <scenario_id>/                # Output of preprocessing scripts or notebooks
|   |   |-- exprMatrix.npy
|   |   |-- geneids.csv
|   |   |-- meta.csv
|   |   |-- splits/                   # Created by create_splits.ipynb
|   |   |   |-- split_1/
|   |   |   |   |-- test/             # Each folder contains the count matrices for that split
|   |   |   |   |-- train/
|   |   |   |   |-- val/
|   |   |   |-- split_2/
|   |   |   |-- split_3/
|   |   |   |-- split_4/
|   |   |   |-- split_5/
Each split_X/ directory contains train, val, and test subsets of the data.
See instructions in How2SetupYourExpt.
Option A (Notebook): Run the preprocess_<datasetname>.ipynb notebook (suitable for datasets like HealthyHeart).
Option B (Python Script):
For other datasets (e.g., ASD, AML), use the Python script:
python preprocess_<datasetname>.py
Option C (SLURM):
If you have a SLURM environment, run the batch preprocessing script:
sbatch batch_preprocess_<datasetname>.sh
Environment: Use the preprocess_and_plot_umaps_env environment for preprocessing.
Note: If you use SLURM, make sure to update the environment name in the SLURM script.
Run create_splits.ipynb to generate the 5-fold splits, then run check_splits.ipynb to verify that there is no data leakage.
Environment: Use the run_models_env environment for this step.
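The leakage check can be pictured as verifying that the train, val, and test ID sets of every fold are pairwise disjoint. The function below is a sketch of that idea, not the actual contents of check_splits.ipynb; the notebook may key leakage on donors, batches, or individual cells.

```python
def check_no_leakage(folds):
    """Verify that train/val/test ID sets are pairwise disjoint in every fold.

    `folds` maps a split name to {"train": set, "val": set, "test": set}.
    Raises ValueError listing any ID found in more than one subset.
    """
    for name, subsets in folds.items():
        train, val, test = subsets["train"], subsets["val"], subsets["test"]
        leaked = (train & val) | (train & test) | (val & test)
        if leaked:
            raise ValueError(f"{name}: IDs in more than one subset: {sorted(leaked)}")
    return True
```

Running a check like this on all five splits before training guards against optimistic performance estimates caused by the same sample appearing in both training and evaluation data.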
Each model has its own directory under run_models/. For example, <modelname>/ might contain:
model_config.py: Configures model hyperparameters, output paths, plotting parameters, and other settings. It also generates a unique, timestamped run_name needed for analyzing outputs.
Example: Healthy Heart AE Model Configuration
Note: Adjust the epochs parameter in the dictionary:
train_model_dict = {
"epochs": 2, # For testing; for full experiments, use a larger value (e.g., 500)
# "epochs": 500, # Number of training epochs used in our experiments
}
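The timestamped run_name that model_config.py generates can be illustrated with a small sketch. The function name and naming pattern below are assumptions for illustration; the real config may format its run names differently.

```python
from datetime import datetime

def make_run_name(model_name, dataset_name, now=None):
    """Hypothetical sketch of building a unique, timestamped run name."""
    now = now or datetime.now()  # allow injecting a fixed time for testing
    return f"{model_name}_{dataset_name}_{now:%Y%m%d_%H%M%S}"
```

Because the timestamp is part of the name, every run writes to its own output location, and the same run_name is what you pass to the downstream analysis scripts.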
run_<modelname>_allfolds.py: Executes the entire training pipeline for all 5 folds.
sbatch_run_<modelname>.py: SLURM script to run the model training on a cluster.
Customization Options:
Edit model_config.py to change hyperparameters (e.g., layer units, latent dimensions, epochs, how the model is loaded), and update run_<modelname>_allfolds.py accordingly.
Once configured, run:
python run_<modelname>_allfolds.py
or submit via SLURM:
sbatch sbatch_run_<modelname>.py
The AE, AEC, scMEDAL-FE, scMEDAL-FEC, or scMEDAL-RE models can be run independently.
The PCA model can be run simultaneously with another model; just set "get_pca": True in config.py:
get_scores_dict = {
"get_pca": True
}
The MEC model requires as its latent_space inputs the outputs from the AE, AEC, scMEDAL-FE, scMEDAL-FEC, or scMEDAL-RE models; it cannot run without them.
Environment Note: Run the models using the run_models_env environment.
See more details about the outputs in ExperimentOutputs.
Summary:
Once the model run is complete, you'll have timestamped run names and corresponding outputs ready for downstream analysis (such as generating confidence intervals, UMAP visualizations, and Genomaps). See more instructions in How2AnalyzeYourModelOutputs.