This document describes how to organize and configure experiments, including how to preprocess data, adjust paths, and update run names for post-analysis tasks (e.g., generating 95% confidence intervals, UMAPs, and Genomaps).
Experiments/
|-- data/
|   |-- <datasetname>_data/
|-- <datasetname>/                # Replace <datasetname> with your chosen experiment name (e.g., HealthyHeart, AML, ASD)
|   |-- paths_config.py
|   |-- preprocessing/
|   |-- run_models/
|       |-- AE/                   # Autoencoder model scripts/results
|       |-- AEC/                  # Autoencoder Classifier
|       |-- scMEDAL-FEC/          # Autoencoder Classifier with Fixed Effects
|       |-- scMEDAL-FE/           # Autoencoder with Fixed Effects
|       |-- scMEDAL-RE/           # Autoencoder with Random Effects
|       |-- compare_results/
|           |-- clustering_scores/   # Scripts and data for clustering evaluation
|           |-- genomaps/            # Scripts and data for genomap generation
|           |-- umap_plots/          # Scripts for UMAP visualization
|-- outputs/
|   |-- <datasetname>_outputs/
Make sure you have downloaded your datasets and set up your data folders in /Experiments/data.
If the required subfolders do not exist, create them before saving the datasets.
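The folder creation step can be scripted. The following is a minimal sketch, assuming the three dataset names used as examples in this document; the function name create_data_dirs is hypothetical and not part of the repository. The demonstration runs in a temporary directory, so point it at your real Experiments/ folder when using it.

```python
from pathlib import Path
import tempfile

# Dataset names taken from the examples in this document; edit to match yours.
DATASETS = ["HealthyHeart", "AML", "ASD"]


def create_data_dirs(experiments_root, datasets=DATASETS):
    """Create data/<name>_data under experiments_root for each dataset name."""
    root = Path(experiments_root)
    created = []
    for name in datasets:
        folder = root / "data" / f"{name}_data"
        # exist_ok=True makes the call safe to repeat on an existing layout
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstration in a throwaway directory (hypothetical stand-in for Experiments/).
with tempfile.TemporaryDirectory() as tmp:
    made = create_data_dirs(tmp)
    made_names = sorted(p.relative_to(tmp).as_posix() for p in made)
    print(made_names)
```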
data/: Holds your datasets. For example:
  data/HealthyHeart_data for the HealthyHeart dataset.
  data/AML_data for the AML dataset.
  data/ASD_data for the ASD dataset.
<datasetname>/: Contains experiment-specific configurations and code:
  paths_config.py: Defines paths to data, outputs, and scenario identifiers.
  preprocessing/: Houses scripts and notebooks for preparing and cleaning your dataset.
  run_models/: Contains scripts for training and evaluating various models.
outputs/: Stores output results. For example:
  outputs/HealthyHeart_outputs for the HealthyHeart results.
  outputs/AML_outputs for the AML results.
  outputs/ASD_outputs for the ASD results.

In some scripts, especially those located in nested directories, you might see lines like:
import sys
sys.path.append("../../")
or
sys.path.append("../../../")
These adjustments ensure that Python can locate shared modules or paths_config.py files located higher in the directory structure. If you change the folder layout or move scripts around, you must adjust these relative paths accordingly. For instance, moving a script from run_models/compare_results/umap_plots/ one level up might allow you to change sys.path.append("../../../") to sys.path.append("../../").
Always verify that the paths align with your current directory structure.
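Relative entries in sys.path are resolved against the current working directory, so a hard-coded "../../" only works when the script is launched from its own folder. One possible alternative, sketched here under that observation, is to search upward from the script's location for paths_config.py. The helper names find_config_dir and add_config_dir_to_sys_path are hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def find_config_dir(start, filename="paths_config.py"):
    """Return the nearest directory at or above `start` that contains `filename`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / filename).is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found in {start} or any parent directory")


def add_config_dir_to_sys_path(start):
    """Append the directory holding paths_config.py to sys.path if it is not there yet."""
    config_dir = str(find_config_dir(start))
    if config_dir not in sys.path:
        sys.path.append(config_dir)
    return config_dir


# Typical use at the top of a nested script, before `import paths_config`:
#     add_config_dir_to_sys_path(Path(__file__).parent)
```

With this approach the number of directory levels no longer has to be edited when a script moves, as long as paths_config.py stays somewhere above it.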
paths_config.py
The paths_config.py file sets the paths to your data and outputs. It also defines scenario identifiers, run names, and other configuration details. Below is an example configuration for the HealthyHeart dataset:
import os
# Get the directory of the current file (paths_config.py)
base_dir = os.path.dirname(os.path.abspath(__file__))
# Define the data base path relative to the current file
data_base_path = os.path.join(base_dir, "../data/HealthyHeart_data")
print("data_base_path:", data_base_path)
# Specify the scenario_id to consistently represent a particular preprocessing setup
scenario_id = "log_transformed_3000hvggenes"
input_base_path = os.path.join(data_base_path, scenario_id, 'splits')
# Define the output paths for saving results
outputs_path = os.path.join(base_dir, "../outputs/HealthyHeart_outputs")
print("outputs_path:", outputs_path)
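Because the example only prints the configured paths, a typo can go unnoticed until a script fails later. A small check like the one below can report which configured directories do not exist yet. This is a sketch, not part of the repository: find_missing_dirs is a hypothetical helper, and the demonstration uses a temporary directory in place of real data and output folders.

```python
import os
import tempfile


def find_missing_dirs(named_paths):
    """Return {name: path} for every entry that is not an existing directory."""
    return {name: path for name, path in named_paths.items() if not os.path.isdir(path)}


# Demonstration with a temporary directory standing in for real folders.
with tempfile.TemporaryDirectory() as tmp:
    missing = find_missing_dirs({
        "data_base_path": tmp,                                 # exists
        "outputs_path": os.path.join(tmp, "not_created_yet"),  # does not exist
    })
    missing_names = sorted(missing)

print(missing_names)
```

In paths_config.py the same function could be called with data_base_path, input_base_path, and outputs_path.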
What to Consider When Modifying Paths:
data_base_path: Points to your main data directory. If you rename or move your data folder, update this line accordingly.
scenario_id: Specifies a particular scenario for your experiments (e.g., using a specific preprocessing or feature selection method). Change this to match your scenario directory structure inside data_base_path.
outputs_path: Points to where your experiment results, model checkpoints, and figures will be saved. Adjust this if you move or rename the outputs folder.

If you add another level to your directory structure or move paths_config.py deeper into subdirectories, you might need to update data_base_path and outputs_path to ensure correct relative paths. For example, if you move paths_config.py one level deeper, you may need to add another .. to the paths.
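The effect of the extra .. can be checked with os.path.normpath. The sketch below assumes a hypothetical root /work/Experiments and a hypothetical config/ subfolder; data_path_from_config is an illustrative helper, not repository code. It shows that one .. from the original location and two .. from the deeper location resolve to the same data folder.

```python
import os


def data_path_from_config(config_dir, levels_up, dataset_folder="HealthyHeart_data"):
    """Build the data path from the config file's directory, climbing `levels_up` levels first."""
    ups = [os.pardir] * levels_up
    return os.path.normpath(os.path.join(config_dir, *ups, "data", dataset_folder))


# Original layout: paths_config.py sits in Experiments/HealthyHeart/
original = data_path_from_config("/work/Experiments/HealthyHeart", levels_up=1)

# Hypothetical move: paths_config.py now sits in Experiments/HealthyHeart/config/
moved = data_path_from_config("/work/Experiments/HealthyHeart/config", levels_up=2)

print(original)
print(moved)
```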
Once you have run your models, each run is often associated with a unique timestamped name. To generate tables with 95% confidence intervals, UMAPs, or Genomaps, you need to identify the run name and update the expt section of paths_config.py.
For example:
expt = "expt_test"

if expt == "expt_test":
    scaling = "min_max"
    # Unique run names with timestamps should be provided here
    run_names_dict = {
        "scMEDAL-RE": "scMEDAL-RE_run_name",
        "run_name_all": "DefineGeneralname4yourexpt"
    }
    # Set True if you plan to calculate clustering scores
    calculate_clustering_scores = True
    # If calculating clustering scores, add other models
    if calculate_clustering_scores:
        run_names_dict.update({
            "AE": "AE_run_name",
            "AEC": "AEC_run_name",
            "scMEDAL-FEC": "scMEDAL-FEC_run_name",
            "scMEDAL-FE": "scMEDAL-FE_run_name"
        })
Steps to Update:
Set the Experiment Identifier (expt): Choose a descriptive name for your experiment, e.g., "expt_healthyheart_v1", and use the same name in the matching if expt == ... check so that the configuration block still runs.
Run Names: Replace the placeholder run names (e.g., "scMEDAL-RE_run_name") with the actual run names generated during model training. These run names are typically created automatically by your training scripts and often include a timestamp.
Add or Remove Models: If you run additional models (e.g., scMEDAL-FE or AEC), add their unique run names. If you choose not to calculate clustering scores, set calculate_clustering_scores = False and remove or omit the extra model entries.
Regenerating Results: After updating run_names_dict with the correct run names, rerun your result generation scripts (e.g., the scripts in compare_results/clustering_scores/, umap_plots/, or genomaps/). They will use the updated run names to locate and process the correct results.
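Before rerunning the result scripts, it can help to confirm that no placeholder from the example is left in run_names_dict. The sketch below assumes the placeholder convention shown above ("<model>_run_name" and "DefineGeneralname4yourexpt"); find_placeholder_run_names is a hypothetical helper, and the timestamped name in the demonstration is invented for illustration only.

```python
def find_placeholder_run_names(run_names_dict,
                               default_general_name="DefineGeneralname4yourexpt"):
    """Return the keys whose values still look like the placeholders from the example."""
    placeholders = []
    for model, run_name in run_names_dict.items():
        if run_name == f"{model}_run_name" or run_name == default_general_name:
            placeholders.append(model)
    return placeholders


example = {
    "scMEDAL-RE": "scMEDAL-RE_run_name",           # still a placeholder
    "AE": "AE_2024-01-01_12-00-00",                # hypothetical timestamped name
    "run_name_all": "DefineGeneralname4yourexpt",  # still a placeholder
}
unfilled = find_placeholder_run_names(example)
print(unfilled)
```

An empty list means every entry has been replaced with a real run name.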
Use sys.path.append to import modules. If you move scripts around, adjust the relative paths.
Update paths_config.py whenever you change data or output directories. This ensures all scripts know where to find input data and save outputs.
Update run_names_dict in paths_config.py with the correct unique run names. This is essential for generating summary tables, visualizations, and analysis plots.