
Large-scale experiments and ENN baselines

This subtree contains the GPU-ready code used for the large-scale QM9 evaluation of the ★_G framework against equivariant neural network baselines:

  • Full QM9 (~134k molecules) for HOMO-LUMO gap and ZPVE
  • Molecular tensor prediction: dipole vector µ (rank-1) and polarizability α (isotropic scalar and full rank-2 tensor)
  • ENN baselines: SchNet (invariant), e3nn-based SE(3)-equivariant model, MACE (current SOTA on QM9)
  • Matched protocols: identical train/val/test splits, identical seeds, identical evaluation metrics across all methods
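One way to meet the matched-protocol requirement is to derive the split from the seed alone, so that every method sees the same indices. The following is a minimal sketch, not the repository's loader: make_split and the split sizes are hypothetical, and only the approximate molecule count comes from the list above.

```python
import numpy as np

def make_split(n_molecules, seed, n_train, n_val):
    """Return (train, val, test) index arrays determined only by `seed`."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_molecules)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]
    return train, val, test

# Hypothetical sizes, chosen for illustration only.
N = 134_000
splits_a = make_split(N, seed=0, n_train=110_000, n_val=10_000)
splits_b = make_split(N, seed=0, n_train=110_000, n_val=10_000)
```

Calling make_split with the same seed from each training script yields identical index arrays, so no split file has to be shared between methods.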

The MATLAB code in the rest of the repository remains the reference implementation. The PyTorch code here is a faithful re-implementation that scales to GPU and integrates with the e3nn / mace-torch baselines.

Layout

large_scale/
├── starg_torch/                 PyTorch ★_G algebra
│   ├── algebra.py               Group + cached F_G + irrep tables (cyclic, dihedral, octahedral, products)
│   ├── product.py               Batched ★_G product (torch.fft for cyclic, einsum for general)
│   ├── svd.py                   Batched ★_G-SVD on GPU
│   ├── features.py              torch port of Algorithm 2 (extractStarGFeatures)
│   ├── neural.py                Neural ★_G (torch.nn.Module)
│   └── octahedral.py            24-element octahedral group + 5 irreps
├── data/
│   ├── qm9.py                   Full QM9 loader (PyG-compatible)
│   └── featurizers.py           molecule → tensor mapping for each group
├── targets/
│   ├── scalar.py                HOMO-LUMO gap, ZPVE, etc.
│   ├── vector.py                Dipole vector µ (rank-1 target)
│   └── tensor.py                Polarizability α (rank-2 target)
├── train_starg.py               unified ★_G entry point (ridge | neural)
├── train_baseline_mlp.py        Standard / Invariant / Augmented MLP
├── train_baseline_schnet.py     SchNet (invariant ENN baseline)
├── train_baseline_e3nn.py       e3nn-based equivariant baseline
├── train_baseline_mace.py       MACE (current SOTA)
├── eval_collect.py              merges per-method JSON results into a table
└── bsub/                        IBM CCC LSF submission files
    ├── submit_starg_ridge.bsub      array job 1..18  (target × seed)
    ├── submit_starg_neural.bsub     array job 1..18
    ├── submit_mlp.bsub              array job 1..54  (mode × target × seed)
    ├── submit_schnet.bsub           array job 1..12  (scalars × seed)
    ├── submit_e3nn.bsub             array job 1..18
    ├── submit_mace.bsub             array job 1..18
    └── submit_all.sh                bsubs every .bsub above
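The layout notes that product.py uses an FFT path for cyclic groups. As a rough illustration of why that works, the sketch below assumes the cyclic-group product is a matrix-valued circular convolution over Z_n, which the DFT along the group axis turns into independent matrix products. The function names are hypothetical, and NumPy stands in for torch.fft so the example runs without a GPU. It is not the repository's implementation.

```python
import numpy as np

def cyclic_product_fft(A, B):
    """A: (n, m, k), B: (n, k, p); axis 0 indexes the cyclic group Z_n."""
    A_hat = np.fft.fft(A, axis=0)
    B_hat = np.fft.fft(B, axis=0)
    # One ordinary matrix product per Fourier component.
    C_hat = np.einsum("gik,gkj->gij", A_hat, B_hat)
    return np.fft.ifft(C_hat, axis=0).real

def cyclic_product_direct(A, B):
    """Reference: C[g] = sum_h A[h] @ B[(g - h) mod n]."""
    n = A.shape[0]
    C = np.zeros((n, A.shape[1], B.shape[2]))
    for g in range(n):
        for h in range(n):
            C[g] += A[h] @ B[(g - h) % n]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3, 4))
B = rng.normal(size=(6, 4, 2))
max_diff = np.abs(cyclic_product_fft(A, B) - cyclic_product_direct(A, B)).max()
```

The FFT route costs O(n log n) transforms plus n small matrix products, against n² products for the direct sum. That difference is presumably the reason for a dedicated cyclic path.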

Reproducing the revised experiments on IBM CCC

Setup (one-time)

The CCC compute nodes do not have conda, so the environment is built with module load plus pip install --user. Each .bsub script does this on first run, so there is no manual environment step beyond pushing the code:

# On the CCC login node, just push the code (see "Sending files" below)
# and submit. Dependencies install on the compute node from requirements.txt.

Launch all jobs

cd ~/starg/python/large_scale
bash bsub/submit_all.sh

This issues six bsub calls, one per array job, totaling 138 slots (see bsub/submit_all.sh for the per-method counts). Single-method launches:

bsub < bsub/submit_starg_ridge.bsub      # ★_G-SVD + Ridge only
bsub < bsub/submit_mace.bsub             # MACE only
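The 138-slot total can be checked against the array ranges listed in the layout section. This is only a sanity-check sketch; the dictionary simply transcribes those ranges.

```python
# Array sizes transcribed from the bsub/ listing in the layout section.
array_sizes = {
    "submit_starg_ridge.bsub": 18,
    "submit_starg_neural.bsub": 18,
    "submit_mlp.bsub": 54,
    "submit_schnet.bsub": 12,
    "submit_e3nn.bsub": 18,
    "submit_mace.bsub": 18,
}

total_slots = sum(array_sizes.values())
print(len(array_sizes), total_slots)  # prints: 6 138
```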

To re-run a single array index (e.g. seed 1, target gap of MACE, which is array index 4):

bsub -J "starg_mace[4]" < bsub/submit_mace.bsub

submit_all.sh launches one job per (method, target, seed) combination. Each job writes a JSON result file to results/<method>/<target>/seed<k>.json. After all jobs finish, run python eval_collect.py to assemble the revised result tables.
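To make the results layout concrete, here is a hedged sketch of how per-seed files could be merged into per-(method, target) summaries. It is not eval_collect.py itself. The function names and the "test_mae" key are assumptions, since the JSON schema is not documented here.

```python
import json
import statistics
from pathlib import Path

def collect_results(root, metric="test_mae"):
    """Group seed-level values of `metric` by (method, target).

    Assumes files at <root>/<method>/<target>/seed<k>.json, each holding
    a JSON object with a numeric `metric` entry (hypothetical key name).
    """
    grouped = {}
    for path in sorted(Path(root).glob("*/*/seed*.json")):
        method, target = path.parts[-3], path.parts[-2]
        with path.open() as fh:
            value = json.load(fh)[metric]
        grouped.setdefault((method, target), []).append(value)
    return grouped

def summarize(grouped):
    """Return rows of (method, target, n_seeds, mean, std)."""
    rows = []
    for (method, target), values in sorted(grouped.items()):
        mean = statistics.mean(values)
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        rows.append((method, target, len(values), mean, std))
    return rows
```

Calling summarize(collect_results("results")) from python/large_scale would then give one row per method and target. The n_seeds column makes missing or failed array indices easy to spot.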

Methods covered

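| Method             | Target: HOMO-LUMO | Target: µ vector    | Target: α scalar | Target: α tensor  |
|--------------------|-------------------|---------------------|------------------|-------------------|
| ★_G-SVD + Ridge    | yes               | yes (per-component) | yes              | yes (full tensor) |
| Neural ★_G         | yes               | yes                 | yes              | yes               |
| Standard MLP       | yes               | yes                 | yes              | yes               |
| Invariant MLP      | yes               | yes                 | yes              | yes               |
| Augmented MLP      | yes               | yes                 | yes              | yes               |
| SchNet             | yes               | n/a (invariant)     | yes              | n/a               |
| e3nn (SE(3)-equiv) | yes               | yes                 | yes              | yes               |
| MACE               | yes               | yes                 | yes              | yes               |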
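The n/a entries for SchNet follow from how the targets transform under rotation. The toy check below illustrates this with made-up positions and point charges. Interatomic distances, which are the inputs to an invariant model, do not change when the molecule is rotated, whereas a rank-1 quantity rotates as Rµ and a rank-2 quantity as RTRᵀ. The charge-weighted second moment is only a stand-in for a rank-2 target, not the polarizability.

```python
import numpy as np

def random_rotation(rng):
    """Draw a proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))          # fix column signs
    if np.linalg.det(q) < 0:             # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def pairwise_distances(x):
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def dipole(q, x):
    return q @ x                         # sum_i q_i r_i (rank-1)

def second_moment(q, x):
    return np.einsum("i,ia,ib->ab", q, x, x)   # sum_i q_i r_i r_i^T (rank-2)

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))            # hypothetical atom positions
charges = rng.normal(size=5)             # hypothetical point charges
R = random_rotation(rng)
pos_rot = pos @ R.T                      # rotate every atom by R

dist_invariant = np.allclose(pairwise_distances(pos), pairwise_distances(pos_rot))
dipole_equivariant = np.allclose(dipole(charges, pos_rot), R @ dipole(charges, pos))
tensor_equivariant = np.allclose(
    second_moment(charges, pos_rot), R @ second_moment(charges, pos) @ R.T
)
trace_invariant = np.isclose(
    np.trace(second_moment(charges, pos_rot)), np.trace(second_moment(charges, pos))
)
```

The trace check shows why a scalar α column still applies to an invariant model: the isotropic part of a rank-2 tensor is unchanged by rotation, while its individual components are not.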

The ENN baselines are pulled at pinned versions (see requirements.txt) to ensure exact reproducibility.

Running on a remote GPU cluster

The bsub/ subdirectory contains LSF submission scripts that assume the repo lives at $HOME/starg/ on the remote host and that QM9 .xyz files sit under $QM9_DIR (defaulting to $HOME/data/qm9/dsgdb9nsd/; override per-launch with QM9_DIR=/path/to/your/qm9 bash bsub/submit_all.sh). The scripts were written for IBM LSF (bsub) but the per-method invocations are plain Python, so they port to SLURM (sbatch) or local execution by replacing the #BSUB directives.

Sending the code to the cluster

ccc=<user>@<your-cluster-host>
ssh "$ccc" 'mkdir -p ~/starg'
rsync -avz --exclude '.git' --exclude '__pycache__' \
    --exclude 'logs/' --exclude 'results/' \
    ./ "$ccc":~/starg/

scp -r -C -p also works if rsync isn't available, but scp has no --exclude option, so the directories excluded above are copied as well.

Submit jobs

ssh "$ccc" 'cd ~/starg/python/large_scale && bash bsub/submit_all.sh'
ssh "$ccc" 'bjobs -l | head -40'    # watch

Pull results back

rsync -avz "$ccc":~/starg/python/large_scale/results/ \
    ./python/large_scale/results/

Optional: stash the host in ~/.ssh/config

Host ccc
    HostName            <your-cluster-host>
    User                <your-username>
    ServerAliveInterval 60
    ControlMaster       auto
    ControlPath         ~/.ssh/cm-%r@%h:%p
    ControlPersist      10m

Once configured, the commands above shorten to ssh ccc ...

Note on QM9 data location

The bsub files default to QM9_DIR=$HOME/data/qm9/dsgdb9nsd. If your QM9 files live elsewhere, override at submit time without editing the .bsub files:

ssh ccc 'cd ~/starg/python/large_scale && QM9_DIR=/mnt/myshare/qm9/dsgdb9nsd bash bsub/submit_all.sh'
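The Python side can mirror the same default-with-override behaviour. The helper below is a hypothetical sketch and not a function from data/qm9.py. It treats an empty QM9_DIR as unset, matching the usual shell ${QM9_DIR:-default} idiom.

```python
import os
from pathlib import Path

def resolve_qm9_dir(env=None):
    """Return QM9_DIR if set and non-empty, else $HOME/data/qm9/dsgdb9nsd."""
    env = os.environ if env is None else env
    home = Path(env.get("HOME") or Path.home())
    default = home / "data" / "qm9" / "dsgdb9nsd"
    return Path(env.get("QM9_DIR") or default)
```

Passing a plain dict as env keeps the helper easy to test without touching the real environment.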

Hardware sizing

A single A100 (40 GB) handles every ★_G method in under 1 hour for full QM9. MACE on QM9 takes ~6 hours per seed on a single H100. The submission scripts request 1×H100, 16 cores, and 64 GB of host memory by default.