This repository contains the implementation and experimental framework for PAC-Tree, a novel hierarchical data placement strategy designed to minimize network shuffling overhead in distributed join processing. Our approach fundamentally reimagines how data is partitioned and distributed across computing nodes by leveraging query workload characteristics and join key distributions.
├── conf/ # Context args
├── docker/ # Docker compose files
├── model/ # Core partitioning algorithms
│ ├── partition_algorithm.py # Main PAC-Tree implementations
│ ├── partition_tree.py # Tree data structure primitives
│ ├── partition_node.py # Node-level operations and metrics
│ ├── join_eval.py # Join cost evaluation framework
│ └── join_key_selector.py # Join key analysis and selection
├── db/ # Database connectivity and data management
│ ├── connector.py # Multi-database connection handling
│ ├── data_tooler.py # Data preprocessing and sampling
│ └── query_tooler.py # Query parsing and workload extraction
├── queryset/ # Pre-processed benchmark workloads
│ ├── tpch/ # TPC-H query sets
│ ├── tpcds/ # TPC-DS query sets
│ └── imdb/ # IMDB workload
├── dataset/
│ ├── tpch/ # TPC-H dataset
│ ├── tpcds/ # TPC-DS dataset
│ └── imdb/ # IMDB dataset
├── spark/ # Apache Spark integration
│ ├── data_writer.py # Distributed data routing
│ └── test_spark.py # Test Spark SQL query execution
├── examples/ # Example images and results
numpy, pandas, scikit-learn, psycopg2Environment Setup
conda create -n ai4db python=3.9
conda activate ai4db
pip install -r requirements.txt --force-reinstall
cd docker && docker-compose up -d # use docker to start PostgreSQL and Spark services (It is recommended to manually install a Spark cluster on multiple real machines.)
Import necessary Dataset
please follow the instructions of db/get_started/db_preparation.md
Data Generation and Query Preparation
python ./db/data_tooler.py --benchmark=tpch --export_metadata --export_csv
python ./db/query_tooler.py --benchmark=tpch --format --export_mto_paw_queries --export_pac_queries
# Configure more benchmarks please refer to run.sh
PAC-Tree Construction
python ./model/join_key_selector.py --init=True --benchmark=tpch
# Configure more benchmarks please refer to run.sh
Performance Evaluation
python ./model/join_key_selector.py --mode=dubug --benchmark=tpch --command=0
python ./model/join_key_selector.py --mode=dubug --benchmark=tpch --command=1
# Configure more benchmarks please refer to run.sh
Run example:
Output example:
This project is licensed under the MIT License - see the LICENSE file for details.