
Beyond Shuffling: PAC-Tree for Efficient Distributed Joins via Hierarchical Data Placement

Overview

This repository contains the implementation and experimental framework for PAC-Tree, a novel hierarchical data placement strategy designed to minimize network shuffling overhead in distributed join processing. Our approach fundamentally reimagines how data is partitioned and distributed across computing nodes by leveraging query workload characteristics and join key distributions.
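The core idea can be illustrated with a minimal, self-contained sketch (plain Python; all names here are hypothetical and not part of the PAC-Tree code base): when both relations are partitioned by the same function of the join key, every matching pair of rows lands in the same partition, so each partition can be joined locally and no rows need to be shuffled across the network at join time.

```python
# Illustrative sketch only: co-partitioning both relations by a shared
# hash of the join key makes every join match local to one partition,
# so a distributed engine would not need to shuffle rows at join time.
# All names are hypothetical, not the repository's actual API.

def co_partition(rows, key_index, num_partitions):
    """Route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts

def local_join(r_parts, s_parts, r_key, s_key):
    """Join partition-by-partition; no cross-partition traffic needed."""
    out = []
    for rp, sp in zip(r_parts, s_parts):
        index = {}
        for row in rp:                      # build a hash index per partition
            index.setdefault(row[r_key], []).append(row)
        for srow in sp:                     # probe with the co-located side
            for rrow in index.get(srow[s_key], []):
                out.append(rrow + srow)
    return out

orders = [(1, "a"), (2, "b"), (3, "c")]
items = [(1, 10.0), (1, 5.0), (3, 2.5)]
joined = local_join(co_partition(orders, 0, 4),
                    co_partition(items, 0, 4),
                    0, 0)
```

Because both inputs were routed through the same partitioning function on the join key, `local_join` never has to look outside a partition pair to find a match.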

Repository Structure

├── conf/                     # Configuration and context arguments
├── docker/                   # Docker Compose files
├── model/                    # Core partitioning algorithms
│   ├── partition_algorithm.py    # Main PAC-Tree implementations
│   ├── partition_tree.py         # Tree data structure primitives
│   ├── partition_node.py         # Node-level operations and metrics
│   ├── join_eval.py              # Join cost evaluation framework
│   └── join_key_selector.py      # Join key analysis and selection
├── db/                       # Database connectivity and data management
│   ├── connector.py              # Multi-database connection handling
│   ├── data_tooler.py            # Data preprocessing and sampling
│   └── query_tooler.py           # Query parsing and workload extraction
├── queryset/                 # Pre-processed benchmark workloads
│   ├── tpch/                     # TPC-H query sets
│   ├── tpcds/                     # TPC-DS query sets
│   └── imdb/                     # IMDB workload
├── dataset/                  # Benchmark datasets
│   ├── tpch/                     # TPC-H dataset
│   ├── tpcds/                    # TPC-DS dataset
│   └── imdb/                     # IMDB dataset
├── spark/                    # Apache Spark integration
│   ├── data_writer.py            # Distributed data routing
│   └── test_spark.py             # Spark SQL query execution tests
└── examples/                 # Example images and results
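As a rough sketch of the primitives that `model/partition_tree.py` and `model/partition_node.py` provide, a hierarchical partition tree can be pictured as internal nodes that split on a join-key value and leaves that name the target partition (the class and field names below are hypothetical, not the repository's actual API):

```python
# Hypothetical sketch of a hierarchical partition tree: internal nodes
# split on a column value, leaves assign a partition id. Not the actual
# repository implementation.

class PartitionNode:
    def __init__(self, split_key=None, threshold=None,
                 left=None, right=None, partition_id=None):
        self.split_key = split_key        # column index used for the split
        self.threshold = threshold        # rows with key < threshold go left
        self.left, self.right = left, right
        self.partition_id = partition_id  # set only on leaf nodes

    def route(self, row):
        """Walk the tree until a leaf assigns the row a partition id."""
        if self.partition_id is not None:
            return self.partition_id
        child = self.left if row[self.split_key] < self.threshold else self.right
        return child.route(row)

# Two-level tree: split on column 0 at 100, then the left side on column 1.
tree = PartitionNode(
    split_key=0, threshold=100,
    left=PartitionNode(split_key=1, threshold=50,
                       left=PartitionNode(partition_id=0),
                       right=PartitionNode(partition_id=1)),
    right=PartitionNode(partition_id=2),
)
```

Routing a row is then a single root-to-leaf traversal, which is what lets placement decisions follow the join-key distribution rather than a flat hash.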

Usage Instructions

Prerequisites

  • Python 3.9+
  • Apache Spark 3.x (for distributed experiments)
  • PostgreSQL (for data generation and query parsing)
  • Required packages: numpy, pandas, scikit-learn, psycopg2

Quick Start

  1. Environment Setup

    conda create -n ai4db python=3.9
    conda activate ai4db
    pip install -r requirements.txt --force-reinstall
    cd docker && docker-compose up -d  # start PostgreSQL and Spark services via Docker (for realistic results, installing a Spark cluster on multiple physical machines is recommended)
    
  2. Import the Necessary Datasets

    Follow the instructions in db/get_started/db_preparation.md
    
  3. Data Generation and Query Preparation

    python ./db/data_tooler.py --benchmark=tpch --export_metadata --export_csv
    python ./db/query_tooler.py --benchmark=tpch --format --export_mto_paw_queries --export_pac_queries
    # To configure additional benchmarks, refer to run.sh
    
  4. PAC-Tree Construction

    python ./model/join_key_selector.py --init=True --benchmark=tpch
    # To configure additional benchmarks, refer to run.sh
    
  5. Performance Evaluation

    python ./model/join_key_selector.py --mode=debug --benchmark=tpch --command=0
    python ./model/join_key_selector.py --mode=debug --benchmark=tpch --command=1
    # To configure additional benchmarks, refer to run.sh
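As a rough illustration of the kind of metric the evaluation step can report (the repository's actual cost framework lives in `model/join_eval.py`; the function and parameter names below are hypothetical), shuffle cost can be measured as the number of rows whose assigned partition disagrees with the partition of their join partner:

```python
# Hypothetical sketch of a shuffle-cost metric: count S-rows whose
# partition, under a given placement, differs from the partition that
# holds their matching R-row. Names are illustrative, not the repo API.

def shuffle_cost(r_rows, s_rows, r_place, s_place):
    """r_place/s_place map a row to a partition id; the cost is the
    number of S-rows that would have to move to reach their R match."""
    r_parts = {row[0]: r_place(row) for row in r_rows}  # join key -> partition
    return sum(1 for row in s_rows
               if row[0] in r_parts and r_parts[row[0]] != s_place(row))

r = [(1,), (2,), (3,)]
s = [(1, "x"), (2, "y"), (3, "z")]

# Aligned placement: both sides use the same key-based rule, so no row moves.
aligned = shuffle_cost(r, s, lambda row: row[0] % 2, lambda row: row[0] % 2)

# Misaligned placement: S is piled onto partition 0, so matches are remote.
misaligned = shuffle_cost(r, s, lambda row: row[0] % 2, lambda row: 0)
```

A placement strategy that drives this count toward zero for the workload's join keys is exactly what removes shuffling from distributed join execution.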
    

    Running-example and standard-result example images are available in the `examples/` directory.

License

This project is licensed under the MIT License - see the LICENSE file for details.