KGLiDS - Linked Data Science Powered by Knowledge Graphs

Figure: KGLiDS architecture

In recent years, we have witnessed a growing interest in data science, not only from academia but particularly from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet there has so far been no systematic attempt to holistically exploit the knowledge and experience implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, and parameters. Instead, data scientists still spend a considerable amount of their time recovering relevant information and experience from colleagues, trial and error, and lengthy exploration. In this paper, we therefore propose a novel system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and captures them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science, since it allows us to share the essence of pipelines between platforms, companies, and institutions without revealing critical internal information, focusing instead on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal, and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset and pipeline recommendation.

Installation

Clone the kglids repo
Create kglids Conda environment (Python 3.8) and install pip requirements.
Activate the kglids environment

conda activate kglids

Quickstart

Try the Sample KGLiDS Colab notebook for a quick hands-on!

Generating the LiDS graph:

Add the data sources to config.py:

# sample configuration
# list of data sources to process
data_sources = [DataSource(name='benchmark',
                           path='/home/projects/sources/kaggle',
                           file_type='csv')]
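Before running the profiler, it can help to confirm that a configured path actually contains files of the declared type. The following is a minimal sketch of such a pre-flight check. `find_source_files` is a hypothetical helper and not part of KGLiDS, and the recursive search is an assumption about how the data source directory is laid out.

```python
from pathlib import Path


def find_source_files(path, file_type):
    """Return the sorted files under `path` whose extension matches `file_type`.

    Hypothetical helper: searches subdirectories recursively, which is an
    assumption about the directory layout, not documented KGLiDS behavior.
    """
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"data source path does not exist: {root}")
    return sorted(p for p in root.rglob(f"*.{file_type}") if p.is_file())
```

For the sample configuration, `find_source_files('/home/projects/sources/kaggle', 'csv')` would list the CSV files found under that path, or raise an error if the directory is missing.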

Run the Data profiler

cd kg_governor/data_profiling/src/
python main.py

Run the Knowledge graph builder to generate the data_items graph

cd kg_governor/knowledge_graph_construction/src/
python data_global_schema_builder.py

Run the Pipeline abstractor to generate the pipeline named graph(s)

cd kg_governor/pipeline_abstraction/
python pipelines_analysis.py
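The three steps above can also be chained from a single script. This is a sketch only: the directories and script names are taken from the commands above, while `STAGES`, `build_commands`, and `run_all` are hypothetical names introduced here. It assumes each script is started from its own directory, as in the manual steps.

```python
import subprocess
import sys
from pathlib import Path

# (working directory, script) pairs from the steps above, in execution order.
STAGES = [
    ("kg_governor/data_profiling/src", "main.py"),
    ("kg_governor/knowledge_graph_construction/src", "data_global_schema_builder.py"),
    ("kg_governor/pipeline_abstraction", "pipelines_analysis.py"),
]


def build_commands(repo_root, python=sys.executable):
    """Return (cwd, argv) pairs for each stage without running anything."""
    root = Path(repo_root)
    return [(root / cwd, [python, script]) for cwd, script in STAGES]


def run_all(repo_root):
    """Run every stage in order; check=True stops at the first failure."""
    for cwd, argv in build_commands(repo_root):
        print(f"Running {' '.join(argv)} in {cwd}")
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_all('.')` from the repository root, with the kglids environment active, would run the profiler, the graph builder, and the pipeline abstractor in sequence.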

Uploading the LiDS graph to the graph engine (we recommend using Stardog):

Create a database. Note: enable support for RDF* (edge properties), as in the example below.

stardog-admin db create -o edge.properties=true -n Database_name

Add the dataset-graph to the database

stardog data add --format turtle Database_name dataset_graph.ttl

Add the pipeline default graph and named-graphs to the database

stardog data add --format turtle Database_name default.ttl library.ttl

# Upload every pipeline graph (.ttl file) in graphs_dir using pystardog.
import os
import stardog

database_name = 'Database_name'
connection_details = {
    'endpoint': 'http://localhost:5820',
    'username': 'admin',
    'password': 'admin'}
# Assumption: placeholder for the directory holding the generated .ttl files.
graphs_dir = 'path/to/graphs/'

conn = stardog.Connection(database_name, **connection_details)
conn.begin()
ttl_files = [i for i in os.listdir(graphs_dir) if i.endswith('.ttl')]
for ttl in ttl_files:
    conn.add(stardog.content.File(os.path.join(graphs_dir, ttl)),
             graph_uri='http://kglids.org/pipelineResource/')
conn.commit()
conn.close()
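After the upload, a quick way to check that the graphs arrived is to count triples per named graph. The sketch below assumes that pystardog's `Connection.select` returns the standard SPARQL JSON result as a dictionary. `COUNT_QUERY`, `parse_counts`, and `count_triples_per_graph` are hypothetical names and not part of KGLiDS.

```python
# SPARQL 1.1 query: number of triples in each named graph.
COUNT_QUERY = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY ?g
"""


def parse_counts(sparql_json):
    """Turn a SPARQL JSON result into a {graph_uri: triple_count} dict."""
    return {
        row["g"]["value"]: int(row["triples"]["value"])
        for row in sparql_json["results"]["bindings"]
    }


def count_triples_per_graph(database_name, connection_details):
    """Query a running Stardog server; requires pystardog to be installed."""
    import stardog  # imported lazily so parse_counts works without it

    with stardog.Connection(database_name, **connection_details) as conn:
        # Assumption: select() returns the parsed SPARQL JSON result.
        return parse_counts(conn.select(COUNT_QUERY))
```

With the server running, `count_triples_per_graph('Database_name', connection_details)` should return one entry per uploaded named graph. An empty result suggests the upload did not commit.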

Using the KGLiDS APIs:

KGLiDS provides predefined operations in the form of Python APIs that integrate seamlessly with a conventional data science pipeline. Check out the full list of KGLiDS APIs.

LiDS Ontology

To store the created knowledge graph in a standardized and well-structured way, we developed an ontology for linked data science: the LiDS Ontology.
Check out the LiDS Ontology!

Benchmarking

The following benchmark datasets were used to evaluate KGLiDS:

Dataset Discovery in Data Lakes
- Smaller Real
- Synthetic
  (more info on data discovery benchmarks here)
Kaggle
- setup_kaggle_data.py

KGLiDS APIs

See the full list of supported APIs here.

Citing Our Work

If you find our work useful, please cite it in your research.

Publicity

This repository is part of our submission. We will make it available to the public research community upon acceptance.

Questions

For any questions please contact us:

mossad.helali@concordia.ca

shubham.vashisth@concordia.ca

philippe.carrier@concordia.ca

khose@cs.aau.dk

essam.mansour@concordia.ca