# KGLiDS - Linked Data Science Powered by Knowledge Graphs

![KGLiDS_architecture](docs/graphics/kglids_architecture.jpg)
In recent years, we have witnessed a growing interest in data science not only from academia but particularly from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet, there has so far been no systematic attempt to holistically exploit the collected knowledge and experiences that are implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, parameters, etc. Instead, data scientists still spend a considerable amount of their time trying to recover relevant information and experiences from colleagues, trial and error, lengthy exploration, etc. In this paper, we therefore propose a novel system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and captures them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science, since it allows us to share the essence of pipelines between platforms, companies, and institutions without revealing critical internal information, focusing instead on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal, and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset and pipeline recommendation.
## Installation

* Clone the `kglids` repo
* Create a `kglids` Conda environment (Python 3.8) and install the pip requirements
* Activate the `kglids` environment:
```commandline
conda activate kglids
```

## Quickstart

Try the Sample KGLiDS Colab notebook for a quick hands-on!
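Alternatively, for a local setup, the installation steps above can be sketched end to end as follows (a sketch, not a verified script: the clone URL is omitted, and the `requirements.txt` location at the repo root is an assumption):

```shell
# Clone the kglids repo (substitute the repository's clone URL)
git clone <kglids-repo-url> kglids
cd kglids

# Create the kglids Conda environment with Python 3.8
conda create -n kglids python=3.8 -y
conda activate kglids

# Install the pip requirements (assumes requirements.txt at the repo root)
pip install -r requirements.txt
```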

### Generating the LiDS graph

* Add the data sources to [config.py](kg_governor/data_profiling/src/config.py):
```python
# sample configuration
# list of data sources to process
data_sources = [DataSource(name='benchmark', path='/home/projects/sources/kaggle', file_type='csv')]
```
* Run the [Data profiler](kg_governor/data_profiling/src/main.py):
```commandline
cd kg_governor/data_profiling/src/
python main.py
```
* Run the [Knowledge graph builder](kg_governor/knowledge_graph_construction/src/data_global_schema_builder.py) to generate the `data_items` graph:
```commandline
cd kg_governor/knowledge_graph_construction/src/
python data_global_schema_builder.py
```
* Run the [Pipeline abstractor](kg_governor/pipeline_abstraction/pipelines_analysis.py) to generate the pipeline named graph(s):
```commandline
cd kg_governor/pipeline_abstraction/
python pipelines_analysis.py
```
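The `data_sources` list in config.py can register several data lakes for a single profiling run. A minimal illustrative sketch (the `DataSource` dataclass below is a stand-in that only mirrors the fields used in the sample configuration; the second source and its path are hypothetical):

```python
from dataclasses import dataclass

# Stand-in mirroring the fields of the DataSource used in config.py
# (illustrative only; the real class lives in the KGLiDS code base).
@dataclass
class DataSource:
    name: str       # label used to identify the source in the graph
    path: str       # root directory of the data lake
    file_type: str  # file extension to profile, e.g. 'csv'

# Profile two data lakes in one run (second entry is hypothetical).
data_sources = [
    DataSource(name='benchmark', path='/home/projects/sources/kaggle', file_type='csv'),
    DataSource(name='smaller_real', path='/home/projects/sources/smaller_real', file_type='csv'),
]
```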
### Uploading the LiDS graph to a graph engine

We recommend using [Stardog](https://www.stardog.com/):

* Create a database with edge properties enabled (example below; more info [here](https://docs.stardog.com/query-stardog/edge-properties)):
```commandline
stardog-admin db create -o edge.properties=true -n Database_name
```
* Add the dataset graph to the database:
```commandline
stardog data add --format turtle Database_name dataset_graph.ttl
```
* Add the pipeline default graph and named graphs to the database:
```commandline
stardog data add --format turtle Database_name default.ttl library.ttl
```
* Alternatively, upload all generated `.ttl` files programmatically:
```python
import os
import stardog

database_name = 'Database_name'
connection_details = {
    'endpoint': 'http://localhost:5820',
    'username': 'admin',
    'password': 'admin'}
graphs_dir = 'path/to/graphs/'  # directory containing the generated .ttl files

conn = stardog.Connection(database_name, **connection_details)
conn.begin()
ttl_files = [i for i in os.listdir(graphs_dir) if i.endswith('.ttl')]
for ttl in ttl_files:
    conn.add(stardog.content.File(os.path.join(graphs_dir, ttl)),
             graph_uri='http://kglids.org/pipelineResource/')
conn.commit()
conn.close()
```
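Once the graphs are loaded, the database can be queried over SPARQL. A minimal sketch, assuming a local Stardog at `localhost:5820` with the default `admin` credentials and pystardog's `select` call; the helper function and the triple-count query are illustrative, and the graph URI matches the one used in the upload step:

```python
# Named graph URI used for pipeline graphs in the upload step.
PIPELINE_GRAPH = 'http://kglids.org/pipelineResource/'


def count_triples_query(graph_uri: str) -> str:
    """Build a SPARQL query counting the triples inside one named graph."""
    return f'SELECT (COUNT(*) AS ?n) WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}'


def count_pipeline_triples(database: str = 'Database_name',
                           endpoint: str = 'http://localhost:5820') -> int:
    """Run the count against a running Stardog instance (requires a live server)."""
    import stardog  # pystardog; imported lazily so the query helper works without it
    with stardog.Connection(database, endpoint=endpoint,
                            username='admin', password='admin') as conn:
        res = conn.select(count_triples_query(PIPELINE_GRAPH))
        return int(res['results']['bindings'][0]['n']['value'])
```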
### Using the KGLiDS APIs

KGLiDS provides predefined operations in the form of Python APIs that allow seamless integration with a conventional data science pipeline. Check out the full list of [KGLiDS APIs](docs/KGLiDS_apis.md).

## LiDS Ontology

To store the created knowledge graph in a standardized and well-structured way, we developed an ontology for linked data science: the LiDS Ontology.
Check out the [LiDS Ontology](docs/LiDS_ontology.md)!

## Benchmarking

The following benchmark datasets were used to evaluate KGLiDS:

* Dataset Discovery in Data Lakes
  * [Smaller Real](https://github.com/alex-bogatu/d3l)
  * [Synthetic](https://github.com/RJMillerLab/table-union-search-benchmark)
  (more info on the data discovery benchmarks [here](https://arxiv.org/pdf/2011.10427.pdf))
* Kaggle
  * [`setup_kaggle_data.py`](storage/utils/setup_kaggle_data.py)

## KGLiDS APIs

See the full list of supported APIs [here](docs/KGLiDS_apis.md).

## Citing Our Work

If you find our work useful, please cite it in your research.

## Publicity

This repository is part of our submission. We will make it available to the public research community upon acceptance.

## Questions

For any questions, please contact us:

* mossad.helali@concordia.ca
* shubham.vashisth@concordia.ca
* philippe.carrier@concordia.ca
* khose@cs.aau.dk
* essam.mansour@concordia.ca