As Knowledge Graphs (KGs) gain traction in both industry and the public sector, more and more legacy databases are accessed through a KG-based layer. Querying such layers requires mastery of intricate declarative languages such as SPARQL, prompting the need for simpler interfaces, e.g., natural language (NL). However, translating NL questions into SPARQL and executing the resulting queries on top of a KG-based access layer is impractical for two reasons: (i) automatically generating correct SPARQL queries from NL is difficult, as training data is typically scarce, and (ii) executing the resulting queries through a simplistic KG layer automatically derived from an underlying relational schema yields poor results.
To solve both issues, we introduce ValueNet4SPARQL, an end-to-end NL-to-SPARQL system capable of generating high-quality SPARQL queries from NL questions using a transformer-based neural network architecture. ValueNet4SPARQL can reuse neural models that were trained on SQL databases and therefore does not require any additional NL/SPARQL pairs as training data. In addition, our system is able to reconstruct rich schema information in the KG from its relational counterpart using a workload-based analysis, and to faithfully translate complex operations (such as joins or aggregates) from NL to SPARQL. We apply our approach for reconstructing schema information in the KG to a well-known dataset and show that it considerably improves the accuracy of the NL-to-SPARQL results, by up to 36% (for a total SemQL-to-SPARQL translation accuracy of 94%), compared to a standard baseline.
It is important to note that both the databases and the development set SQL queries have been improved to conform with PostgreSQL and best practices in database design.
A number of queries return false-negative result sets. Most often these are queries with a LIMIT 1 clause applied to an ordering column in which several rows share the same value, so more than one row is an equally valid answer. Other false negatives are queries that use AVG and SUM aggregations, where the PostgreSQL result sets are slightly more precise than the SPARQL ones. Additionally, four queries return very long result sets; to prevent a distorted output file, they have been removed from the output CSV and added to the second sheet of false_negatives.xlsx instead.
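The AVG/SUM mismatches arise because result sets are compared exactly, cell by cell. A tolerance-aware comparison along the following lines would treat those rows as matches; this is only an illustrative sketch, not the comparison logic used by the evaluation scripts, and the function name and tolerance value are assumptions.

```python
# Illustrative sketch only -- not the comparison used by the evaluation scripts.
# It shows why AVG/SUM rows end up as false negatives: an exact cell-by-cell
# comparison fails on tiny precision differences, whereas a tolerance-aware
# check like this one accepts them.
import math

def result_sets_match(pg_rows, sparql_rows, tol=1e-6):
    """Compare two result sets, tolerating small numeric differences."""
    if len(pg_rows) != len(sparql_rows):
        return False
    for pg_row, sparql_row in zip(pg_rows, sparql_rows):
        if len(pg_row) != len(sparql_row):
            return False
        for a, b in zip(pg_row, sparql_row):
            if isinstance(a, (int, float)) and isinstance(b, (int, float)):
                if not math.isclose(a, b, rel_tol=tol):  # AVG/SUM precision gap
                    return False
            elif a != b:
                return False
    return True

# Example: PostgreSQL returns a more precise average than the SPARQL endpoint.
print(result_sets_match([(21.333333333333332,)], [(21.3333,)], tol=1e-4))  # True
```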
The latest evaluation results for the SemQL to SQL evaluation and the SemQL to SPARQL evaluation with Enriched Knowledge Graphs are located in ValueNet4SPARQL/src/experiments/SparqlPrediction.
Disclaimer: this code is based on the ValueNet (https://github.com/brunnurs/valuenet) repository.
You can install the dependencies either with pip (pip install -r requirements.txt) or with pipenv (pipenv install). After installing, you can run the tasks either from the command line or in PyCharm. To run them in PyCharm, simply import the run configurations from the .run folder. In addition, you will need to download the spaCy en_core_web_sm model with python -m spacy download en_core_web_sm.
This setup will run both the evaluation for the SemQL-to-SPARQL translation with the enriched Knowledge Graphs and the evaluation for the SemQL-to-SQL translation from the original ValueNet system (using the updated, PostgreSQL-conformant ground-truth queries and executing them against PostgreSQL databases rather than SQLite as in the original system). To run the evaluation for the baseline knowledge graphs, you will need to replace the filename with KG_baseline_ont.json in both playground_evaluation_sparql.py and spider_utils_sparql.py (see the sketch below).
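If you prefer to script the swap, a helper along these lines can do it. Note that the enriched KG filename ("KG_enriched_ont.json") and the script paths used here are assumptions; check both files for the actual values before running it.

```python
# Sketch of a helper that switches both evaluation scripts to the baseline KG.
# The enriched filename and the script locations are assumptions -- verify them
# against the repository before running this.
from pathlib import Path

ENRICHED_KG = "KG_enriched_ont.json"   # assumed name of the enriched KG file
BASELINE_KG = "KG_baseline_ont.json"

for script in ("playground_evaluation_sparql.py", "spider_utils_sparql.py"):
    path = Path("src") / script        # adjust to the actual location in the repo
    text = path.read_text()
    path.write_text(text.replace(ENRICHED_KG, BASELINE_KG))
    print(f"Updated {path}")
```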
A fine-tuned model can be downloaded here.