# KGLac Knowledge Graph Builder

The KG builder component is responsible for leveraging the profiles stored in the document database to build the KG. To do so, it uses RDF-star to represent the entities and their relationships. The KG builder uses the profiles to determine the different relationships between the entities. These relationships include semanticSimilarity, schemaSimilarity, primary key-foreign key (pkfk), and inclusion dependency (an illustrative Turtle-RDR sketch of such triples appears at the end of the Components section below). The generated graph is then hosted on an RDF store supporting RDF-star (for now, Apache Jena or Blazegraph).

### Components

------

#### src

1. **api**
   1. api.py: Contains the different APIs to be used in a Jupyter notebook.
2. **dataanalysis.py:** This file was used to determine the semantic similarities before the embeddings were adopted. It should be removed.
3. **enums:**
   1. relation.py: An enum defining the various relationships reflected in the KG.
4. **out:** A folder containing the resulting KG.
   1. triples.ttl: A file containing the triples of the KG in Turtle serialization.
5. **storage:** A package containing the source files dedicated to querying the document DB and the RDF store.
   1. elasticsearch_client.py: A source file containing the functions used to communicate with Elasticsearch.
   2. kglac_client.py: A source file containing the functions used to communicate with the RDF store. The functions in this file are the implementations of the functions exposed in the api.py source file under the **api** package.
   3. kwtype.py: An enum containing some metadata that was previously used. It should be omitted.
   4. query_templates: A source file containing the query templates used by the kglac_client.py functions to interact with the RDF store.
6. **word_embedding:** A package containing the source files that launch the word embedding server used to determine the existence of semanticSimilarity relationships.
   1. embeddings_client: A source file representing the client to the embedding server.
   2. libclient.py: A source file containing the core functionality used by the embeddings_client source file.
   3. word_embeddings: A source file containing the core functionality offered by the server.
   4. word_embeddings_services.py: A source file containing the services offered by the embedding server. Here we also specify the path of the embeddings to load.
7. config.py: A source file containing the parameters to run the KG builder.
8. label.py: A class representing the object of the label predicate. In addition to the text of the label, the object specifies the language.
9. rdf_builder.py: A source file containing the functions used to create the entities, determine the relationships mentioned above, and dump the triples into the file triples.ttl found under **out**.
10. rdf_builder_coordinator.py: The main source file used to fire up the KG builder. It interacts with the various stores and the rdf_builder to create the KG.
11. rdf_resource.py: A class for an RDF triple component, which can be the subject, the predicate, or the object.
12. triplet.py: A class representing the composed triples. It uses recursion to model RDF-star, since a triple can itself be the subject of another triple.
13. utils.py: A source file containing helper functions, such as generating the label, that are used across source files in different packages (e.g., api.py and rdf_builder.py). It also contains the code to generate the graph visualization using the graphviz library.

#### tests

This folder contains the different tests to run to make sure that the code works properly.
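To make the RDF-star representation concrete, here is a minimal sketch of what an annotated relationship in triples.ttl could look like in Turtle-RDR. The `kglac:` namespace, the column IRIs, and the `kglac:certainty` predicate are illustrative assumptions; the actual vocabulary comes from enums/relation.py and is emitted by rdf_builder.py.

```
@prefix kglac: <http://kglac.example.org/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

# Two column entities (illustrative IRIs, not the builder's real ones).
kglac:movies.title a kglac:Column .
kglac:films.name   a kglac:Column .

# The embedded triple << ... >> states the relationship itself;
# the outer triple annotates it with a similarity score.
<< kglac:movies.title kglac:semanticSimilarity kglac:films.name >>
    kglac:certainty "0.87"^^xsd:double .
```

Annotating the relationship triple directly like this, instead of reifying it into several auxiliary triples, is the reason the builder targets RDF-star-capable stores.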
### How to run?

------

1. Connect to the VM and run Elasticsearch:

   1. Open a terminal and connect to the VM:

      ```
      ssh -i path/to/ahmed-keypairs.pem ubuntu@206.12.92.210
      ```

   2. Go to the **app_servers** folder:

      ```
      cd /mnt/discovery/app_servers
      ```

   3. Run ES 7.10.2:

      ```
      elasticsearch-7.10.2/bin/elasticsearch
      ```

   **Note:** You can use Kibana as a UI to see the profiles and the raw data stored on the different indexes (profiles and raw_data, respectively).

2. Run the KG builder:

   ```
   python rdf_builder_coordinator.py -opath out
   ```

3. Launch Blazegraph:

   1. Open a new terminal and connect to the VM, forwarding port 9999:

      ```
      ssh -i path/to/ahmed-keypairs.pem -L 9999:localhost:9999 ubuntu@206.12.92.210
      ```

   2. Run Blazegraph:

      ```
      cd /mnt/discovery/app_servers/blazegraph
      java -server -Xmx4g -jar blazegraph.jar
      ```

   3. Create a namespace:

      1. Open your browser.
      2. Go to http://localhost:9999/blazegraph.
      3. Go to NAMESPACES.
      4. Create your namespace by specifying its name and setting the mode to rdr to support RDF-star.
      5. Go to UPDATE. If you are uploading the data, specify RDF Data as the type and Turtle-RDR as the format. Otherwise, specify the path of triples.ttl.
      6. Once the data is loaded, go to WELCOME to start writing your queries or using the APIs (an example query is sketched below).
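As a starting point for step 3.6, the sketch below shows a SPARQL query against a Blazegraph namespace in rdr mode. It reuses the illustrative `kglac:` vocabulary from the Turtle-RDR sketch above; the prefix and the `kglac:certainty` predicate are assumptions, not the builder's actual IRIs.

```
PREFIX kglac: <http://kglac.example.org/>

# Find column pairs linked by semanticSimilarity with a score above 0.75.
SELECT ?colA ?colB ?score
WHERE {
  << ?colA kglac:semanticSimilarity ?colB >> kglac:certainty ?score .
  FILTER (?score > 0.75)
}
```

Adjust the prefix, predicate names, and threshold to match the vocabulary actually emitted into triples.ttl.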