
# KGLac Knowledge Graph Builder

The KG builder component is responsible for leveraging the profiles stored in the document database to build the KG. To do so, it uses RDF-star to represent the entities and their relationships. The KG builder uses the profiles to determine the different relationships between the entities, which include semanticSimilarity, schemaSimilarity, primary key-foreign key (pkfk), and inclusion dependency. The generated graph is then hosted on an RDF store that supports RDF-star (for now, Apache Jena or Blazegraph).
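For illustration, here is what such an annotated relationship might look like in Turtle-RDR; the kglac prefix, resource names, and the certainty property are hypothetical stand-ins for the vocabulary the builder actually emits:

```turtle
# Hypothetical example: a pkfk relationship between two columns,
# annotated with a confidence score via RDF-star's quoted-triple syntax.
@prefix kglac: <http://kglac.example.org/> .

<< kglac:tableA_colX kglac:pkfk kglac:tableB_colY >> kglac:certainty "0.95" .
```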

## Components


### src

1. api
   1. api.py: Contains the different APIs to be used in Jupyter notebooks.
2. dataanalysis.py: This module was used to determine the semanticSimilarity relationships before the embeddings were adopted. It should be removed.
3. enums:
   1. relation.py: An enum defining the various relationships reflected in the KG.
4. out: A folder containing the resulting KG.
   1. triples.ttl: A file containing the triples of the KG in Turtle serialization.
5. storage: A package containing the source files dedicated to querying the document DB and the RDF store.
   1. elasticsearch_client.py: A source file containing the functions used to communicate with Elasticsearch.
   2. kglac_client.py: A source file containing the functions used to communicate with the RDF store. The functions in this file implement the functions exposed in api.py under the api package.
   3. kwtype.py: An enum containing some metadata that was previously used. It should be removed.
   4. query_templates: A source file containing the query templates used by the kglac_client.py functions to interact with the RDF store.
6. word_embedding: A package containing the source files that launch the word embedding server used to determine the existence of semanticSimilarity relationships.
   1. embeddings_client: A source file representing the client to the embedding server.
   2. libclient.py: A source file containing core functionality used by the embeddings_client source file.
   3. word_embeddings: A source file containing the core functionality offered by the server.
   4. word_embeddings_services.py: A source file containing the services offered by the embedding server. This is also where the path of the embeddings to load is specified.
7. config.py: A source file containing the parameters to run the KG builder.
8. label.py: A class to represent the object of the label predicate. In addition to the label text, the object specifies the language.
9. rdf_builder.py: A source file containing the functions used to create the entities, determine the relationships mentioned above, and dump the triples into the file triples.ttl found under out.
10. rdf_builder_coordinator.py: The main source file used to fire up the KG builder. It interacts with the various stores and the rdf_builder to create the KG.
11. rdf_resource.py: A class for an RDF triple component, which can be the subject, the predicate, or the object.
12. triplet.py: A class to represent the composed triples. It uses recursion to model RDF-star (see the sketch after this list).
13. utils.py: A source file containing helper functions, such as label generation, that are used across source files in different packages (e.g., api.py and rdf_builder.py). It also contains the code to generate the graph visualization using the graphviz library.
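Since triplet.py models RDF-star through recursion, a minimal sketch of the idea may help; the class and attribute names below are illustrative, not the actual implementation:

```python
# A minimal sketch of recursive triple modeling in the spirit of
# triplet.py; names are illustrative, not the actual implementation.

class Triplet:
    def __init__(self, subject, predicate, obj):
        # subject or obj may themselves be Triplet instances, which is
        # what models RDF-star's nested (quoted) triples.
        self.subject = subject
        self.predicate = predicate
        self.object = obj

    def serialize(self):
        # Recursively render nested triples using the << ... >> syntax.
        def render(term):
            if isinstance(term, Triplet):
                return f"<< {render(term.subject)} {render(term.predicate)} {render(term.object)} >>"
            return str(term)
        return f"{render(self.subject)} {render(self.predicate)} {render(self.object)} ."

# Example: annotate a pkfk relationship with a confidence score.
pkfk = Triplet(":tableA_colX", ":pkfk", ":tableB_colY")
print(Triplet(pkfk, ":certainty", '"0.95"').serialize())
# << :tableA_colX :pkfk :tableB_colY >> :certainty "0.95" .
```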

### tests

This folder contains the tests to run to verify that the code works properly.

## How to run?


1. Connect to the VM and run Elasticsearch:

   1. Open a terminal and connect to the VM:

      ```bash
      ssh -i path/to/ahmed-keypairs.pem ubuntu@206.12.92.210
      ```

   2. Go to the app_servers folder:

      ```bash
      cd /mnt/discovery/app_servers
      ```

   3. Run ES 7.10.2:

      ```bash
      elasticsearch-7.10.2/bin/elasticsearch
      ```

   Note: You can use Kibana as a UI to see the profiles and the raw data stored in the respective indexes (profiles and raw_data).
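   For a quick sanity check without Kibana, you can also list the indexes directly via Elasticsearch's standard REST API (assuming the default port 9200):

   ```bash
   # 'profiles' and 'raw_data' should appear among the indexes
   # once the profiles and raw data have been ingested.
   curl -s 'localhost:9200/_cat/indices?v'
   ```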

2. Run the KG builder:

   ```bash
   python rdf_builder_coordinator.py -opath out
   ```

3. Launch Blazegraph:

   1. Open a new terminal and connect to the VM, forwarding port 9999:

      ```bash
      ssh -i path/to/ahmed-keypairs.pem -L 9999:localhost:9999 ubuntu@206.12.92.210
      ```

   2. Run Blazegraph:

      ```bash
      cd /mnt/discovery/app_servers/blazegraph
      java -server -Xmx4g -jar blazegraph.jar
      ```

   3. Create a namespace:

      1. Open your browser.
      2. Go to http://localhost:9999/blazegraph.
      3. Go to NAMESPACES.
      4. Create your namespace by specifying its name and setting the mode to rdr to support RDF-star.
      5. Go to UPDATE. If you upload the data, select RDF Data as the type and Turtle-RDR as the format. Otherwise, specify the path of triples.ttl.
      6. Once the data is loaded, go to WELCOME to start writing your queries or using the APIs.
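Once the data is loaded, SPARQL-star queries can reference the quoted triples directly from the WELCOME tab. A minimal sketch is shown below; the kglac prefix and property names are illustrative, so substitute the vocabulary actually emitted in triples.ttl:

```sparql
# Hypothetical query: retrieve column pairs linked by a pkfk edge
# together with the certainty annotation attached to that edge.
PREFIX kglac: <http://kglac.example.org/>

SELECT ?col1 ?col2 ?certainty
WHERE {
  << ?col1 kglac:pkfk ?col2 >> kglac:certainty ?certainty .
}
LIMIT 10
```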