
# Data Profiling

A folder containing the different packages and source files of the data profiler.

**analysis:** A package containing the source files dedicated to analyzing the tables. By analyzing we mean interpreting (determining each column's data type) and creating the profiles. It contains:

**interpreter:** A package responsible for determining the data type of the columns of the loaded table. It contains:

**interpreter.py:** A source file responsible for determining the data type of each column (either numerical or textual). These types are determined when loading the CSV file into a Spark DataFrame with inferSchema set to true, which, however, requires Spark to internally go over the file's contents.
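For illustration, a minimal sketch of this type inference, assuming a local file named example.csv; the file name and the numerical/textual split shown are placeholders, not the project's actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import NumericType

spark = SparkSession.builder.appName("interpreter-sketch").getOrCreate()

# inferSchema=True makes Spark scan the file contents to guess column types.
df = spark.read.csv("example.csv", header=True, inferSchema=True)

# Split the columns into numerical and textual based on the inferred schema.
numerical = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
textual = [f.name for f in df.schema.fields if f.name not in numerical]
```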

**profile_creator:** A package responsible for creating the profile of each column depending on its data type. It contains:

**analysers:** A package containing the analyser classes, one dedicated to each data type the interpreter is capable of determining. It contains:

**i_analyser.py:** An interface implemented by all the per-data-type analysers.

**numerical_analyser.py:** A source file used to collect statistics about the numerical columns. Most of the statistics come from the Spark DataFrame's built-in summary() function; however, some, like the number of missing values and the number of distinct values, are calculated using the Resilient Distributed Dataset (RDD) API.
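A hedged sketch of both collection paths, using a toy DataFrame with a hypothetical price column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("numerical-analyser-sketch").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,), (None,), (2.5,)], ["price"])

# Built-in statistics: count, mean, stddev, min, quartiles, max.
df.select("price").summary().show()

# RDD-based statistics, as described above.
col_rdd = df.select("price").rdd.map(lambda row: row[0])
number_of_missing_values = col_rdd.filter(lambda v: v is None).count()
number_of_distinct_values = col_rdd.filter(lambda v: v is not None).distinct().count()
```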

**textual_analyser.py:** A source file used to collect statistics and embeddings about the textual columns. We use RDDs to get the distinct and missing values per column. In addition, we compute a MinHash of size 512 for each column using the datasketch library.
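A minimal sketch of the MinHash computation with datasketch; the example values stand in for a column's distinct values:

```python
from datasketch import MinHash

values = ["red", "green", "blue"]  # stand-in for a column's distinct values

minhash = MinHash(num_perm=512)  # 512 permutations, matching the size above
for v in values:
    minhash.update(v.encode("utf8"))

signature = minhash.hashvalues  # 512 hash values representing the column
```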

**profile_creator.py:** A source file that uses the analysers to create the data profiles.

**utils.py:** Utility functions used across the source files in the package.

**data:** A package containing the data structures to be used, in addition to the functionality to parse the config.yml file. It contains:

**tables:** A package containing the classes of the data structures used to handle the tables extracted from the datasets. For each table type (csv, json, ...), there should exist a class dedicated to parsing that file type. It contains:

**i_table.py:** An interface for the classes used to handle a table based on its type.

**csv_table.py:** A class responsible for storing the information of the CSV files extracted from the datasets mentioned in the config.yml file. It retains the path, dataset name, table name, and origin.
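A sketch of the kind of record such a class might hold; only the four fields named above come from this description, everything else is an assumption:

```python
from dataclasses import dataclass

@dataclass
class CsvTable:
    path: str          # location of the CSV file on disk
    dataset_name: str  # dataset the table belongs to
    table_name: str    # name of the table (file)
    origin: str        # origin of the table
```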

**utils:** A package containing the common functionalities used in the parent package. It contains:

**file_type.py:** An enum that specifies the file types to be considered for parsing.
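A plausible shape for such an enum; only csv and json are mentioned in this README, so the members shown are assumptions:

```python
from enum import Enum

class FileType(Enum):
    CSV = "csv"
    JSON = "json"
```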

**yaml_parser.py:** A source file used to parse the config.yml file.
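A minimal parsing sketch with PyYAML; the datasets key is a hypothetical example of what the config might contain:

```python
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

datasets = config.get("datasets", [])  # hypothetical config key
```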

**data_profile.py:** A class that encapsulates the profile to be stored in the document database.

**raw_data.py:** A class that encapsulates the column values to be stored in the document database.
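For a sense of what gets stored, a hedged sketch of the two kinds of documents; all field names are illustrative assumptions based on the statistics described above:

```python
data_profile = {
    "column_id": "dataset.table.column",  # see utils.py below
    "data_type": "numerical",
    "number_of_missing_values": 1,
    "number_of_distinct_values": 2,
}

raw_data = {
    "column_id": "dataset.table.column",
    "values": [1.0, 2.5, 2.5],
}
```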

**orchestration:** A package containing the functionality that coordinates the different components of the profiler. It contains:

**orchestrator.py:** A class responsible for firing up Elasticsearch, extracting the tables from the datasets specified in the config.yml file, and passing them to the worker threads to be processed.
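An illustrative sketch of the table-extraction step; the directory layout (one folder per dataset with CSV files inside) is an assumption:

```python
import os

def extract_tables(dataset_paths):
    """Collect the paths of all CSV tables under the given dataset folders."""
    tables = []
    for dataset in dataset_paths:
        for root, _, files in os.walk(dataset):
            tables.extend(os.path.join(root, f) for f in files if f.endswith(".csv"))
    return tables
```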

**utils.py:** A source file containing common functionality, such as extracting the tables from the specified datasets and getting their types.

**worker.py:** A class that implements a thread. Each worker is responsible for handling the tables handed to it by the orchestrator; handling a table means interpreting its columns, profiling them, and then storing the profiles in the document database.
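A minimal sketch of this worker pattern, assuming the orchestrator hands tables over through a shared queue (the queue and class names are assumptions):

```python
import queue
import threading

class Worker(threading.Thread):
    def __init__(self, tables: queue.Queue):
        super().__init__()
        self.tables = tables

    def run(self):
        while True:
            table = self.tables.get()
            if table is None:  # sentinel value: no more tables to process
                self.tables.task_done()
                break
            # Interpret the columns, build their profiles, and store them
            # in the document database (omitted in this sketch).
            self.tables.task_done()
```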

**main.py:** Used to run the profiler. You can specify the number of threads to run by passing the number to process_tables as an argument. In addition, you need to specify the path to the config.yml file in the create_tables function. By default, the file is under profiler/src/config/.
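Based on the description above, running the profiler looks roughly like this; the exact signatures of create_tables and process_tables are assumptions:

```python
tables = create_tables("profiler/src/config/config.yml")  # path to config.yml
process_tables(tables, 8)  # 8 is the number of worker threads
```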

**utils.py:** Contains the function that generates an id for each column based on the dataset, table, and file names.
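One plausible implementation of such an id, hashing the names together; the exact inputs and the hashing scheme are assumptions:

```python
import hashlib

def generate_column_id(dataset_name: str, table_name: str, column_name: str) -> str:
    """Derive a stable id for a column from its dataset, table, and column names."""
    key = f"{dataset_name}/{table_name}/{column_name}"
    return hashlib.md5(key.encode("utf8")).hexdigest()
```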

**tests:** Contains tests for the different functionalities in the different packages under src/.