## Data Profiling
This folder contains the profiler's packages and source files.
**analysis:** A package containing the source files dedicated to analyzing the tables. By analyzing we mean interpreting the columns (determining each column's data type) and creating their profiles.
**interpreter:** A package responsible for determining the data type of the columns of the loaded table. It contains:
interpreter.py: A source file responsible for determining the data type of each column (either numerical or textual). These types are determined when loading the CSV file into a Spark dataframe by passing infer_schema set to true as an argument. This will, however, require Spark to internally go over the content.
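The numerical-vs-textual decision the interpreter ends up with can be illustrated in plain Python (a hedged sketch only — the actual interpreter.py relies on the schema Spark infers, not on this code; the column values are made up):

```python
def interpret_column(values):
    """Classify a column as 'numerical' or 'textual'.

    Illustrative only: the real interpreter reads the types off the
    schema Spark infers when the CSV is loaded with infer_schema=True.
    """
    non_missing = [v for v in values if v not in ("", None)]
    try:
        for v in non_missing:
            float(v)  # every non-missing value parses as a number
        return "numerical"
    except (TypeError, ValueError):
        return "textual"

columns = {
    "age": ["23", "41", "", "36"],
    "city": ["Berlin", "Oslo", "Lyon", ""],
}
types = {name: interpret_column(vals) for name, vals in columns.items()}
```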
**profile_creator:** A package responsible for creating the profile of each column depending on its data type. It contains:
***analysers:*** A package containing the analyser classes, each dedicated to one of the data types the interpreter is capable of determining. It contains:
i_analyser.py: An interface implemented by all the per-data-type analysers.
numerical_analyser.py: A source file used to collect statistics about the numerical columns. The collection of the statistics relies on the built-in Spark DataFrame function summary(). However, some statistics, like number_of_missing_values and number_of_distinct_values, are calculated using the Resilient Distributed Datasets (RDD) data structure.
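The statistics gathered for one numerical column can be sketched in plain Python (the actual analyser delegates to Spark's DataFrame.summary() and RDD operations; only the stat names mirror those mentioned above):

```python
import statistics

def numerical_profile(values):
    """Collect summary statistics for one numerical column.

    Sketch only: the real numerical_analyser gets count/mean/stddev/
    min/max from Spark's summary(), and computes the missing and
    distinct counts with RDD operations.
    """
    present = [float(v) for v in values if v not in ("", None)]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present) if len(present) > 1 else 0.0,
        "min": min(present),
        "max": max(present),
        "number_of_missing_values": len(values) - len(present),
        "number_of_distinct_values": len(set(present)),
    }

profile = numerical_profile(["4", "8", "", "8", None, "2"])
```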
textual_analyser.py: A source file used to collect statistics and an embedding for the textual columns. We use RDDs to get the distinct and missing values of each column. In addition, we compute a MinHash of size 512 for each column, using the datasketch library.
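The MinHash idea can be illustrated with the standard library alone (a sketch of the technique, not the project's code — the textual analyser itself uses datasketch's MinHash with num_perm=512):

```python
import hashlib

NUM_PERM = 512  # signature size, matching the size-512 MinHash above

def minhash_signature(values, num_perm=NUM_PERM):
    """Compute a MinHash signature of num_perm entries for a set of
    column values. Stdlib illustration of the technique; the project
    uses datasketch instead."""
    sig = []
    for seed in range(num_perm):
        # One seeded hash function per signature slot; keep the minimum.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(v.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(),
                "big")
            for v in values))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"red", "green", "blue"})
b = minhash_signature({"red", "green", "yellow"})
sim = estimated_jaccard(a, b)  # true Jaccard here is 2/4 = 0.5
```

Columns with many shared distinct values thus get similar signatures, which is what makes the 512-entry sketch useful for comparing columns without storing all their values.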
profile_creator.py: A source file that uses the analysers to create data profiles.
utils.py: utility functions used across the source files in the package.
**data:** A package containing the data structures to be used, in addition to the functionality for parsing the config.yml file. It contains:
**tables:** A package containing the classes of the data structures used to handle the tables extracted from the datasets. For each table type (CSV, JSON, ...) there should exist a class dedicated to parsing the file. It contains:
i_table.py: An interface for the classes used to handle a table based on its type.
csv_table.py: A class responsible for storing the information of CSV files extracted from the datasets mentioned in the config.yml file. It retains the information about the path, dataset name, table name, and origin.
**utils**: A package containing the different common functionalities used in the parent package. It contains:
file_type.py: An enum dedicated to specifying the file types to be considered for parsing.
yaml_parser.py: A source file used to parse the config.yml file.
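Parsing the config can be sketched with PyYAML (a minimal sketch; the `datasets` key and its fields below are an assumed layout, not the actual config.yml schema):

```python
import yaml

CONFIG = """
datasets:                     # assumed layout, not the real schema
  - name: movies
    path: /data/movies
  - name: reviews
    path: /data/reviews
"""

def parse_config(text):
    """Load the YAML config and return the list of dataset entries."""
    config = yaml.safe_load(text)
    return config["datasets"]

datasets = parse_config(CONFIG)
```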
data_profile.py: A class that encapsulates the profile to be stored in the document database.
raw_data.py: A class that encapsulates the column values to be stored in the document database.
**orchestration:** A package containing the functionalities that coordinate the different components of the profiler. It contains:
orchestrator.py: A class responsible for firing up Elasticsearch, extracting the tables from the datasets specified in the config.yml file, and passing them to the worker threads to be processed.
utils.py: A source file containing common functionalities, like extracting the tables from the specified datasets and getting their types.
worker.py: A class that implements a thread. Each worker is responsible for handling the tables handed to it by the orchestrator. Handling a table means interpreting its columns, profiling them, and then storing the profiles in the document database.
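The orchestrator/worker hand-off can be sketched with Python's threading and queue modules (a minimal illustration; the interpret → profile → store pipeline is reduced to a placeholder, and none of the names below are the project's actual identifiers):

```python
import queue
import threading

class Worker(threading.Thread):
    """Consumes tables from a shared queue; the real worker's
    interpret -> profile -> store pipeline is a placeholder here."""

    def __init__(self, tables, results):
        super().__init__()
        self.tables = tables
        self.results = results

    def run(self):
        while True:
            try:
                table = self.tables.get_nowait()
            except queue.Empty:
                return  # no tables left, thread finishes
            # Placeholder for: interpret columns, build profiles,
            # store them in the document database.
            self.results.put(f"profiled:{table}")

tables = queue.Queue()
results = queue.Queue()
for name in ("movies.csv", "reviews.csv", "users.csv"):
    tables.put(name)

workers = [Worker(tables, results) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

profiled = sorted(results.get() for _ in range(results.qsize()))
```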
main.py: Used to run the profiler. You can specify the number of threads to run here by passing the number as an argument to process_tables. In addition, you need to specify the path to the config.yml file in the create_tables function. By default, the file is under profiler/src/config/.
utils.py: A source file containing the function that generates an id for a column based on the dataset, table, and file names.
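Such an id can be generated deterministically, e.g. by hashing the three names together (a hypothetical scheme shown for illustration; the real function's exact construction is not documented here):

```python
import hashlib

def generate_column_id(dataset_name, table_name, file_name):
    """Derive a deterministic id for a column from the dataset, table,
    and file names. Hypothetical scheme, for illustration only."""
    key = f"{dataset_name}/{table_name}/{file_name}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

cid = generate_column_id("movies", "ratings", "ratings.csv")
```

Hashing the concatenated names means the same column always maps to the same id across runs, so re-profiling a dataset overwrites its old documents instead of duplicating them.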
**tests:** Contains tests for the different functionalities in the different packages under src/.