## Data Profiling

A folder containing the different packages and source files.

**analysis**: A package containing the source files dedicated to analyzing the tables. By analyzing we mean interpreting (determining the column data types) and creating the profiles. It contains:

- **interpreter**: A package responsible for determining the data type of the columns of the loaded table. It contains:
  - interpreter.py: A source file responsible for determining the data type of each column (either numerical or textual). These types are determined when loading the csv file into a Spark dataframe with infer_schema passed as true. This, however, requires Spark to internally go over the content.
- **profile_creator**: A package responsible for creating the profile for each column depending on its data type. It contains:
  - **analysers**: A package containing the analyser classes, each dedicated to one of the data types the interpreter is capable of determining. It contains:
    - i_analyser.py: An interface for all the per-data-type analysers.
    - numerical_analyser.py: A source file used to collect statistics about the numerical columns. The collection of statistics relies on the built-in Spark dataframe function summary(). However, some statistics, such as the number of missing values and the number of distinct values, are calculated using the Resilient Distributed Datasets (RDD) data structure.
    - textual_analyser.py: A source file used to collect statistics and embeddings about the textual columns. We use RDDs to get the distinct and missing values per column. In addition, we compute a MinHash of size 512 for each column, using the datasketch library.
  - profile_creator.py: A source file that uses the analysers to create data profiles.
- utils.py: Utility functions used across the source files in the package.
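The numerical/textual distinction the interpreter makes can be sketched in plain Python. This is a hypothetical illustration only (the function name `interpret_column` is invented); the actual interpreter.py delegates the decision to Spark's schema inference, as described above:

```python
def interpret_column(values):
    """Classify a column as 'numerical' or 'textual'.

    A column is numerical if every non-missing value parses as a
    number; a single unparseable value makes it textual.
    (Illustrative stand-in for Spark's infer_schema behaviour.)
    """
    for v in values:
        if v is None or v == "":
            continue  # missing values do not affect the inferred type
        try:
            float(v)
        except ValueError:
            return "textual"
    return "numerical"
```

Spark performs the equivalent inference pass over the file contents when `inferSchema` is enabled, which is why loading is more expensive than with an explicit schema.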
**data**: A package containing the data structures to be used, in addition to the functionality to parse the config.yml file. It contains:

- **tables**: A package containing the classes of the data structures used to handle the tables extracted from the datasets. For each table type (csv, json, ...) there should exist a class dedicated to parsing the file. It contains:
  - i_table.py: An interface for the classes used to handle a table based on its type.
  - csv_table.py: A class responsible for storing the information of the csv files extracted from the datasets mentioned in the config.yml file. It retains the information about the path, dataset name, table name, and origin.
- **utils**: A package containing the different common functionalities used in the parent package. It contains:
  - file_type.py: An enum dedicated to specifying the file types to be considered for parsing.
  - yaml_parser.py: A source file used to parse the config.yaml file.
- data_profile.py: A class that encapsulates the profile to be stored in the document database.
- raw_data.py: A class that encapsulates the column values to be stored in the document database.

**orchestration**: A package containing the functionality to coordinate the different components of the profiler. It contains:

- orchestrator.py: A class responsible for firing up Elasticsearch, extracting the tables from the datasets specified in the config.yml file, and passing them to the worker threads to be processed.
- utils.py: A source file containing common functionalities such as extracting the tables from the specified datasets and getting their types.
- worker.py: A class that implements a thread. Each worker is responsible for handling the table handed to it by the orchestrator. Handling a table means interpreting the columns, profiling them, and then storing them in the document database.

main.py: Used to run the profiler. You can specify the number of threads to run here by passing the number to process_tables as an argument.
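The orchestrator/worker hand-off described above can be sketched as a standard queue of tables consumed by a pool of threads. This is a minimal, self-contained illustration; the names (`process_tables`, the `f"profile({table})"` stand-in for interpret-profile-store) are assumptions, not the project's actual API:

```python
import queue
import threading

def process_tables(tables, num_threads=4):
    """Distribute tables across worker threads, as the orchestrator does."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        # Each worker pulls tables until the queue is drained.
        while True:
            try:
                table = work.get_nowait()
            except queue.Empty:
                return
            # Stand-in for: interpret columns, profile, store in the
            # document database.
            profile = f"profile({table})"
            with lock:
                results.append(profile)

    for t in tables:
        work.put(t)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Filling the queue before starting the workers keeps the shutdown condition simple: an empty queue means there is no more work.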
In addition, you need to specify the path to the config.yml file in the create_tables function. By default, the file is under profiler/src/config/.

utils.py: Contains the function that generates an id for the column based on the dataset, table, and file names.

**tests**: Contains tests for the different functionalities in the different packages under src/
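The id generation in utils.py could look like the following sketch: a deterministic digest over the identifying names, so the same column always maps to the same document id. The function name and hashing scheme here are assumptions; the source only states which names the id is based on:

```python
import hashlib

def generate_column_id(dataset_name, table_name, file_name):
    """Derive a stable column id from the dataset, table, and file names.

    Hypothetical scheme: join the names and hash them, so equal inputs
    always produce the same id and distinct inputs rarely collide.
    """
    key = f"{dataset_name}/{table_name}/{file_name}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```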