# Model Selection - Classification Methods
Apply different machine learning classification methods and select the best model.

<br>

![image](https://user-images.githubusercontent.com/25506296/132857340-5cf80f2d-0107-4a25-9941-278d4b44977e.png)

<br>

## How to run
- Open a terminal or command prompt and make sure the packages listed under "Packages Required" are installed.
- Run the command:
  - `python3 Classification.py training_dataset.tsv test_dataset.tsv`
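As a rough illustration of how a script invoked this way might read its two arguments, the sketch below loads both tab-separated files with pandas. The function name `load_datasets` is hypothetical, and the assumption that each file has a header row is not confirmed by this README; the actual `Classification.py` may work differently.

```python
import sys

import pandas as pd


def load_datasets(train_path, test_path):
    """Read the training and test files as tab-separated tables.

    Assumes (not confirmed by the README) that each file has a header row.
    """
    train = pd.read_csv(train_path, sep="\t")
    test = pd.read_csv(test_path, sep="\t")
    return train, test


if __name__ == "__main__" and len(sys.argv) == 3:
    # sys.argv[1] is the training file, sys.argv[2] is the test file.
    train, test = load_datasets(sys.argv[1], sys.argv[2])
    print(train.shape, test.shape)
```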

<br>

## Data Pre-processing
The 1st, 2nd, 3rd, 4th, and 6th columns are normalized (scaled to the range [0, 1]) because their 
values are considerably larger than those in the other columns. Without normalization, these 
columns would have a greater impact on the predictions than the others. Because the classes are 
imbalanced, random oversampling is applied for the methods that benefit from it.

Note: When the estimator is a classifier and `cv` is an integer or left at its default, `GridSearchCV` uses stratified cross-validation automatically.
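The two pre-processing steps can be sketched as follows. This is a minimal illustration on made-up data, not the code in `Classification.py`: the helper names (`scale_columns`, `random_oversample`), the column names, and the synthetic values are all assumptions. The package list includes imbalanced-learn, whose `RandomOverSampler` provides random oversampling; the sketch uses plain NumPy instead so the idea is visible. It also assumes the scaler is fitted on the training set only and that only the training set is oversampled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Zero-based positions of the 1st, 2nd, 3rd, 4th, and 6th columns.
SCALED_COLUMNS = [0, 1, 2, 3, 5]


def scale_columns(train_X, test_X, columns=SCALED_COLUMNS):
    """Min-max scale the chosen columns, fitting on the training set only."""
    train_X, test_X = train_X.copy(), test_X.copy()
    names = train_X.columns[columns]
    scaler = MinMaxScaler()
    train_X[names] = scaler.fit_transform(train_X[names])
    test_X[names] = scaler.transform(test_X[names])
    return train_X, test_X


def random_oversample(X, y, random_state=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = np.random.default_rng(random_state)
    y_arr = np.asarray(y)
    labels, counts = np.unique(y_arr, return_counts=True)
    target = counts.max()
    positions = []
    for label, count in zip(labels, counts):
        pos = np.flatnonzero(y_arr == label)
        # Draw the missing rows with replacement from this class.
        extra = rng.choice(pos, size=target - count, replace=True)
        positions.append(np.concatenate([pos, extra]))
    positions = np.concatenate(positions)
    X_out = X.iloc[positions].reset_index(drop=True)
    y_out = y.iloc[positions].reset_index(drop=True)
    return X_out, y_out


# Made-up data: six feature columns and an imbalanced binary label.
rng = np.random.default_rng(0)
columns = ["c1", "c2", "c3", "c4", "c5", "c6"]
train_X = pd.DataFrame(rng.uniform(0, 1000, size=(8, 6)), columns=columns)
test_X = pd.DataFrame(rng.uniform(0, 1000, size=(3, 6)), columns=columns)
train_y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

train_scaled, test_scaled = scale_columns(train_X, test_X)
X_balanced, y_balanced = random_oversample(train_scaled, train_y)
print(y_balanced.value_counts().to_dict())
```

Because the scaler is fitted on the training data, test values can fall slightly outside [0, 1]; that is expected and avoids leaking test-set statistics into training.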

<br>

## Models Considered
Multi-Layer Perceptron, Support Vector Machine, and Random Forest models were considered initially 
because they differ substantially from one another. Since Random Forest performed much better than the other two, 
Gradient Boosting, another tree-based ensemble method, was also evaluated.
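A comparison of these four model families could look roughly like the sketch below. It runs on synthetic data from `make_classification`; the parameter grids, the `roc_auc` scoring choice, and `cv=3` are illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real training data.
X, y = make_classification(
    n_samples=200, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Hypothetical, deliberately tiny grids so the sketch runs quickly.
candidates = {
    "mlp": (
        MLPClassifier(max_iter=1000, random_state=0),
        {"hidden_layer_sizes": [(8,), (16,)]},
    ),
    "svm": (SVC(random_state=0), {"C": [0.1, 1.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [25, 50]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # With a classifier and an integer cv, the folds are stratified.
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=3)
    search.fit(X, y)
    results[name] = search.best_score_

# Pick the model family with the highest cross-validated score.
best_name = max(results, key=results.get)
print(best_name, round(results[best_name], 3))
```

Scores on synthetic data say nothing about which model wins on the real dataset; the point is only the shape of the selection loop.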

<br>

## References
- https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
- https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
- https://stackoverflow.com/questions/37689942/grid-search-finding-parameters-for-auc

<br>

## Collaboration
- Prudhvi Kommareddi
- Tarun Subramanian

<br>

## Packages Required
- python = 3.8.5
- numpy = 1.19.2
- pandas = 1.2.1
- scikit-learn = 0.23.2
- seaborn = 0.11.1
- matplotlib = 3.3.2
- imbalanced-learn = 0.8.0