Model Selection - Classification Methods

Apply different machine learning classification methods and select the best model.

How to run

Open a terminal/command line and make sure the packages listed under Packages Required are installed.
Run the command:
- python3 Classification.py training_dataset.tsv test_dataset.tsv

Data Pre-processing

The 1st, 2nd, 3rd, 4th, and 6th columns are normalized (scaled to the range [0, 1]) because their values are considerably larger than those in the other columns. Without normalization, these columns would have a greater impact on the predictions than the others. Because the data is class-imbalanced, Random Oversampling is applied for the methods that benefit from it. Note: GridSearchCV automatically uses stratified cross-validation for classification methods.
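
The two pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in Classification.py: the toy data, the column names (`f1` to `f6`, `label`), and the helper `random_oversample` are all hypothetical, and the oversampling is written in plain pandas to show the idea (the imbalanced-learn package listed under Packages Required provides a `RandomOverSampler` class for the same purpose).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: six feature columns on very different scales
# plus an imbalanced binary label (90 vs. 10 rows).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 6)) * [1000, 500, 200, 100, 1, 300]
train = pd.DataFrame(features, columns=[f"f{i}" for i in range(1, 7)])
train["label"] = [0] * 90 + [1] * 10

# The 1st, 2nd, 3rd, 4th, and 6th columns are the large-valued ones.
large_cols = ["f1", "f2", "f3", "f4", "f6"]

# Scale only those columns to [0, 1] (the default feature_range).
scaler = MinMaxScaler()
train[large_cols] = scaler.fit_transform(train[large_cols])


def random_oversample(df, label_col, seed=0):
    """Keep all rows and duplicate random minority rows until classes match."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = [df]
    for cls, n in counts.items():
        if n < target:
            minority = df[df[label_col] == cls]
            # Sample with replacement to fill the gap to the majority size.
            parts.append(minority.sample(n=target - n, replace=True,
                                         random_state=seed))
    return pd.concat(parts, ignore_index=True)


balanced = random_oversample(train, "label")
print(balanced["label"].value_counts())
```

In a sketch like this, the scaler would normally be fitted on the training set only and then reused on the test set with `scaler.transform`, so that both sets share the same scaling.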

Models Considered

Multi-Layer Perceptron, Support Vector Machine, and Random Forest models, which are quite different from one another, were considered initially. Because Random Forest performed much better than the other two, Gradient Boosting was also considered.
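
One way to compare these four model families with GridSearchCV is sketched below. It is only an illustration under stated assumptions: the synthetic dataset, the parameter grids, the `f1_macro` scoring, and the 3-fold setting are all hypothetical choices and are not taken from Classification.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical imbalanced toy dataset standing in for the real training data.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Each candidate pairs an estimator with a small, illustrative parameter grid.
candidates = {
    "mlp": (MLPClassifier(max_iter=500, random_state=0),
            {"hidden_layer_sizes": [(16,), (32,)]}),
    "svm": (SVC(), {"C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(n_estimators=50, random_state=0),
                      {"max_depth": [3, None]}),
    "gradient_boosting": (GradientBoostingClassifier(n_estimators=50,
                                                     random_state=0),
                          {"learning_rate": [0.05, 0.1]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # With an integer cv and a classifier, GridSearchCV uses stratified folds.
    search = GridSearchCV(estimator, grid, cv=3, scoring="f1_macro")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Pick the model family with the highest cross-validated score.
best_name = max(results, key=lambda n: results[n][0])
for name, (score, params) in results.items():
    print(f"{name}: score={score:.3f}, params={params}")
print("best:", best_name)
```

A macro-averaged F1 score is used here because plain accuracy can look good on imbalanced data even when the minority class is predicted poorly; any other metric accepted by the `scoring` parameter could be substituted.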

Collaboration

Prudhvi Kommareddi
Tarun Subramanian

Packages Required

python = 3.8.5
numpy = 1.19.2
pandas = 1.2.1
scikit-learn = 0.23.2
seaborn = 0.11.1
matplotlib = 3.3.2
imbalanced-learn = 0.8.0