## Binary classification problem

### Dataset

We first study a binary classification problem, solved with a neural network. Each input has two real-valued features, and the output is either 0 or 1. The training set contains 4000 examples and the validation set contains 1000.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

# Display figures on jupyter notebook
%matplotlib inline

In [None]:
# Define a function to generate the dataset, in the form of two interlaced spirals
def spiral(phi):
    x = (phi+1)*torch.cos(phi)
    y = phi*torch.sin(phi)
    return torch.cat((x, y), dim=1)

def generate_data(num_data):
    angles = torch.empty((num_data, 1)).uniform_(1, 15)
    data = spiral(angles)
    # add some noise to the data
    data += torch.empty((num_data, 2)).normal_(0.0, 0.4)
    labels = torch.zeros((num_data,), dtype=torch.int)
    # flip the second half of the points to create two classes
    data[num_data//2:,:] *= -1
    labels[num_data//2:] = 1
    return data, labels

In [None]:
# Generate the training set with 4000 examples by function generate_data

X_train, y_train = generate_data(4000)
X_train.size()

In [None]:
# Define the vis_data function to visualize the dataset
def vis_data(X, y):
    plt.figure(figsize=(5, 5))
    plt.plot(X[y==1, 0], X[y==1, 1], 'r+')  # examples with label 1 are shown as red plusses
    plt.plot(X[y==0, 0], X[y==0, 1], 'b+')  # examples with label 0 are shown as blue plusses

We can now invoke the `vis_data` function on the dataset previously generated to see what it looks like:

In [None]:
vis_data(X_train, y_train) # visualize training set

We use the `TensorDataset` wrapper from PyTorch, so that the framework can treat our tensors as a dataset.

In [None]:
from torch.utils.data import TensorDataset, DataLoader
training_set = TensorDataset(X_train, y_train)
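To make the role of the wrapper concrete, here is a minimal sketch of how a `TensorDataset` behaves once it is handed to a `DataLoader`. The names `X_demo`, `y_demo`, `demo_set`, and `demo_loader` are placeholders introduced for this illustration; the random tensors only mimic the shapes and dtypes of `X_train` and `y_train`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in tensors with the same layout as X_train / y_train
# (random values, used only so that this sketch runs on its own)
X_demo = torch.randn(40, 2)
y_demo = torch.zeros(40, dtype=torch.int)
y_demo[20:] = 1

demo_set = TensorDataset(X_demo, y_demo)

# Indexing the dataset returns one (features, label) pair
x0, y0 = demo_set[0]

# A DataLoader groups these pairs into shuffled mini-batches
demo_loader = DataLoader(demo_set, batch_size=10, shuffle=True)
X_batch, y_batch = next(iter(demo_loader))

print(len(demo_set), x0.shape, X_batch.shape, y_batch.shape)
```

Each batch is a pair of tensors whose first dimension is the batch size, which is the form the training loop consumes.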

### Training the model with a neural network

Here is a skeleton of a small neural network: two linear layers with a ReLU in between, i.e. a single hidden layer of 10 units. This is the model you will improve during this exercise.

Look at the code and run it to see the structure, then follow the questions below to iteratively improve the model.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

As a first step, we define a neural network with just two layers. A useful tutorial on building models can be found [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py).

In [None]:
# Basic network structure with two linear layers (one hidden layer)
class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
        # The model has 2 inputs (the coordinates of the point),
        # a hidden layer of 10 units, and 1 output (the prediction)
        self.l1 = nn.Linear(2, 10)
        self.l2 = nn.Linear(10, 1)

    def forward(self, inputs):
        # Hidden layer followed by a ReLU activation
        h = torch.relu(self.l1(inputs))
        # We want the model to predict 0 for one class and 1 for the other class,
        # so a sigmoid activation on the output seems appropriate
        outputs = torch.sigmoid(self.l2(h))
        return outputs

In [None]:
# Create the model: 
model = Model()

# Choose the hyperparameters for training: 
num_epochs = 10
batch_size = 10

# Training criterion. This one is a mean squared error (MSE) loss between the output
# of the network and the target label
criterion = nn.MSELoss()

# Use SGD optimizer with a learning rate of 0.01
# It is initialized on our model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

#### Training the defined model
More information can be found [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py).

In [None]:
# Define a function for training
def train(num_epochs, batch_size, criterion, optimizer, model, dataset):
    train_error = []
    train_loader = DataLoader(dataset, batch_size, shuffle=True)
    model.train()  # set the model to training mode
    for epoch in range(num_epochs):
        epoch_average_loss = 0.0
        for (X_batch, y_real) in train_loader:
            y_pre = model(X_batch).view(-1)
            loss = criterion(y_pre, y_real.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Weight by the actual batch size, since the last batch may be smaller
            epoch_average_loss += loss.item() * X_batch.size(0) / len(dataset)
        train_error.append(epoch_average_loss)
        print('Epoch [{}/{}], Loss: {:.4f}'
              .format(epoch+1, num_epochs, epoch_average_loss))
    return train_error

In [None]:
train_error = train(num_epochs, batch_size, criterion, optimizer, model, training_set)

In [None]:
# plot the training error wrt. the number of epochs: 
plt.plot(range(1, num_epochs+1), train_error)
plt.xlabel("num_epochs")
plt.ylabel("Train error")
plt.title("Visualization of convergence")

#### Evaluate the model on the validation set

In [None]:
# Generate 1000 validation data:
X_val, y_val = generate_data(1000)

In [None]:
# Predict labels for the validation set
model.eval()  # set the model to evaluation mode
with torch.no_grad():
    y_pre = model(X_val).view(-1)
    #loss = criterion(y_pre, y_val.float())
    #print(loss.item())

In [None]:
# Calculate the accuracy of the predictions with the function accuracy
# Note: y_pre is thresholded in place, so it holds hard 0/1 labels afterwards
def accuracy(y_real, y_pre):
    y_pre[y_pre<0.5] = 0
    y_pre[y_pre>=0.5] = 1

    acc = 1 - torch.sum(torch.abs(y_pre - y_real))/len(y_pre)
    print('Accuracy of the network on {} examples: {:.2f} %'.format(len(y_pre), acc.item()*100))

In [None]:
accuracy(y_val, y_pre)

In [None]:
# Compare the predictions with the real labels
# (y_pre must already contain hard 0/1 labels, as produced by accuracy above)
def compare_pred(X, y_real, y_pre):
    plt.figure(figsize=(10, 5))

    plt.subplot(121)
    plt.plot(X[y_real==1, 0], X[y_real==1, 1], 'r+')  # examples with label 1 are shown as red plusses
    plt.plot(X[y_real==0, 0], X[y_real==0, 1], 'b+')  # examples with label 0 are shown as blue plusses
    plt.title("real data")

    plt.subplot(122)
    plt.plot(X[y_pre==1, 0], X[y_pre==1, 1], 'r+')
    plt.plot(X[y_pre==0, 0], X[y_pre==0, 1], 'b+')
    plt.title("prediction results")

In [None]:
compare_pred(X_val, y_val, y_pre)

### Exercise 1: Impact of the architecture of the model

The class `Model` is the definition of your model. You can now modify it to try out different architectures and
see the impact of the following factors:

* Try adding more layers (1, 2, 3, more?)
* Try different activation functions ([sigmoid](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid), [tanh](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh), [relu](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu), etc.)
* Try changing the number of neurons in each layer (5, 10, 20, more?)
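One possible way to explore these three factors without rewriting the class each time is to make depth, width, and activation constructor arguments. The sketch below is only one option, not the expected solution; `FlexibleModel`, `hidden_sizes`, and `activation` are hypothetical names introduced here.

```python
import torch
import torch.nn as nn

class FlexibleModel(nn.Module):
    """Sketch of a configurable network: 2 inputs, 1 sigmoid output."""

    def __init__(self, hidden_sizes=(20, 20), activation=nn.ReLU):
        super().__init__()
        layers = []
        in_features = 2  # the two coordinates of a point
        for hidden in hidden_sizes:
            layers.append(nn.Linear(in_features, hidden))
            layers.append(activation())  # one activation per hidden layer
            in_features = hidden
        layers.append(nn.Linear(in_features, 1))
        layers.append(nn.Sigmoid())  # keeps the output in [0, 1], as in Model
        self.net = nn.Sequential(*layers)

    def forward(self, inputs):
        return self.net(inputs)

# Three hidden layers of 20 units with tanh activations
deep_model = FlexibleModel(hidden_sizes=(20, 20, 20), activation=nn.Tanh)
out = deep_model(torch.randn(5, 2))
print(out.shape)
```

An instance of this class can be passed to the training function in place of `Model()`, provided the optimizer is re-created on the new model's parameters.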

### Exercise 2: Impact of the optimizer

Retrain the model with different optimizer settings. You can change them in the cell that initializes the optimizer, after the definition of your model.

* Use different batch sizes, from 10 to 400
* Try different values of the learning rate (between 0.001 and 10), and see how these impact the training process. Do all network architectures react the same way to different learning rates?
* Change the duration of the training by increasing the number of epochs
* Try other optimizers, such as [Adam](https://pytorch.org/docs/stable/optim.html?highlight=adam#torch.optim.Adam) or [RMSprop](https://pytorch.org/docs/stable/optim.html?highlight=rmsprop#torch.optim.RMSprop)

**Note:** These changes may interact with your previous choices of architectures, and you may need to change them as well!
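The comparison of optimizers can be organized as a small loop. The sketch below runs on a toy problem so that it is self-contained; `make_optimizer`, the toy data, and the learning rates chosen for each optimizer are assumptions made for illustration, not recommended values. A fresh model is created for each run so that the optimizers start from the same initialization.

```python
import torch
import torch.nn as nn

def make_optimizer(name, parameters, lr):
    """Hypothetical helper: build an optimizer from its name."""
    optimizers = {
        "sgd": torch.optim.SGD,
        "adam": torch.optim.Adam,
        "rmsprop": torch.optim.RMSprop,
    }
    return optimizers[name](parameters, lr=lr)

# Toy data: the label is 1 when the first coordinate is positive
X_demo = torch.randn(32, 2)
y_demo = (X_demo[:, 0] > 0).float()
criterion_demo = nn.MSELoss()

final_losses = {}
for name, lr in [("sgd", 0.01), ("adam", 0.001), ("rmsprop", 0.001)]:
    torch.manual_seed(0)  # same initial weights for every optimizer
    model_demo = nn.Sequential(
        nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 1), nn.Sigmoid()
    )
    optimizer_demo = make_optimizer(name, model_demo.parameters(), lr)
    for _ in range(50):  # a few full-batch steps
        loss = criterion_demo(model_demo(X_demo).view(-1), y_demo)
        optimizer_demo.zero_grad()
        loss.backward()
        optimizer_demo.step()
    final_losses[name] = loss.item()

print(final_losses)
```

In the notebook itself, the inner loop would be replaced by a call to the `train` function on `training_set`.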

### Exercise 3: Impact of the loss function

The current model uses a mean squared error (MSE) loss. While this loss can be used here, it is now rarely used for classification; a Binary Cross Entropy (BCE) loss is used instead. It interprets the output of the network as the probability $p(y | x)$ that the point $x$ belongs to the class $y$, and maximizes the probability of being correct on all samples $x$, that is, it maximizes $\displaystyle \prod_{(x,y) \in Dataset} p(y|x)$. Applying $-\log$ to this quantity, we obtain the following criterion to minimize:

$$ \sum_{(x,y) \in Dataset} - \log p(y | x) $$

This criterion is implemented by the [BCELoss](https://pytorch.org/docs/stable/nn.html?highlight=bce#torch.nn.BCELoss) of PyTorch. Note that it requires its input to be a probability, i.e. a value in $[0,1]$, which requires an appropriate activation function beforehand, e.g., a sigmoid.
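For binary labels, the criterion can be written out explicitly. As a sketch, write $\hat{p}(x)$ for the network output, interpreted as $p(y=1 | x)$ (the notation $\hat{p}$ is introduced here for illustration). Then $p(y | x) = \hat{p}(x)^{y} \, (1-\hat{p}(x))^{1-y}$ for $y \in \{0, 1\}$, and the criterion becomes:

$$ \sum_{(x,y) \in Dataset} - \Big[ y \log \hat{p}(x) + (1-y) \log \big(1-\hat{p}(x)\big) \Big] $$

Note that `BCELoss` averages over the batch by default (`reduction='mean'`) instead of summing, which only rescales the criterion by a constant factor.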

It turns out that, for numerical stability reasons, it is better to combine this sigmoid and the BCELoss into a single function; this is done by the [BCEWithLogitsLoss](https://pytorch.org/docs/stable/nn.html?highlight=bcewithlogit#torch.nn.BCEWithLogitsLoss). Try to replace the MSE with this loss and see how it changes the behavior of the network. This can also interact with the changes made in the two previous exercises.

**Note:** As a consequence, when using the BCEWithLogitsLoss, the last layer of your network should not be followed by an activation function, as BCEWithLogitsLoss already adds a sigmoid.
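A minimal sketch of this setup is shown below, assuming a variant of the model whose `forward` returns raw scores (logits). `LogitModel` and the small demo tensors are hypothetical and only serve the illustration. The sketch also checks that the fused loss matches a sigmoid followed by `BCELoss`, and shows that a sigmoid must be applied explicitly at prediction time before thresholding at 0.5.

```python
import torch
import torch.nn as nn

class LogitModel(nn.Module):
    """Same layers as Model, but without the final sigmoid."""

    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(2, 10)
        self.l2 = nn.Linear(10, 1)

    def forward(self, inputs):
        h = torch.relu(self.l1(inputs))
        return self.l2(h)  # raw logits: BCEWithLogitsLoss applies the sigmoid

torch.manual_seed(0)
logit_model = LogitModel()
X_demo = torch.randn(8, 2)
y_demo = torch.tensor([0, 1, 0, 1, 1, 0, 1, 0])

logits = logit_model(X_demo).view(-1)

# The fused loss on logits agrees with sigmoid + BCELoss on probabilities
loss_fused = nn.BCEWithLogitsLoss()(logits, y_demo.float())
loss_split = nn.BCELoss()(torch.sigmoid(logits), y_demo.float())
print(loss_fused.item(), loss_split.item())

# At prediction time, convert logits to probabilities before thresholding
probs = torch.sigmoid(logits)
preds = (probs >= 0.5).int()
```

With such a model, the `accuracy` function above should receive `torch.sigmoid` of the model output rather than the raw logits, since it thresholds at 0.5.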

### Exercise 4: Prediction on test set

Once you have a model that seems satisfying on the validation dataset, you SHOULD evaluate it on a test dataset that has never been used before, to obtain a final accuracy value.

In [None]:
# Here is a test dataset. Use it similarly to the validation dataset above
# to compute the final performance of your model
X_test, y_test = generate_data(500)
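The final evaluation can be wrapped in a single function that leaves its inputs untouched. The sketch below is one possible way to do it; `evaluate_accuracy` and the stand-in `RuleModel` are hypothetical names, and the function assumes the model outputs probabilities in $[0,1]$ (for a model trained with BCEWithLogitsLoss, apply a sigmoid first or use a threshold of 0 on the logits).

```python
import torch
import torch.nn as nn

def evaluate_accuracy(model, X, y, threshold=0.5):
    """Return the fraction of examples whose thresholded prediction matches y."""
    model.eval()  # evaluation mode
    with torch.no_grad():
        scores = model(X).view(-1)
    preds = (scores >= threshold).to(y.dtype)  # hard 0/1 labels, no in-place change
    return (preds == y).float().mean().item()

class RuleModel(nn.Module):
    """Stand-in model: predicts 1 when the first coordinate is positive."""

    def forward(self, inputs):
        return (inputs[:, :1] > 0).float()

X_demo = torch.tensor([[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])
y_demo = torch.tensor([1, 0, 0, 0], dtype=torch.int)

test_acc = evaluate_accuracy(RuleModel(), X_demo, y_demo)
print('Accuracy: {:.2f} %'.format(test_acc * 100))  # prints "Accuracy: 75.00 %"
```

In the notebook, the call would be `evaluate_accuracy(model, X_test, y_test)` with the trained model.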