dl_in_practice/hw1_basics/TP1_Ex4_Regression.ipynb · MVA-2021

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression Problem: House Prices Prediction\n",
    "In this exercise, we will solve a regression problem with a neural network.\n",
    "\n",
    "**Objective:** The goal is to predict the house selling prices .\n",
    "\n",
    "**Dataset:**  A csv file with 1460 samples is provided (on the course webpage). Each example contains four input features. We will use 1000 examples as training set, 200 as validation set and the rest as test set.   \n",
    "   * **Feature names**: OverallQual, YearBuilt, TotalBsmtSF, GrLivArea\n",
    "   * **Target**: SalePrice\n",
    "\n",
    "**NB:** new required libraries: `pandas`, `seaborn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "import torch.nn.functional as F\n",
    "from torch.utils.data import DataLoader, TensorDataset\n",
    "import seaborn as sns\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data:\n",
    "df = pd.read_csv(\"house_prices.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info() # get more information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Analysis\n",
    "Before training, we need first to analyze the dataset, to know its properties better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(df, x_vars=['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea'], \n",
    "             y_vars=['SalePrice'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### House prices prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a skeleton of a neural network with a single layer (thus: a linear classifier). This is the model you'll start with and improve during this exercise.\n",
    "\n",
    "Look at the code and run it to see its structure, then follow the questions below to iteratively improve the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']] # get the four features from the dataframe\n",
    "y = df['SalePrice'] # get the target values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X.iloc[:1000]\n",
    "y_train = y.iloc[:1000]\n",
    "\n",
    "X_val = X.iloc[1000:1200]\n",
    "y_val = y.iloc[1000:1200]\n",
    "\n",
    "X_test = X.iloc[1200:]\n",
    "y_test = y.iloc[1200:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Construct a model with one layer\n",
    "class Model(nn.Module):\n",
    "    \n",
    "    def __init__(self):\n",
    "        super(Model, self).__init__()\n",
    "        \n",
    "        self.l1 = nn.Linear(4, 1)\n",
    "        \n",
    "    def forward(self, inputs):\n",
    "        outputs = self.l1(inputs)\n",
    "        return outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define hyper-parameters:\n",
    "model = Model()\n",
    "\n",
    "# Choose the hyperparameters for training: \n",
    "num_epochs = 10\n",
    "batch_size = 10\n",
    "\n",
    "# Training criterion. This one is a mean squared error (MSE) loss between the output\n",
    "# of the network and the target label\n",
    "criterion = nn.MSELoss()\n",
    "\n",
    "# Use SGD optimizer with a learning rate of 0.01\n",
    "# It is initialized on our model\n",
    "optimizer = torch.optim.SGD(model.parameters(), lr=0.01)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_set = TensorDataset(torch.from_numpy(np.array(X_train)).float(), \n",
    "                          torch.from_numpy(np.array(y_train)).float()) # creat the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train(num_epochs, batch_size, criterion, optimizer, model, dataset):\n",
    "    dataloader = DataLoader(dataset, batch_size, shuffle=True)\n",
    "\n",
    "    model.train()\n",
    "    for epoch in range(num_epochs):\n",
    "        epoch_average_loss = 0.0\n",
    "        for (X, y) in (dataloader):\n",
    "\n",
    "            y_pre = model(X).view(-1)\n",
    "            loss = criterion(y_pre, y)\n",
    "            optimizer.zero_grad()\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            epoch_average_loss += loss.item() * batch_size / len(train_set)\n",
    "            \n",
    "        if ((epoch+1)%1 == 0):\n",
    "                print('Epoch [{}/{}], Loss_error: {:.4f}'\n",
    "                      .format(epoch+1, num_epochs,  epoch_average_loss))\n",
    "                "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train(num_epochs, batch_size, criterion, optimizer, model, train_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate the Model on the validation set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate mean squared error on validation set\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    y_pre_val = model(torch.from_numpy(np.array(X_val)).float()).view(-1)\n",
    "error = criterion(y_pre_val, torch.tensor(np.array(y_val)).float()).item()\n",
    "print('The loss on validation set is:', error)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1: Impact of the architecture of the model\n",
    "\n",
    "The class `Model` is the definition of your model. You can now modify it to try out different architectures and\n",
    "see the impact of the following factors:\n",
    "\n",
    "* Try to add more layers (1, 2, 3, more ?)\n",
    "* Try different activation functions ([sigmoid](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid), [tanh](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh), [relu](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu), etc.)\n",
    "* Try to change the number of neurons in each layer (5, 10, 20, more ?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Impact of the optimizer\n",
    "\n",
    "Retrain the model with different parameters of the optimizer; you can change then in the cell initializing the optimizer, after the definition of your model.\n",
    "\n",
    "* Use different batch sizes, from 10 to 400 e.g.\n",
    "* Try different values of the learning rate (between 0.001 and 10), and see how they impact the training process. Do all network architectures react the same way to different learning rates?\n",
    "* Change the duration of the training by increasing the number of epochs\n",
    "* Try other optimizers, such as [Adam](https://pytorch.org/docs/stable/optim.html?highlight=adam#torch.optim.Adam) or [RMSprop](https://pytorch.org/docs/stable/optim.html?highlight=rmsprop#torch.optim.RMSprop)\n",
    "\n",
    "**Note:** These changes may interact with your previous choices of architectures, and you may need to change them as well!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Impact of the loss function\n",
    "As mensioned before in the first problem (binary classification), one can minimize the negative of log-likelihood of the probability for all samples $x$: $$ \\sum_{(x,y) \\,\\in\\, \\text{Dataset}} - \\log p(y | x) $$ If we define $p(y_i | x_i) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{(y_i - f(x_i))^2}{2}}$, then the loss function becomes the mean squared error. \n",
    "\n",
    "There is another loss function worth to try: the Gaussian likelihood loss function. \n",
    "Rather than predicting a single value $y$ given $x$, we predict a probability distribution over possible answers, which helps dealing with ambiguous cases and expressing uncertainty. To do this, for each possible input $x$, the network will output the parameters of the distribution modeling $p(y|x)$. For instance in our case, we choose to model output distributions with Gaussian distributions $\\mathcal{N}(\\mu, \\sigma)$, which are parameterized by their mean $\\mu$ and their standard deviation $\\sigma$. Therefore for each input $x$ we have to output two quantities: $\\mu(x)$ and $\\sigma(x)$. The probability becomes: $$p(y_i | x_i) = \\frac{1}{\\sqrt{2\\pi \\sigma(x_i)^2}}e^{-\\frac{(y_i - \\mu(x_i))^2}{2\\sigma(x_i)^2}}$$ Then the loss function becomes: $$L =\\sum\\limits_{i=1}^{N}  \\frac{1}{2} \\log ( 2\\pi\\sigma_i^{2} ) + \\frac{1}{2\\sigma_i^{2}}  (y_{i} - \\mu_i)^{2}$$ If we set $\\sigma=1$, we obtain MSE the loss function. \n",
    "\n",
    "* Try to replace the loss function with this one, and compare the differences between the two losses.\n",
    " \n",
    "* **Hints**: \n",
    "    * You need two outputs of your network, one represents the $\\mu(x_i)$, another for $\\log( \\sigma(x_i)^2 )$ (better for optimization) \n",
    "    * Try deeper models, or you will not predict the variance $\\sigma$ well. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercice 4: Prediction on test set\n",
    "\n",
    "* Once you have a model that seems satisfying on the validation dataset, you SHOULD evaluate it on a test dataset that has never been used before, to obtain a final accuracy value.\n",
    "* When using the Gaussian likelihood function, the confidence of the network in its prediction is reflected in the variance it outputs. It can be interesting to check how this uncertainty varies with the data. For example, the uncertainty will decrease when the feature `OverallQual` increases. Plot the variance $\\sigma(x)$ w.r.t one of the three features, on test set, and describe what you observe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}