{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "DKpp1EzDQXDt" }, "source": [ "### Objective\n", "We have to build a model that translates Italian to English.\n", "\n", "### Basic Information\n", "\n", "
\n", "1. Download the Italian-to-English translation dataset from here.\n", "\n", "2. Preprocess that data.\n", "\n", "3. Build an encoder-decoder architecture with:\n", "\n", "Encoder - 1-layer LSTM\n", "Decoder - 1-layer LSTM\n", "Attention - Global attention\n", "\n", "4. In Global attention, we have 3 types of scoring functions. As a part of this assignment you need to create 3 models, one for each scoring function.\n", " In model 1 you need to implement the \"dot\" score function.\n", " In model 2 you need to implement the \"general\" score function.\n", " In model 3 you need to implement the \"concat\" score function.\n", "\n", "5. Using the attention weights, plot the attention plots.\n", "\n", "6. Use the BLEU score as the metric to evaluate the model and SparseCategoricalCrossentropy as the loss." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "_ZWC7laEhJGg" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "import re\n", "import tensorflow as tf\n", "from tqdm import tqdm\n", "import math\n", "import os\n", "import time\n", "import matplotlib.ticker as ticker\n", "import random\n", "import nltk.translate.bleu_score as bleu\n", "from sklearn.model_selection import train_test_split\n", "import joblib\n", "import pickle\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate, BatchNormalization, Dropout, Conv2D, Conv1D, MaxPooling1D, LSTM, Softmax, GRU\n", "from tensorflow.keras.models import Model\n", "%load_ext tensorboard" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "H7TbUw7Pr29U", "outputId": "18475614-e31c-4730-aab5-90e25fa3f691" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mounted at /content/drive/\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive/')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "YSGt2UqvdgTF" }, "outputs": [], "source": [ "txt=open('/content/drive/My Drive/seq2seq/ita-eng/Ita.txt','r')\n", "d=txt.readlines()" ] }, { "cell_type": "markdown", "metadata": { "id": "-RJKU_Rzwmqo" }, "source": [ "## PRE PROCESSING" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "T-PbZENShVw9" }, "outputs": [], "source": [ "def pre_txt(data):\n", "    eng=[]\n", "    ita=[]\n", "    for i in tqdm(data):\n", "        u=i.lower()\n", "        # expand contractions that survive the character filter below\n", "        u=re.sub(r\"won't\", 'will not', u)\n", "        u=re.sub(r\"can't\", 'can not', u)\n", "        u=re.sub(r\"o'clock\", '', u)\n", "        u=re.sub(r\"'m\", ' am', u)\n", "        u=re.sub(r\"'ll\", ' will', u)\n", "        u=re.sub(r\"'d\", ' had', u)\n", "        u=re.sub(r\"'s\", ' is', u)\n", "        u=re.sub(r\"'ve\", ' have', u)\n", "        u=re.sub(r\"'re\", ' are', u)\n", "        u=re.sub(r\"n't\", ' not ', u)  # e.g. \"haven't\" -> \"have not\"\n", "        # put spaces around punctuation so it tokenizes separately\n", "        u=re.sub(r\"([?.!,¿])\", r\" \\1 \", u)\n", "\n", "        u=u.split('\\t')\n", "        p=re.sub(r\"[^a-zA-Z?.!,¿]+\", \" \", u[0])\n", "        q=re.sub(r\"[^a-zA-Z?.!,¿]+\", \" \", u[1])\n", "        # the original cell was truncated here; a plausible completion:\n", "        # wrap each sentence in start/end tokens (an assumed convention\n", "        # for the decoder) and collect the sentence pairs\n", "        eng_inp='<start> '+p.strip()+' <end>'\n", "        ita_inp='<start> '+q.strip()+' <end>'\n", "        eng.append(eng_inp)\n", "        ita.append(ita_inp)\n", "    return eng, ita" ] },
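The three Global (Luong-style) scoring functions named in step 4 — "dot", "general", and "concat" — can be sketched in plain NumPy before wiring them into the Keras models. This is a minimal illustration, not the assignment's implementation: the parameter names `Wa` and `va`, the shapes, and the random inputs are all assumptions made for the sketch.

```python
import numpy as np

# Illustrative shapes: enc_out holds T encoder hidden states of size `units`;
# dec_h is the decoder hidden state at one timestep.

def dot_score(dec_h, enc_out):
    """dot: score(h_t, h_s) = h_s . h_t  -> shape (T,)"""
    return enc_out @ dec_h

def general_score(dec_h, enc_out, Wa):
    """general: score(h_t, h_s) = h_s^T Wa h_t, with Wa: (units, units)"""
    return enc_out @ (Wa @ dec_h)

def concat_score(dec_h, enc_out, Wa, va):
    """concat: score(h_t, h_s) = va^T tanh(Wa [h_t; h_s]),
    with Wa: (units, 2*units) and va: (units,)"""
    T = enc_out.shape[0]
    pairs = np.concatenate([np.tile(dec_h, (T, 1)), enc_out], axis=1)  # (T, 2*units)
    return np.tanh(pairs @ Wa.T) @ va

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

units, T = 4, 3
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(T, units))
dec_h = rng.normal(size=(units,))
Wa_g = rng.normal(size=(units, units))        # hypothetical learned weights
Wa_c = rng.normal(size=(units, 2 * units))
va = rng.normal(size=(units,))

# Each score vector is softmaxed into attention weights over the T source
# steps, and the context vector is the weighted sum of encoder states.
for score in (dot_score(dec_h, enc_out),
              general_score(dec_h, enc_out, Wa_g),
              concat_score(dec_h, enc_out, Wa_c, va)):
    weights = softmax(score)      # (T,) attention weights, sum to 1
    context = weights @ enc_out   # (units,) context vector
```

In the actual models, `Wa` and `va` would be trainable `Dense` layers inside a custom attention layer, and the softmax/context step is the same for all three scores — only the scoring function changes between model 1, 2, and 3.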