{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "DKpp1EzDQXDt" }, "source": [ "### Objective\n", "We have to build a model that translates Italian to English.\n", "\n", "### Basic Information\n", "\n", "<pre>\n", "1. Download the Italian to English translation dataset from <a href=\"http://www.manythings.org/anki/ita-eng.zip\">here</a>\n", "\n", "2. Preprocess the data. \n", "\n", "3. Build an encoder-decoder architecture with \n", "\n", "Encoder - with 1 layer LSTM \n", "Decoder - with 1 layer LSTM\n", "attention - \n", "\n", "4. In Global attention, we have 3 types of scoring functions.\n", " As a part of this assignment <strong>you need to create 3 models, one for each scoring function.</strong>\n", "<img src='https://i.imgur.com/iD2jZo3.png'>\n", " In model 1 you need to implement \"dot\" score function\n", " In model 3 you need to implement \"concat\" score function\n", " \n", "\n", "5. Using attention weights, we have to plot the attention plots.\n", "\n", "6. Use BLEU score as the metric to evaluate the model and SparseCategoricalCrossentropy as the loss.\n", " \n", "</pre>" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "_ZWC7laEhJGg" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "import re\n", "import tensorflow as tf\n", "from tqdm import tqdm\n", "import math\n", "import os\n", "import time\n", "import matplotlib.ticker as ticker\n", "import random\n", "import nltk.translate.bleu_score as bleu\n", "from sklearn.model_selection import train_test_split\n", "import joblib\n", "import pickle\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.layers import Input, Embedding,Flatten,Dense,Concatenate,BatchNormalization,Dropout,Conv2D,Conv1D,MaxPooling1D,LSTM,Softmax,GRU\n", "from tensorflow.keras.models import Model\n", "%load_ext tensorboard" ] }, { 
"cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "H7TbUw7Pr29U", "outputId": "18475614-e31c-4730-aab5-90e25fa3f691" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mounted at /content/drive/\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive/')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "YSGt2UqvdgTF" }, "outputs": [], "source": [ "txt=open('/content/drive/My Drive/seq2seq/ita-eng/Ita.txt','r')\n", "d=txt.readlines()" ] }, { "cell_type": "markdown", "metadata": { "id": "-RJKU_Rzwmqo" }, "source": [ "## PRE PROCESSING" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "T-PbZENShVw9" }, "outputs": [], "source": [ "def pre_txt(data):\n", "    eng=[]\n", "    ita=[]\n", "    for i in tqdm(data):\n", "        u=i.lower()\n", "        u=re.sub(r\"'m\", ' am', u)\n", "        u=re.sub(r\"'ll\", ' will', u)\n", "        u=re.sub(r\"'d\", ' had', u)\n", "        u=re.sub(r\"'s\", ' is', u)\n", "        u=re.sub(r\"'ve\", ' have', u)\n", "        u=re.sub(r\"'re\", ' are', u)\n", "        u=re.sub(r\"won't\", 'would not', u)\n", "        u=re.sub(r\"can't\", 'can not', u)\n", "        u=re.sub(r\"o'clock\", '', u)\n", "        u=re.sub(r\"n't\", ' not ', u)#\"haven't\", ' don't\n", "        u=re.sub(r\"([?.!,¿])\", r\" \\1 \", u)\n", "\n", "        u=u.split('\\t')\n", "        p= re.sub(r\"[^a-zA-Z?.!,¿]+\", \" \", u[0])\n", "        q= re.sub(r\"[^a-zA-Z?.!,¿]+\", \" \", u[1])\n", "        eng_inp='<sos> ' + p + '<eos>'\n", "        ita_inp='<sos> ' + q + '<eos>'\n", "        if ita_inp.split('<eos>')[0][-1].isalpha()==True:\n", "            ita_inp=ita_inp.replace('<eos>',' <eos>')\n", "\n", "        eng.append(eng_inp)\n", "        ita.append(ita_inp)\n", "\n", "    return eng,ita" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JQktTGxU2P1O", "outputId": "0107c782-50ba-4669-c4e3-631533c25f82" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 
336614/336614 [00:08<00:00, 39534.58it/s]\n" ] } ], "source": [ "eng_txt,ita_txt=np.array(pre_txt(d))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tt0S_EbY56GT", "outputId": "8408d8ef-1e7e-4ce4-bc00-a09dcc89daaf" }, "outputs": [ { "data": { "text/plain": [ "((336614,), (336614,))" ] }, "execution_count": 7, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "eng_txt.shape,ita_txt.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "PDyV0njeE_N7" }, "source": [ "## WORD ANALYSIS" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 614 }, "id": "JCGKlvtRQHvJ", "outputId": "5dd5174a-264c-4411-c979-336f4d017f7a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pdf : [1.85449803e-01 5.58140778e-01 2.14444438e-01 3.42766492e-02\n", " 5.08297338e-03 1.07541576e-03 5.82269305e-04 2.28748656e-04\n", " 2.31719417e-04 1.18830471e-04 1.66362659e-04 7.72398058e-05\n", " 3.26783794e-05 3.86199029e-05 1.18830471e-05 1.78245706e-05\n", " 2.97076176e-06 2.07953323e-05] \n", "\n", "bin edge : [ 4. 6.83333333 9.66666667 12.5 15.33333333 18.16666667\n", " 21. 23.83333333 26.66666667 29.5 32.33333333 35.16666667\n", " 38. 40.83333333 43.66666667 46.5 49.33333333 52.16666667\n", " 55. 
] \n", "\n", "outlier : [0.18544980303849506, 0.7435905814969073, 0.958035019339659, 0.9923116685580515, 0.9973946419340846, 0.9984700576921932, 0.9990523269976885, 0.9992810756534188, 0.9995127950709118, 0.999631625541421, 0.999797988200134, 0.999875228005965, 0.999907906385355, 0.9999465262882705, 0.9999584093353214, 0.9999762339058978, 0.9999792046676605, 0.9999999999999997]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 576x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "counts, bin_edges = np.histogram([len(i.split(' ')) for i in ita_txt], bins=18,density = True,)\n", "pdf = counts/(sum(counts))\n", "print('pdf : ',pdf,'\\n');\n", "print('bin edge : ',bin_edges,'\\n')\n", "cdf = np.cumsum(pdf)\n", "plt.figure(figsize=(8,6))\n", "plt.plot(bin_edges[1:],pdf,label='Histogram of Italian Text')\n", "plt.plot(bin_edges[1:], cdf,label='Cumulative distribution of Italian Text')\n", "plt.title('histogram and cumulative distribution of Italian Text')\n", "plt.legend()\n", "plt.grid()\n", "c=0\n", "q=[]\n", "for i in pdf:\n", " c=c+i\n", " q.append(c)\n", "print('outlier : ',q)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 614 }, "id": "IqcfK6OYMqxH", "outputId": "8883d8ed-a27e-4fff-aaee-39cf7c26e008" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pdf : [9.84777817e-02 5.62825670e-01 2.86702276e-01 3.59610711e-02\n", " 1.17612458e-02 2.64397797e-03 6.50596826e-04 8.61520911e-05\n", " 8.31813294e-05 2.64397797e-04 1.72304182e-04 2.19836370e-04\n", " 5.64444735e-05 3.26783794e-05 1.18830471e-05 1.18830471e-05\n", " 2.37660941e-05 1.48538088e-05] \n", "\n", "bin edge : [ 4. 
6.72222222 9.44444444 12.16666667 14.88888889 17.61111111\n", " 20.33333333 23.05555556 25.77777778 28.5 31.22222222 33.94444444\n", " 36.66666667 39.38888889 42.11111111 44.83333333 47.55555556 50.27777778\n", " 53. ] \n", "\n", "outlier : [0.09847778167277654, 0.6613034514310161, 0.9480057276286786, 0.9839667987665398, 0.9957280445851926, 0.9983720225540234, 0.9990226193800615, 0.9991087714711807, 0.9991919528005372, 0.9994563505974202, 0.9996286547796587, 0.9998484911501008, 0.9999049356235926, 0.9999376140029826, 0.9999494970500336, 0.9999613800970845, 0.9999851461911863, 0.9999999999999999]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 576x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "counts, bin_edges = np.histogram([len(i.split(' ')) for i in eng_txt], bins=18,density = True,)\n", "pdf = counts/(sum(counts))\n", "print('pdf : ',pdf,'\\n');\n", "print('bin edge : ',bin_edges,'\\n')\n", "cdf = np.cumsum(pdf)\n", "plt.figure(figsize=(8,6))\n", "plt.plot(bin_edges[1:],pdf,label='Histogram of English Text')\n", "plt.plot(bin_edges[1:], cdf,label='Cumulative distribution of English Text')\n", "plt.title('histogram and cumulative distribution of English Text')\n", "plt.legend()\n", "plt.grid()\n", "c=0\n", "q=[]\n", "for i in pdf:\n", " c=c+i\n", " q.append(c)\n", "print('outlier : ',q)" ] }, { "cell_type": "markdown", "metadata": { "id": "s8owcQwh2HOh" }, "source": [ "#### Observation\n", "1. About 99% of sentences contain fewer than 16 words.
" ] }, { "cell_type": "markdown", "metadata": { "id": "ZMo10m8AFgfP" }, "source": [ "#### TOKENISATION" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "nwtlIhnHbojC" }, "outputs": [], "source": [ "def tokenize(lang):\n", "    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')\n", "    lang_tokenizer.fit_on_texts(lang)\n", "\n", "    tensor = lang_tokenizer.texts_to_sequences(lang)\n", "\n", "    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,\n", "                                                           padding='post')\n", "    print(len(lang_tokenizer.word_index)+1)\n", "    #len(tok.word_index) + 1\n", "\n", "    return tensor, lang_tokenizer" ] }, { "cell_type": "markdown", "metadata": { "id": "1ju-uDYqFo9Y" }, "source": [ "#### REMOVE SENTENCES LONGER THAN 16 WORDS" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "ksQ6wsHMRW1I" }, "outputs": [], "source": [ "ita_txt_new=[]\n", "eng_txt_new=[]\n", "for t,e in zip(ita_txt,eng_txt):\n", "    if len(t.split(' '))<=16 and len(e.split(' '))<=16:\n", "        eng_txt_new.append(e)\n", "        ita_txt_new.append(t)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "Wv3tNdg0SQy1" }, "outputs": [], "source": [ "eng_txt=eng_txt_new\n", "ita_txt=ita_txt_new" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "A-sbdHPp5ZJD", "outputId": "e4c289cf-857e-4e33-cac4-93bbd0067f81" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25397\n", "12656\n" ] } ], "source": [ "input_tensor, inp_lang_tokenizer = tokenize(ita_txt)\n", "target_tensor, targ_lang_tokenizer = tokenize(eng_txt)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_vKQJ2dfWOcA", "outputId": "e3db42f8-8ff4-4459-dce6-c67686e49bb6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input shape : (334096, 16)\n", "target shape : (334096, 16)\n" ] } ], "source": [ "print('input shape : 
',input_tensor.shape)\n", "print('target shape : ',target_tensor.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "0739pHZP-NNL" }, "source": [ "#### Saving all the files" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "id": "vOseIs7AIam7" }, "outputs": [], "source": [ "pickle.dump(input_tensor, open('/content/drive/My Drive/seq2seq/input_tensor', 'wb'))\n", "pickle.dump(target_tensor, open('/content/drive/My Drive/seq2seq/target_tensor', 'wb'))\n", "pickle.dump(inp_lang_tokenizer, open('/content/drive/My Drive/seq2seq/inp_lang_tokenizer', 'wb'))\n", "pickle.dump(targ_lang_tokenizer, open('/content/drive/My Drive/seq2seq/targ_lang_tokenizer', 'wb'))" ] }, { "cell_type": "markdown", "metadata": { "id": "Ol933ERe-Qro" }, "source": [ "#### Loading all the files" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "id": "DwlFvtFE-U-N" }, "outputs": [], "source": [ "input_tensor=pickle.load(open('/content/drive/My Drive/seq2seq/input_tensor', 'rb'))\n", "target_tensor=pickle.load(open('/content/drive/My Drive/seq2seq/target_tensor', 'rb'))\n", "inp_lang_tokenizer=pickle.load(open('/content/drive/My Drive/seq2seq/inp_lang_tokenizer', 'rb'))\n", "targ_lang_tokenizer=pickle.load(open('/content/drive/My Drive/seq2seq/targ_lang_tokenizer', 'rb'))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kPFGPldLQ4v2", "outputId": "c09c8417-85b4-4690-f17d-cf8585b758ac" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input shape : (334096, 16)\n", "target shape : (334096, 16)\n" ] } ], "source": [ "print('input shape : ',input_tensor.shape)\n", "print('target shape : ',target_tensor.shape)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "tGPBB1E1Q_-F" }, "outputs": [], "source": [ "decoder_input_target_tensor=[]\n", "for i in target_tensor:\n", " i=list(i)\n", " if 1 in i:\n", " i.remove(2)\n", " i.append(0) \n", " 
decoder_input_target_tensor.append(i) \n", "decoder_input_target_tensor=np.array(decoder_input_target_tensor)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "QfFXFbibTeYt" }, "outputs": [], "source": [ "decoder_output_target_tensor=[]\n", "for i in target_tensor:\n", " i=list(i)\n", " if 1 in i:\n", " i.remove(1)\n", " i.append(0) \n", " decoder_output_target_tensor.append(i) \n", "decoder_output_target_tensor=np.array(decoder_output_target_tensor)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "AYS9xc_tWsyz" }, "outputs": [], "source": [ "input_tensor_train, input_tensor_val,decoder_input_target_tensor_train,decoder_input_target_tensor_val ,decoder_output_target_tensor_train, decoder_output_target_tensor_val, = train_test_split(input_tensor, decoder_input_target_tensor,decoder_output_target_tensor, test_size=0.18,random_state=42)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wznid5K_Ws86", "outputId": "329081db-cf6f-4ce3-e0d2-b83fddc6da54" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train input size : (273958, 16)\n", "train input-output size : (273958, 16)\n", "train output-output size : (273958, 16)\n" ] } ], "source": [ "print('train input size : ',input_tensor_train.shape)\n", "print('train input-output size : ',decoder_input_target_tensor_train.shape)\n", "print('train output-output size : ',decoder_output_target_tensor_train.shape)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "D4sFbGB-2pWA", "outputId": "c05579b8-8df7-416c-8642-372a33956528" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train input size : (60138, 16)\n", "train input-output size : (60138, 16)\n", "train output-output size : (60138, 16)\n" ] } ], "source": [ "print('train input size : ',input_tensor_val.shape)\n", "print('train 
input-output size : ',decoder_input_target_tensor_val.shape)\n", "print('train output-output size : ',decoder_output_target_tensor_val.shape)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2B4Rgdf9cHzJ", "outputId": "31ee9e6b-243a-407b-c441-d6537f442d8e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Language; index to word mapping\n", "1 ----> <sos>\n", "5601 ----> salti\n", "3 ----> .\n", "2 ----> <eos>\n", "\n", "Target input Language; index to word mapping\n", "1 ----> <sos>\n", "1995 ----> jump\n", "3 ----> .\n", "\n", "Target output Language; index to word mapping\n", "1995 ----> jump\n", "3 ----> .\n", "2 ----> <eos>\n" ] } ], "source": [ "def convert(lang, tensor):\n", " for t in tensor:\n", " if t!=0:\n", " print (\"%d ----> %s\" % (t, lang.index_word[t]))\n", "\n", "print (\"Input Language; index to word mapping\")\n", "convert(inp_lang_tokenizer, input_tensor[10])\n", "print()\n", "print (\"Target input Language; index to word mapping\")\n", "convert(targ_lang_tokenizer, decoder_input_target_tensor[10])\n", "print()\n", "print (\"Target output Language; index to word mapping\")\n", "convert(targ_lang_tokenizer, decoder_output_target_tensor[10])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2qASFQn2wUHO", "outputId": "1b4f1ccd-67ea-4e87-c8af-29996129334a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train input size : (273552, 16)\n", "train input-output size : (273552, 16)\n", "train output-output size : (273552, 16)\n" ] } ], "source": [ "s=273552\n", "input_tensor_train=input_tensor_train[:s]#269280\n", "decoder_input_target_tensor_train=decoder_input_target_tensor_train[:s]\n", "decoder_output_target_tensor_train=decoder_output_target_tensor_train[:s]\n", "\n", "\n", "print('train input size : ',input_tensor_train.shape)\n", "print('train 
input-output size : ',decoder_input_target_tensor_train.shape)\n", "print('train output-output size : ',decoder_output_target_tensor_train.shape)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YcbqO9VPOM9l", "outputId": "47c421f9-6b43-4ea0-dd39-a18c0c8a18bc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train input size : (60048, 16)\n", "train input-output size : (60048, 16)\n", "train output-output size : (60048, 16)\n" ] } ], "source": [ "p=60048\n", "\n", "input_tensor_val=input_tensor_val[:p]#67296\n", "decoder_input_target_tensor_val=decoder_input_target_tensor_val[:p]\n", "decoder_output_target_tensor_val=decoder_output_target_tensor_val[:p]\n", "\n", "print('train input size : ',input_tensor_val.shape)\n", "print('train input-output size : ',decoder_input_target_tensor_val.shape)\n", "print('train output-output size : ',decoder_output_target_tensor_val.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "g8Fmp66i54FB" }, "source": [ "## ENCODER" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "J7oPqwi4thxi" }, "outputs": [], "source": [ "class Encoder(tf.keras.Model):\n", "\n", " def __init__(self,vocab_size,embedding_size,lstm_size,input_length):\n", " super().__init__()\n", " self.vocab_size = vocab_size\n", " self.embedding_size = embedding_size\n", " self.input_length = input_length\n", " self.lstm_size= lstm_size\n", " self.lstm_output = 0\n", " self.state_h=0\n", " self.state_c=0\n", " self.embedding = Embedding(input_dim=self.vocab_size, output_dim=self.embedding_size, input_length=self.input_length) \n", " self.lstm = LSTM(self.lstm_size, return_state=True, return_sequences=True, name=\"Encoder_LSTM\")\n", "\n", " def call(self,input_sequence,states):\n", " input_embedd = self.embedding(input_sequence)\n", " self.lstm_output, self.lstm_state_h,self.lstm_state_c = self.lstm(input_embedd,initial_state = states)\n", " 
return self.lstm_output, self.lstm_state_h,self.lstm_state_c\n", " \n", " def initialize_states(self,batch_size):\n", " return tf.zeros((batch_size, self.lstm_size)),tf.zeros((batch_size, self.lstm_size)) " ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cPbb4ajLth1b", "outputId": "8a4175ca-3115-461e-9d5f-267f30a0bed2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "def grader_check_encoder():\n", " vocab_size=12\n", " embedding_size=20\n", " lstm_size=32\n", " input_length=8\n", " batch_size=16\n", " encoder=Encoder(vocab_size,embedding_size,lstm_size,input_length)\n", " input_sequence=tf.random.uniform(shape=[batch_size,input_length],maxval=vocab_size,minval=0,dtype=tf.int32)\n", " initial_state=encoder.initialize_states(batch_size)\n", " print\n", " encoder_output,state_h,state_c=encoder(input_sequence,initial_state)\n", " \n", " assert(encoder_output.shape==(batch_size,input_length,lstm_size) and state_h.shape==(batch_size,lstm_size) and state_c.shape==(batch_size,lstm_size))\n", " return True\n", "print(grader_check_encoder())" ] }, { "cell_type": "markdown", "metadata": { "id": "WdS6h_jC58VJ" }, "source": [ "## ATTENTION" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "f9vMVeektiJD" }, "outputs": [], "source": [ "class Attention(tf.keras.Model):\n", " \n", " def __init__(self,scoring_function, att_units):\n", " super().__init__()\n", " self.scoring_function=scoring_function\n", " self.att_units=att_units\n", " self.softmax=Softmax()\n", " if self.scoring_function=='dot':\n", " pass\n", " \n", " elif scoring_function == 'concat':\n", " self.W = tf.keras.layers.Dense(att_units,activation='relu',kernel_initializer='he_uniform')\n", " self.V = tf.keras.layers.Dense(1)\n", " \n", " def call(self,decoder_hidden_state,encoder_output):\n", " '''\n", " Attention mechanism takes two inputs current step -- 
decoder_hidden_state and all the encoder_outputs.\n", " * Based on the scoring function we will find the score or similarity between decoder_hidden_state and encoder_output.\n", " Multiply the score function with your encoder_outputs to get the context vector.\n", " Function returns context vector and attention weights(softmax - scores)\n", " '''\n", " \n", " if self.scoring_function == 'dot':\n", " state_h=decoder_hidden_state\n", " state= tf.expand_dims(state_h, 1)\n", " prob=[]\n", " for i in range(encoder_output.shape[0]):\n", " eo=tf.transpose(encoder_output[i])\n", " dot=tf.matmul(state[i],eo)\n", " soft_out=self.softmax(dot[0])\n", " prob.append(soft_out)\n", " \n", " attention_weights=tf.reshape(tf.convert_to_tensor(prob),(encoder_output.shape[0],encoder_output.shape[1],1))\n", " context_vector=attention_weights * encoder_output\n", " context_vector = tf.reduce_sum(context_vector, axis=1)\n", " return context_vector,attention_weights\n", "\n", " \n", " elif self.scoring_function == 'concat':\n", " state = tf.expand_dims(decoder_hidden_state, 1)\n", " state= tf.tile(state,[1,encoder_output.shape[1],1])\n", " score=self.V(tf.nn.tanh(self.W(tf.concat([encoder_output,state],axis=-1))))\n", " score=tf.transpose(score,[0,2,1])\n", " attention_weights = tf.nn.softmax(score,axis=2)\n", " context_vector = tf.matmul(attention_weights , encoder_output)\n", " context_vector=tf.reshape(context_vector,shape=(context_vector.shape[0],context_vector.shape[2]))\n", " attention_weights=tf.reshape(attention_weights,shape=(attention_weights.shape[0],attention_weights.shape[2],attention_weights.shape[1]))\n", " return context_vector, attention_weights" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zieQLhsmlCoM", "outputId": "643aaed3-a2b8-49f7-c0a4-97446f6b3f2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n" ] } ], "source": [ "def 
grader_check_attention(scoring_fun):\n", " \n", " input_length=10\n", " batch_size=16\n", " att_units=32\n", " state_h=tf.random.uniform(shape=[batch_size,att_units])\n", " encoder_output=tf.random.uniform(shape=[batch_size,input_length,att_units])\n", " attention=Attention(scoring_fun,att_units)\n", " context_vector,attention_weights=attention(state_h,encoder_output)\n", " assert(context_vector.shape==(batch_size,att_units) and attention_weights.shape==(batch_size,input_length,1))\n", " return True\n", "print(grader_check_attention('dot'))\n", "print(grader_check_attention('concat'))" ] }, { "cell_type": "markdown", "metadata": { "id": "7rWr9QJa-oR6" }, "source": [ "## ONE STEP DECODER" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "bWPrctVYONrA" }, "outputs": [], "source": [ "class One_Step_Decoder(tf.keras.Model):\n", " def __init__(self,tar_vocab_size, embedding_dim, input_length, dec_units ,score_fun ,att_units):\n", " # Initialize decoder embedding layer, LSTM and any other objects needed\n", " super().__init__()\n", " self.tar_vocab_size = tar_vocab_size\n", " self.embedding_dim = embedding_dim\n", " self.input_length = input_length\n", " self.dec_units= dec_units\n", " self.score_fun = score_fun\n", " self.att_units=att_units\n", " self.attention=Attention(score_fun,att_units)\n", " self.softmax=Softmax()\n", " self.dense=Dense(self.tar_vocab_size)\n", " self.embedding = Embedding(input_dim=self.tar_vocab_size, output_dim=self.embedding_dim, \n", " input_length=1)\n", " self.lstm = LSTM(self.dec_units, return_state=True, return_sequences=True, name=\"Encoder_LSTM\")\n", "\n", " def call(self,input_to_decoder, encoder_output, state_h,state_c):\n", "\n", " #A\n", " emb=self.embedding(input_to_decoder)\n", " #B\n", " context_vector,attention_weights=self.attention(state_h,encoder_output)\n", " context_vector=tf.expand_dims(context_vector,1)\n", " #C\n", " con=Concatenate()([emb,context_vector])\n", " #D\n", " 
decoder_out,hidden_state,cell_state=self.lstm(con,initial_state = [state_h,state_c])\n", " dense_out=self.dense(decoder_out)\n", " \n", " return tf.reshape(dense_out,(dense_out.shape[0],dense_out.shape[2])),hidden_state,cell_state,attention_weights,tf.reshape(context_vector,(context_vector.shape[0],context_vector.shape[2]))\n", " \n", " #One step decoder mechanisim step by step:\n", " #A. Pass the input_to_decoder to the embedding layer and then get the output(1,1,embedding_dim)\n", " #B. Using the encoder_output and decoder hidden state, compute the context vector.\n", " #C. Concat the context vector with the step A output\n", " #D. Pass the Step-C output to LSTM/GRU and get the decoder output and states(hidden and cell state)\n", " #E. Pass the decoder output to dense layer(vocab size) and store the result into output.\n", " #F. Return the states from step D, output from Step E, attention weights from Step -B\n", " " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1ZJd9looONnj", "outputId": "ed28361f-78c5-4294-fa1a-6a38be3a24c1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n" ] } ], "source": [ "def grader_onestepdecoder(score_fun):\n", " vocab_size=13 \n", " embedding_dim=12 \n", " input_length=10\n", " dec_units=16 \n", " att_units=16\n", " batch_size=32\n", " onestepdecoder=One_Step_Decoder(vocab_size, embedding_dim, input_length, dec_units ,score_fun ,att_units)\n", " input_to_decoder=tf.random.uniform(shape=(batch_size,1),maxval=10,minval=0,dtype=tf.int32)\n", " encoder_output=tf.random.uniform(shape=[batch_size,input_length,dec_units])\n", " state_h=tf.random.uniform(shape=[batch_size,dec_units])\n", " state_c=tf.random.uniform(shape=[batch_size,dec_units])\n", " output,state_h,state_c,attention_weights,context_vector=onestepdecoder(input_to_decoder,encoder_output,state_h,state_c)\n", " assert(output.shape==(batch_size,vocab_size))\n", " 
assert(state_h.shape==(batch_size,dec_units))\n", " assert(state_c.shape==(batch_size,dec_units))\n", " assert(attention_weights.shape==(batch_size,input_length,1))\n", " assert(context_vector.shape==(batch_size,dec_units))\n", " \n", " return True\n", " \n", "\n", "print(grader_onestepdecoder('dot'))\n", "print(grader_onestepdecoder('concat')) " ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "CM61UZ2_5iq1" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "_TDJ4T7dFUMZ" }, "source": [ "## DECODER" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "pxGuuj4Jo6t9" }, "outputs": [], "source": [ "class Decoder(tf.keras.Model):\n", " def __init__(self,out_vocab_size, embedding_dim, output_length, dec_units ,score_fun ,att_units,input_length):\n", " super().__init__()\n", " self.onestepDecoder=One_Step_Decoder(out_vocab_size, embedding_dim, input_length, dec_units ,score_fun ,att_units)\n", " \n", "\n", " def call(self, input_to_decoder,encoder_output,decoder_hidden_state,decoder_cell_state ):\n", " #Initialize an empty Tensor array, that will store the outputs at each and every time step\n", " all_outputs=tf.TensorArray(tf.float32,size=tf.shape(input_to_decoder)[1], name='output_array')\n", " \n", " for timestep in range(0,tf.shape(input_to_decoder)[1]):\n", " output,decoder_hidden_state,decoder_cell_state,_,_=self.onestepDecoder(input_to_decoder[:,timestep:timestep+1],encoder_output,decoder_hidden_state,decoder_cell_state)\n", " #storing the one step decoder outputs to the tensor array\n", " \n", " all_outputs=all_outputs.write(timestep,output)\n", " \n", " all_outputs=tf.transpose(all_outputs.stack(), [1,0,2])\n", " return all_outputs" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8VnI2hJWFKfv", "outputId": "4bd78e80-5a00-487b-a835-6f20f20d04a3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"True\n", "True\n" ] } ], "source": [ "def grader_decoder(score_fun):\n", " out_vocab_size=13 \n", " embedding_dim=12 \n", " input_length=10\n", " output_length=11\n", " dec_units=16 \n", " att_units=16\n", " batch_size=32\n", " \n", " target_sentences=tf.random.uniform(shape=(batch_size,output_length),maxval=10,minval=0,dtype=tf.int32)\n", " encoder_output=tf.random.uniform(shape=[batch_size,input_length,dec_units])\n", " state_h=tf.random.uniform(shape=[batch_size,dec_units])\n", " state_c=tf.random.uniform(shape=[batch_size,dec_units]) \n", " decoder=Decoder(out_vocab_size, embedding_dim, output_length, dec_units ,score_fun ,att_units,input_length)\n", " output=decoder(target_sentences,encoder_output, state_h,state_c)\n", " assert(output.shape==(batch_size,output_length,out_vocab_size))#(32,11,13)\n", " \n", " return True\n", "print(grader_decoder('dot'))\n", "print(grader_decoder('concat'))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "id": "nVXqPxkHqMrI" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "LBwFiOaa4cnX" }, "source": [ "## LOSS" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "id": "eCTHv1kW4eRB" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam()\n", "loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n", " from_logits=True, reduction='none')\n", "\n", "def loss_function(real, pred):\n", " mask = tf.math.logical_not(tf.math.equal(real, 0))\n", " loss_ = loss_object(real, pred)\n", "\n", " mask = tf.cast(mask, dtype=loss_.dtype)\n", " loss_ *= mask\n", "\n", " return tf.reduce_mean(loss_)" ] }, { "cell_type": "markdown", "metadata": { "id": "k_q2WsAm4_l_" }, "source": [ "## ENCODER_DECODER" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "id": "5PewG0bpONdX" }, "outputs": [], "source": [ "class encoder_decoder(tf.keras.Model):\n", " def 
__init__(self,vocab_inp_size,embedding_size,lstm_units,input_length,batch_size,vocab_tar_size,output_length,scoring_fun):\n", " super().__init__()\n", " #Intialize objects from encoder decoder\n", " #1\n", " self.encoder=Encoder(vocab_inp_size,embedding_size,lstm_units,input_length)\n", " self.initial_state=self.encoder.initialize_states(batch_size)\n", " #2\n", " self.decoder=Decoder(vocab_tar_size, embedding_size, output_length, lstm_units ,scoring_fun ,lstm_units,input_length)\n", "\n", "\n", " def call(self,input):\n", " input_sequence=input[0]\n", " target_sentences=input[1]\n", " encoder_output,state_h,state_c=self.encoder(input_sequence,self.initial_state) \n", " \n", " output=self.decoder(target_sentences,encoder_output, state_h,state_c)\n", " return output" ] }, { "cell_type": "markdown", "metadata": { "id": "hyaCL7Tct9qd" }, "source": [ "#### FUNCTION FOR PREDICTION" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "id": "6HwEwUXHe_Wb" }, "outputs": [], "source": [ "max_length_inp=input_tensor.shape[1]\n", "max_length_targ=target_tensor.shape[1]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "id": "l8Hn4jFJlm6u" }, "outputs": [], "source": [ "def preprocess_sentence(data):\n", " u=data.lower()\n", " u=re.sub(r\"'m\", ' am', u)\n", " u=re.sub(r\"'ll\", ' will', u)\n", " u=re.sub(r\"'d\", ' had', u)\n", " u=re.sub(r\"'s\", ' is', u)\n", " u=re.sub(r\"'ve\", ' have', u)\n", " u=re.sub(r\"'re\", ' are', u)\n", " u=re.sub(r\"won't\", 'would not', u)\n", " u=re.sub(r\"can't\", 'can not', u)\n", " u=re.sub(r\"o'clock\", '', u)\n", " u=re.sub(r\"n't\", ' not ', u)#\"haven't\", ' don't\n", " u=re.sub(r\"([?.!,¿])\", r\" \\1 \", u)\n", " #u=re.sub(r'[^a-zA_Z0-9]',' ',u)\n", " \n", " q= re.sub(r\"[^a-zA-Z?.!,¿]+\", \" \", u)\n", " sen='<sos> ' + q + '<eos>'\n", " return sen" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "0uJrunRMlRaF" }, "outputs": [], "source": [ "def evaluate(sentence):\n", " 
attention_plot = np.zeros((max_length_targ, max_length_inp))\n", "\n", " sentence = preprocess_sentence(sentence)\n", "\n", " inputs = [inp_lang_tokenizer.word_index[i] for i in sentence.split(' ')]\n", " inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],\n", " maxlen=max_length_inp,\n", " padding='post')\n", " inputs = tf.convert_to_tensor(inputs)\n", " result = ''\n", " enc_hidden = encoder.initialize_states(1)\n", " enc_output, enc_hidden,enc_cell = model.layers[0](inputs, enc_hidden)\n", " dec_hidden = enc_hidden\n", " dec_cell=enc_cell\n", " dec_input = tf.expand_dims([targ_lang_tokenizer.word_index['<sos>']], 0)\n", "\n", " for t in range(max_length_targ):\n", " predictions, dec_hidden, dec_cell,attention_weights,context_vector = model.layers[1].onestepDecoder(dec_input,enc_output,dec_hidden,dec_cell,training=False)\n", " # storing the attention weights to plot later on\n", " attention_weights = tf.reshape(attention_weights, (-1, ))\n", " attention_plot[t] = attention_weights.numpy()\n", " predicted_id = tf.argmax(predictions[0]).numpy()\n", "\n", " result += targ_lang_tokenizer.index_word[predicted_id] + ' '\n", "\n", " if targ_lang_tokenizer.index_word[predicted_id] == '<eos>':\n", " return result, sentence, attention_plot\n", "\n", " # the predicted ID is fed back into the model\n", " dec_input = tf.expand_dims([predicted_id], 0)\n", "\n", " return result, sentence, attention_plot\n", "\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "4N56tLO6lRT6" }, "outputs": [], "source": [ "def plot_attention(attention, sentence, predicted_sentence):\n", " fig = plt.figure(figsize=(10,10))\n", " ax = fig.add_subplot(1, 1, 1)\n", " ax.matshow(attention, cmap='gray')\n", "\n", " fontdict = {'fontsize': 14}\n", "\n", " ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)\n", " ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)\n", "\n", " ax.xaxis.set_major_locator(ticker.MultipleLocator(1))\n", " 
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))\n", "\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "id": "6Dhn7C42lRLK" }, "outputs": [], "source": [ "def translate(sentence):\n", " result, sentence, attention_plot = evaluate(sentence)\n", "\n", " print('Input: %s' % (sentence))\n", " print('Predicted translation: {}'.format(result))\n", "\n", " attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]\n", " plot_attention(attention_plot, sentence.split(' '), result.split(' '))" ] }, { "cell_type": "markdown", "metadata": { "id": "YtD94W_MuHBU" }, "source": [ "#### FUNCTION FOR BLEU SCORE" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "id": "uVIavA-2nv6I" }, "outputs": [], "source": [ "def tok2word1(data,tokenizer):\n", " a=''\n", " for i in data:\n", " \n", " if tokenizer.index_word[i]=='<eos>':\n", " break\n", " \n", " a=a+' '+tokenizer.index_word[i]\n", " a=a.split('<sos>')[1][1:]+' '\n", " return a\n", "def tok2word2(data,tokenizer):\n", " a=''\n", " for i in data:\n", " if tokenizer.index_word[i]=='<eos>':\n", " break\n", "\n", " a=a+' '+tokenizer.index_word[i]\n", " return a\n", "index=random.sample(range(0,input_tensor_val.shape[0]),1000)\n", "\n", "def bleu_score(input_val,target_val):\n", " score=0\n", " for i in index: \n", " inn=input_val[i]\n", " out=target_val[i]\n", " in_sen=tok2word1(inn,inp_lang_tokenizer)\n", " out_sen=tok2word2(out,targ_lang_tokenizer)\n", " ref=[out_sen.split(),]\n", " translation,_,_ = evaluate(in_sen)\n", " trans=translation.split()[:-1]\n", " res=bleu.sentence_bleu(ref, trans,)\n", " score=score+res\n", " #average over the sampled validation sentences\n", " score=score/len(index)\n", " print('avg. 
bleu score : ',score)" ] }, { "cell_type": "markdown", "metadata": { "id": "4HN9md7ne_5U" }, "source": [ "# DOT" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "id": "HDcvMcy3fCX6" }, "outputs": [], "source": [ "vocab_inp_size = len(inp_lang_tokenizer.word_index)+1\n", "vocab_tar_size = len(targ_lang_tokenizer.word_index)+1\n", "embedding_size=378\n", "lstm_units=470\n", "input_length=input_tensor.shape[1]\n", "output_length=decoder_input_target_tensor_train.shape[1]\n", "batch_size=48\n", "\n", "encoder=Encoder(vocab_inp_size,embedding_size,lstm_units,input_length)\n", "initial_state=encoder.initialize_states(batch_size)\n", "\n", "scoring_fun='dot'" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "id": "pKCJCcP_fCOh" }, "outputs": [], "source": [ "model = encoder_decoder(vocab_inp_size,embedding_size,lstm_units,input_length,batch_size,vocab_tar_size,output_length,scoring_fun)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "fdHdaNnmfCGB" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam()\n", "\n", "model.compile(optimizer=optimizer,loss=loss_function)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "s5OKFb38afVy" }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3ExO84DMfQUM", "outputId": "3be77b89-ed3b-44de-80ac-0995c721b563" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/6\n", "5699/5699 [==============================] - 1624s 283ms/step - loss: 1.6256 - val_loss: 0.5345\n", "Epoch 2/6\n", "5699/5699 [==============================] - 1617s 284ms/step - loss: 0.4072 - val_loss: 0.2864\n", "Epoch 3/6\n", "5699/5699 [==============================] - 1639s 288ms/step - loss: 0.1912 - val_loss: 0.2244\n", "Epoch 4/6\n", "5699/5699 [==============================] - 1613s 283ms/step - loss: 0.1168 - val_loss: 0.1998\n", "Epoch 
5/6\n", "5699/5699 [==============================] - 1612s 283ms/step - loss: 0.0825 - val_loss: 0.1916\n", "Epoch 6/6\n", "5699/5699 [==============================] - 1607s 282ms/step - loss: 0.0638 - val_loss: 0.1892\n" ] }, { "data": { "text/plain": [ "<tensorflow.python.keras.callbacks.History at 0x7f9091b33940>" ] }, "execution_count": 46, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "model.fit([input_tensor_train,decoder_input_target_tensor_train], decoder_output_target_tensor_train,epochs=6,batch_size=48,validation_data=([input_tensor_val,decoder_input_target_tensor_val], decoder_output_target_tensor_val))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "gT5xgv-FYfph" }, "outputs": [], "source": [ "model.save_weights(\"/content/drive/My Drive/model_dot_1/dot_pos2.hdf5\")" ] }, { "cell_type": "markdown", "metadata": { "id": "ypBvNNd6oOrj" }, "source": [ "##### Model load" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "lxmUqrIwTRA0" }, "outputs": [], "source": [ "model.load_weights(\"/content/drive/My Drive/model_dot_1/dot_pos2.hdf5\")" ] }, { "cell_type": "markdown", "metadata": { "id": "7tAr-zipmhbw" }, "source": [ "##### TRANSLATION" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 684 }, "id": "DliiOdsGfQFB", "outputId": "fc63aff0-ddd7-4099-a757-48df34ec05a6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> amo la mela <eos>\n", "Predicted translation: i love the apple . 
<eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : i love apple\n" ] } ], "source": [ "translate(u'amo la mela ')\n", "print('Actual eng sentence : i love apple')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 711 }, "id": "qLvQa3gyfP-i", "outputId": "94515393-43bb-4ab8-dc67-6f7d5db5f4ce" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> non posso rispondere alla tua domanda <eos>\n", "Predicted translation: i can not answer your question . <eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : i can not answer your question \n" ] } ], "source": [ "translate('non posso rispondere alla tua domanda ') \n", "print('Actual eng sentence : i can not answer your question ')" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 715 }, "id": "26AHA59EfP3i", "outputId": "d630655d-c2d4-4114-f6f4-bc1a27faec61" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> tom non sembrava essere molto interessato alla scuola <eos>\n", "Predicted translation: tom did not seem to be very interested in school ? 
<eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : tom did not seem to be very interested in school \n" ] } ], "source": [ "translate(u'tom non sembrava essere molto interessato alla scuola ')\n", "print('Actual eng sentence : tom did not seem to be very interested in school ')" ] }, { "cell_type": "markdown", "metadata": { "id": "oJeaCCLenpWL" }, "source": [ "###### SCORE" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1kTq59ujnod5", "outputId": "a565cf13-fff5-41ce-8614-a92ab9f984a3" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 2-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n", "/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 4-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n", "/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 3-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "avg. 
bleu score : 0.8502854061401058\n" ] } ], "source": [ "bleu_score(input_tensor_val,decoder_output_target_tensor_val)" ] }, { "cell_type": "markdown", "metadata": { "id": "Fl1peAwdh9uX" }, "source": [ "# CONCAT" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "A8FLRcGJPSV7" }, "outputs": [], "source": [ "vocab_inp_size = len(inp_lang_tokenizer.word_index)+1\n", "vocab_tar_size = len(targ_lang_tokenizer.word_index)+1\n", "embedding_size=378\n", "lstm_units=470\n", "input_length=input_tensor.shape[1]\n", "output_length=decoder_input_target_tensor_train.shape[1]\n", "batch_size=48\n", "\n", "steps_per_epoch = ((len(input_tensor_train)+1)//batch_size)+1\n", "scoring_fun='concat'\n", "\n", "encoder=Encoder(vocab_inp_size,embedding_size,lstm_units,input_length)\n", "initial_state=encoder.initialize_states(batch_size)\n", "\n", "model = encoder_decoder(vocab_inp_size,embedding_size,lstm_units,input_length,batch_size,vocab_tar_size,output_length,scoring_fun)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "_UNiZux2Oxmx" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam()\n", "model.compile(optimizer=optimizer,loss=loss_function)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "9MWdH1kGE8zk" }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "sYsbzMTfpphn" }, "outputs": [], "source": [ "model.fit([input_tensor_train,decoder_input_target_tensor_train], decoder_output_target_tensor_train,epochs=6,batch_size=48,validation_data=([input_tensor_val,decoder_input_target_tensor_val], decoder_output_target_tensor_val))" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "id": "5-stihW-py06" }, "outputs": [], "source": [ "model.load_weights('/content/drive/My Drive/model_concat_1/con_pos2.hdf5')" ] }, { "cell_type": "markdown", "metadata": { "id": "lkrQp-t-oWnx" }, "source": [ "##### Model Load" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "wMfES1dmKyiT" }, "outputs": [], "source": [ "model.load_weights(\"/content/drive/My Drive/model_concat_1/con_pos2.hdf5\")" ] }, { "cell_type": "markdown", "metadata": { "id": "eN48oyP4pBf9" }, "source": [ "##### TRANSLATION" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 684 }, "id": "P1udJQKv8SjW", "outputId": "9f5d826c-0faf-4fec-d482-d555cadf4d9d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> amo la mela . <eos>\n", "Predicted translation: i love apple . <eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : i love apple\n" ] } ], "source": [ "translate(u'amo la mela .')\n", "print('Actual eng sentence : i love apple')" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 711 }, "id": "2OGwntnoNMAF", "outputId": "71b96121-02fa-4ae8-a4d5-7c8c94eeae39" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> non posso rispondere alla tua domanda . <eos>\n", "Predicted translation: i can not answer your question . 
<eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : i can not answer your question \n" ] } ], "source": [ "translate('non posso rispondere alla tua domanda .') \n", "print('Actual eng sentence : i can not answer your question ')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 715 }, "id": "scT5gX638SEQ", "outputId": "2527da42-0757-4ad1-f1b2-e95e192212a8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input: <sos> tom non sembrava essere molto interessato alla scuola . <eos>\n", "Predicted translation: tom did not seem to be very interested in school . <eos> \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Actual eng sentence : tom did not seem to be very interested in school .\n" ] } ], "source": [ "translate(u'tom non sembrava essere molto interessato alla scuola .')\n", "print('Actual eng sentence : tom did not seem to be very interested in school .')" ] }, { "cell_type": "markdown", "metadata": { "id": "pzkoQfT2i40w" }, "source": [ "###### SCORE" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lEvd7xEi8R0i", "outputId": "c4a96712-5819-4173-e4df-0636f53d5788" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 4-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n", 
"/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 3-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n", "/usr/local/lib/python3.6/dist-packages/nltk/translate/bleu_score.py:490: UserWarning: \n", "Corpus/Sentence contains 0 counts of 2-gram overlaps.\n", "BLEU scores might be undesirable; use SmoothingFunction().\n", " warnings.warn(_msg)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "avg. bleu score : 0.8492603959084802\n" ] } ], "source": [ "bleu_score(input_tensor_val,decoder_output_target_tensor_val)" ] }, { "cell_type": "markdown", "metadata": { "id": "X1RaW5C4wB-J" }, "source": [ "# OBSERVATION\n", "1. Best bleu score which i am getting is 0.85 from model dot.\n", "2. By both the score we are getting same answer only.\n", "3. Now if i have to select best model i will go to concat model because it's score is almost same but it's word dependence is better than dot.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1gNQJk9R8RuV" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "Copy of seq_2_copy_seq.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }