{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "87514413",
   "metadata": {},
   "source": [
    "# Code Java to C# using T5\n",
    "\n",
    "This tutorial contains complete code to fine-tune T5 to perform Seq2Seq on CodexGLUE Code to Code dataset. \n",
    "In addition to training a model, you will learn how to preprocess text into an appropriate format.\n",
    "\n",
    "In this notebook, you will:\n",
    "\n",
    "- Load the CodexGLUE code to code dataset from HuggingFace\n",
    "- Load T5 Model using tf-transformers\n",
    "- Build train and validation dataset (on the fly) feature preparation using\n",
    "tokenizer from tf-transformers.\n",
    "- Train your own model, fine-tuning T5 as part of that\n",
    "- Evaluate BLEU on the generated text \n",
    "- Save your model and use it to convert Java to C# sentences\n",
    "- Use the end-to-end (preprocessing + inference) in production setup\n",
    "\n",
    "If you're new to working with the CodexGLUE dataset, please see [CodexGLUE](https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a63478b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a0566ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tf-transformers\n",
    "\n",
    "!pip install sentencepiece\n",
    "\n",
    "!pip install tensorflow-text\n",
    "\n",
    "!pip install transformers\n",
    "\n",
    "!pip install wandb\n",
    "\n",
    "!pip install datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2df24a1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "104f0611",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensorflow version 2.7.0\n",
      "Tensorflow text version 2.7.0\n",
      "Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Supper TF warnings\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow_text as tf_text\n",
    "import datasets\n",
    "import tqdm\n",
    "import wandb\n",
    "\n",
    "print(\"Tensorflow version\", tf.__version__)\n",
    "print(\"Tensorflow text version\", tf_text.__version__)\n",
    "print(\"Devices\", tf.config.list_physical_devices())\n",
    "\n",
    "from tf_transformers.models import T5Model, T5TokenizerTFText\n",
    "from tf_transformers.core import Trainer\n",
    "from tf_transformers.optimization import create_optimizer\n",
    "from tf_transformers.losses import cross_entropy_loss_label_smoothing\n",
    "from tf_transformers.text import TextDecoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e480767d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e1467206",
   "metadata": {},
   "source": [
    "### Load Model, Optimizer , Trainer\n",
    "\n",
    "Our Trainer expects ```model```, ```optimizer``` and ```loss``` to be a function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "aa7b868c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load Model\n",
    "def get_model(model_name, is_training, use_dropout):\n",
    "  \"\"\"Get Model\"\"\"\n",
    "\n",
    "  def model_fn():\n",
    "    model = T5Model.from_pretrained(model_name)\n",
    "    return model\n",
    "  return model_fn\n",
    "\n",
    "# Load Optimizer\n",
    "def get_optimizer(learning_rate, examples, batch_size, epochs, use_constant_lr=False):\n",
    "    \"\"\"Get optimizer\"\"\"\n",
    "    steps_per_epoch = int(examples / batch_size)\n",
    "    num_train_steps = steps_per_epoch * epochs\n",
    "    warmup_steps = int(0.1 * num_train_steps)\n",
    "\n",
    "    def optimizer_fn():\n",
    "        optimizer, learning_rate_fn = create_optimizer(learning_rate, num_train_steps, warmup_steps, use_constant_lr=use_constant_lr)\n",
    "        return optimizer\n",
    "\n",
    "    return optimizer_fn\n",
    "\n",
    "# Load trainer\n",
    "def get_trainer(distribution_strategy, num_gpus=0, tpu_address=None):\n",
    "    \"\"\"Get Trainer\"\"\"\n",
    "    trainer = Trainer(distribution_strategy, num_gpus=num_gpus, tpu_address=tpu_address)\n",
    "    return trainer\n",
    "\n",
    "# Load loss\n",
    "def loss_fn(y_true_dict, y_pred_dict, smoothing=0.1):\n",
    "    \n",
    "    loss = cross_entropy_loss_label_smoothing(labels=y_true_dict['labels'], \n",
    "                                   logits=y_pred_dict['token_logits'],\n",
    "                                   smoothing=smoothing,\n",
    "                                      label_weights=y_true_dict['labels_mask'])\n",
    "    return {'loss': loss}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5f241ed",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "86e07638",
   "metadata": {},
   "source": [
    "### Prepare Data for Training\n",
    "\n",
    "We will make use of ```Tensorflow Text``` based tokenizer to do ```on-the-fly``` preprocessing, without having any\n",
    "overhead of pre prepapre the data in the form of ```pickle```, ```numpy``` or ```tfrecords```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "cf39021c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dataset\n",
    "def load_dataset(dataset, tokenizer_layer, max_seq_len, batch_size, drop_remainder):\n",
    "    \"\"\"\n",
    "    Args:\n",
    "      dataset; HuggingFace dataset\n",
    "      tokenizer_layer: tf-transformers tokenizer\n",
    "      max_seq_len: int (maximum sequence length of text)\n",
    "      batch_size: int (batch_size)\n",
    "      drop_remainder: bool (to drop remaining batch_size, when its uneven)\n",
    "    \"\"\"\n",
    "    def parse(item):\n",
    "        # Encoder inputs\n",
    "        encoder_input_ids = tokenizer_layer({'text': [item['java']]})\n",
    "        encoder_input_ids = encoder_input_ids.merge_dims(-2, 1)\n",
    "        encoder_input_ids = encoder_input_ids[:max_seq_len-1]\n",
    "        encoder_input_ids = tf.concat([encoder_input_ids, [tokenizer_layer.eos_token_id]], axis=0)\n",
    "\n",
    "        # Decoder inputs\n",
    "        decoder_input_ids = tokenizer_layer({'text': [item['cs']]})\n",
    "        decoder_input_ids = decoder_input_ids.merge_dims(-2, 1)\n",
    "        decoder_input_ids = decoder_input_ids[:max_seq_len-2]\n",
    "        decoder_input_ids = tf.concat([[tokenizer_layer.pad_token_id] , decoder_input_ids, [tokenizer_layer.eos_token_id]], axis=0)\n",
    "\n",
    "\n",
    "        encoder_input_mask = tf.ones_like(encoder_input_ids)\n",
    "        labels = decoder_input_ids[1:]\n",
    "        labels_mask = tf.ones_like(labels)\n",
    "        decoder_input_ids = decoder_input_ids[:-1]\n",
    "\n",
    "        result = {}\n",
    "        result['encoder_input_ids'] = encoder_input_ids\n",
    "        result['encoder_input_mask'] = encoder_input_mask\n",
    "        result['decoder_input_ids'] = decoder_input_ids\n",
    "\n",
    "        labels_dict = {}\n",
    "        labels_dict['labels'] = labels\n",
    "        labels_dict['labels_mask'] = labels_mask\n",
    "        return result, labels_dict\n",
    "\n",
    "    tfds_dict = dataset.to_dict()\n",
    "    tfdataset = tf.data.Dataset.from_tensor_slices(tfds_dict).shuffle(100)\n",
    "\n",
    "    tfdataset = tfdataset.map(parse, num_parallel_calls =tf.data.AUTOTUNE)\n",
    "    tfdataset = tfdataset.padded_batch(batch_size, drop_remainder=drop_remainder)\n",
    "\n",
    "    # Shard\n",
    "    options = tf.data.Options()\n",
    "    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.AUTO\n",
    "    tfdataset = tfdataset.with_options(options)\n",
    "    \n",
    "    return tfdataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c3d44f9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "57e74b03",
   "metadata": {},
   "source": [
    "### Prepare Dataset\n",
    "\n",
    "1. Set necessay hyperparameters.\n",
    "2. Prepare ```train dataset```, ```validation dataset```.\n",
    "3. Load ```model```, ```optimizer```, ```loss``` and ```trainer```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8ad7315",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "b764c89a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:datasets.builder:Reusing dataset code_x_glue_cc_code_to_code_trans (/home/jovyan/.cache/huggingface/datasets/code_x_glue_cc_code_to_code_trans/default/0.0.0/86dd57d2b1e88c6e589646133b76f2fef9d56c82e933d7f276e8a5b60ab18c34)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bab78b75b8534d8f8e5ab85ecf18f10f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:absl:Loading t5-small tokenizer to /tmp/tftransformers_tokenizer_cache/t5-small/spiece.model\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)\n"
     ]
    }
   ],
   "source": [
    "# Data configs\n",
    "dataset_name = 'code_x_glue_cc_code_to_code_trans'\n",
    "model_name  = 't5-small'\n",
    "max_seq_len = 256\n",
    "batch_size  = 32\n",
    "\n",
    "# Model configs\n",
    "learning_rate = 1e-4\n",
    "epochs = 10\n",
    "model_checkpoint_dir = 'MODELS/t5_code_to_code'\n",
    "\n",
    "# Load HF dataset\n",
    "dataset = datasets.load_dataset(dataset_name)\n",
    "# Load tokenizer from tf-transformers\n",
    "tokenizer_layer = T5TokenizerTFText.from_pretrained(model_name)\n",
    "# Train Dataset\n",
    "train_dataset = load_dataset(dataset['train'], tokenizer_layer, max_seq_len, batch_size, drop_remainder=True)\n",
    "# Validation Dataset\n",
    "validation_dataset = load_dataset(dataset['test'], tokenizer_layer, max_seq_len, batch_size, drop_remainder=False)\n",
    "\n",
    "# Total train examples\n",
    "total_train_examples = dataset['train'].num_rows\n",
    "steps_per_epoch = total_train_examples // batch_size\n",
    "\n",
    "# model\n",
    "model_fn =  get_model(model_name, is_training=True, use_dropout=True)\n",
    "# optimizer\n",
    "optimizer_fn = get_optimizer(learning_rate, total_train_examples, batch_size, epochs, use_constant_lr=True)\n",
    "# trainer\n",
    "trainer = get_trainer(distribution_strategy='mirrored', num_gpus=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecd8214c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "89646e44",
   "metadata": {},
   "source": [
    "### Wandb Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0abad147",
   "metadata": {},
   "outputs": [],
   "source": [
    "project = \"TUTORIALS\"\n",
    "display_name = \"t5_code_to_code\"\n",
    "wandb.init(project=project, name=display_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1faa8deb",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "5ad04247",
   "metadata": {},
   "source": [
    "### Train :-)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "bb78ce89",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:absl:Make sure `steps_per_epoch` should be less than or equal to number of batches in dataset.\n",
      "INFO:absl:Policy: ----> float32\n",
      "INFO:absl:Strategy: ---> <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f9d941a1810>\n",
      "INFO:absl:Num GPU Devices: ---> 1\n",
      "INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__t5-small.main.699b12fe9601feda4892ca82c07e800f3c1da440/ckpt-1\n",
      "INFO:absl:Successful ✅: Loaded model from tftransformers/t5-small\n",
      "INFO:absl:Using Constant learning rate\n",
      "INFO:absl:Using Adamw optimizer\n",
      "INFO:absl:No ❌❌ checkpoint found in MODELS/t5_code_to_code\n",
      "Train: Epoch 1/11 --- Step 300/321 --- total examples 6400 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:17<00:00, 45.81s/batch , _runtime=365, _timestamp=1.65e+9, learning_rate=0.0001, loss=2.17]\n",
      "INFO:absl:Model saved at epoch 1 at MODELS/t5_code_to_code/ckpt-1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 2/11 --- Step 300/321 --- total examples 16000 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.96s/batch , _runtime=490, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.926]\n",
      "INFO:absl:Model saved at epoch 2 at MODELS/t5_code_to_code/ckpt-2\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 3/11 --- Step 300/321 --- total examples 25600 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.18s/batch , _runtime=616, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.691]\n",
      "INFO:absl:Model saved at epoch 3 at MODELS/t5_code_to_code/ckpt-3\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 4/11 --- Step 300/321 --- total examples 35200 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.97s/batch , _runtime=742, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.575]\n",
      "INFO:absl:Model saved at epoch 4 at MODELS/t5_code_to_code/ckpt-4\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 5/11 --- Step 300/321 --- total examples 44800 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.68s/batch , _runtime=866, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.51] \n",
      "INFO:absl:Model saved at epoch 5 at MODELS/t5_code_to_code/ckpt-5\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 6/11 --- Step 300/321 --- total examples 54400 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.23s/batch , _runtime=992, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.462]\n",
      "INFO:absl:Model saved at epoch 6 at MODELS/t5_code_to_code/ckpt-6\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 7/11 --- Step 300/321 --- total examples 64000 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.30s/batch , _runtime=1118, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.418]\n",
      "INFO:absl:Model saved at epoch 7 at MODELS/t5_code_to_code/ckpt-7\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 8/11 --- Step 300/321 --- total examples 73600 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.03s/batch , _runtime=1243, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.391]\n",
      "INFO:absl:Model saved at epoch 8 at MODELS/t5_code_to_code/ckpt-8\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 9/11 --- Step 300/321 --- total examples 83200 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.91s/batch , _runtime=1368, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.368]\n",
      "INFO:absl:Model saved at epoch 9 at MODELS/t5_code_to_code/ckpt-9\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Train: Epoch 10/11 --- Step 300/321 --- total examples 92800 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.30s/batch , _runtime=1495, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.348]\n",
      "INFO:absl:Model saved at epoch 10 at MODELS/t5_code_to_code/ckpt-10\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "history = trainer.run(\n",
    "    model_fn=model_fn,\n",
    "    optimizer_fn=optimizer_fn,\n",
    "    train_dataset=train_dataset,\n",
    "    train_loss_fn=loss_fn,\n",
    "    epochs=epochs,\n",
    "    steps_per_epoch=steps_per_epoch,\n",
    "    model_checkpoint_dir=model_checkpoint_dir,\n",
    "    batch_size=batch_size,\n",
    "    wandb=wandb\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dee1f7d1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "4e960eb9",
   "metadata": {},
   "source": [
    "### Load and Serialize Model for Text Generation\n",
    "* 1. Load T5Model with ```use_auto_regressive=True```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e15f5916",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load T5 for Auto Regressive\n",
    "model = T5Model.from_pretrained(model_name, use_auto_regressive=True)\n",
    "# Load from checkpoint dir\n",
    "model.load_checkpoint(model_checkpoint_dir)\n",
    "# Save and serialize\n",
    "model.save_transformers_serialized('{}/saved_model'.format(model_checkpoint_dir), overwrite=True)\n",
    "# Load model\n",
    "loaded = tf.saved_model.load('{}/saved_model'.format(model_checkpoint_dir))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fc2a12a",
   "metadata": {},
   "source": [
    "### Evaluate on Test ( BLEU ) score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "42bc4633",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 32/32 [02:19<00:00,  4.36s/it]\n"
     ]
    }
   ],
   "source": [
    "# Load decoder\n",
    "decoder = TextDecoder(model=loaded)\n",
    "\n",
    "# greedy decoding\n",
    "predicted_text = []\n",
    "original_text  = []\n",
    "for (batch_inputs, batch_labels) in tqdm.tqdm(validation_dataset):\n",
    "    \n",
    "    decoder_input_ids = batch_inputs['decoder_input_ids']\n",
    "    \n",
    "    # While decoding we do not need this, decoder_start_token_id will be automatically taken from saved model\n",
    "    del batch_inputs['decoder_input_ids']\n",
    "    \n",
    "    predictions = decoder.decode(batch_inputs, \n",
    "                             mode='greedy', \n",
    "                             max_iterations=max_seq_len, \n",
    "                             eos_id=tokenizer_layer.eos_token_id)\n",
    "    # Decode predictions\n",
    "    predicted_text_batch = tokenizer_layer._tokenizer.detokenize(tf.cast(tf.squeeze(predictions['predicted_ids'], axis=1), tf.int32))\n",
    "    predicted_text_batch = [text.numpy().decode() for text in predicted_text_batch]\n",
    "    predicted_text.extend(predicted_text_batch)\n",
    "    \n",
    "    # Decoder original text\n",
    "    original_text_batch = tokenizer_layer._tokenizer.detokenize(decoder_input_ids)\n",
    "    original_text_batch = [text.numpy().decode() for text in original_text_batch]\n",
    "    original_text.extend(original_text_batch)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27a328e9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "d692e726",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BLEU: 40.06566362514778\n"
     ]
    }
   ],
   "source": [
    "import sacrebleu\n",
    "from sacremoses import MosesDetokenizer\n",
    "# Calculate and print the BLEU score\n",
    "bleu = sacrebleu.corpus_bleu(predicted_text, [original_text])\n",
    "print(\"BLEU: {}\".format(bleu.score))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aab43927",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,md:myst"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}