{ "cells": [ { "cell_type": "markdown", "id": "87514413", "metadata": {}, "source": [ "# Code Java to C# using T5\n", "\n", "This tutorial contains complete code to fine-tune T5 to perform Seq2Seq on CodexGLUE Code to Code dataset. \n", "In addition to training a model, you will learn how to preprocess text into an appropriate format.\n", "\n", "In this notebook, you will:\n", "\n", "- Load the CodexGLUE code to code dataset from HuggingFace\n", "- Load T5 Model using tf-transformers\n", "- Build train and validation dataset (on the fly) feature preparation using\n", "tokenizer from tf-transformers.\n", "- Train your own model, fine-tuning T5 as part of that\n", "- Evaluate BLEU on the generated text \n", "- Save your model and use it to convert Java to C# sentences\n", "- Use the end-to-end (preprocessing + inference) in production setup\n", "\n", "If you're new to working with the CodexGLUE dataset, please see [CodexGLUE](https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans) for more details." ] }, { "cell_type": "code", "execution_count": null, "id": "9a63478b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2a0566ac", "metadata": {}, "outputs": [], "source": [ "!pip install tf-transformers\n", "\n", "!pip install sentencepiece\n", "\n", "!pip install tensorflow-text\n", "\n", "!pip install transformers\n", "\n", "!pip install wandb\n", "\n", "!pip install datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "b2df24a1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 52, "id": "104f0611", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensorflow version 2.7.0\n", "Tensorflow text version 2.7.0\n", "Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]\n" ] } ], "source": [ "import os\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Supper TF warnings\n", "\n", "import tensorflow as tf\n", "import tensorflow_text as tf_text\n", "import datasets\n", "import tqdm\n", "import wandb\n", "\n", "print(\"Tensorflow version\", tf.__version__)\n", "print(\"Tensorflow text version\", tf_text.__version__)\n", "print(\"Devices\", tf.config.list_physical_devices())\n", "\n", "from tf_transformers.models import T5Model, T5TokenizerTFText\n", "from tf_transformers.core import Trainer\n", "from tf_transformers.optimization import create_optimizer\n", "from tf_transformers.losses import cross_entropy_loss_label_smoothing\n", "from tf_transformers.text import TextDecoder" ] }, { "cell_type": "code", "execution_count": null, "id": "e480767d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e1467206", "metadata": {}, "source": [ "### Load Model, Optimizer , Trainer\n", "\n", "Our Trainer expects ```model```, ```optimizer``` and ```loss``` to be a function." ] }, { "cell_type": "code", "execution_count": 46, "id": "aa7b868c", "metadata": {}, "outputs": [], "source": [ "# Load Model\n", "def get_model(model_name, is_training, use_dropout):\n", " \"\"\"Get Model\"\"\"\n", "\n", " def model_fn():\n", " model = T5Model.from_pretrained(model_name)\n", " return model\n", " return model_fn\n", "\n", "# Load Optimizer\n", "def get_optimizer(learning_rate, examples, batch_size, epochs, use_constant_lr=False):\n", " \"\"\"Get optimizer\"\"\"\n", " steps_per_epoch = int(examples / batch_size)\n", " num_train_steps = steps_per_epoch * epochs\n", " warmup_steps = int(0.1 * num_train_steps)\n", "\n", " def optimizer_fn():\n", " optimizer, learning_rate_fn = create_optimizer(learning_rate, num_train_steps, warmup_steps, use_constant_lr=use_constant_lr)\n", " return optimizer\n", "\n", " return optimizer_fn\n", "\n", "# Load trainer\n", "def get_trainer(distribution_strategy, num_gpus=0, tpu_address=None):\n", " \"\"\"Get Trainer\"\"\"\n", " trainer = Trainer(distribution_strategy, num_gpus=num_gpus, tpu_address=tpu_address)\n", " return trainer\n", "\n", "# Load loss\n", "def loss_fn(y_true_dict, y_pred_dict, smoothing=0.1):\n", " \n", " loss = cross_entropy_loss_label_smoothing(labels=y_true_dict['labels'], \n", " logits=y_pred_dict['token_logits'],\n", " smoothing=smoothing,\n", " label_weights=y_true_dict['labels_mask'])\n", " return {'loss': loss}" ] }, { "cell_type": "code", "execution_count": null, "id": "f5f241ed", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "86e07638", "metadata": {}, "source": [ "### Prepare Data for Training\n", "\n", "We will make use of ```Tensorflow Text``` based tokenizer to do ```on-the-fly``` preprocessing, without having any\n", "overhead of pre prepapre the data in the form of ```pickle```, ```numpy``` or ```tfrecords```." ] }, { "cell_type": "code", "execution_count": 36, "id": "cf39021c", "metadata": {}, "outputs": [], "source": [ "# Load dataset\n", "def load_dataset(dataset, tokenizer_layer, max_seq_len, batch_size, drop_remainder):\n", " \"\"\"\n", " Args:\n", " dataset; HuggingFace dataset\n", " tokenizer_layer: tf-transformers tokenizer\n", " max_seq_len: int (maximum sequence length of text)\n", " batch_size: int (batch_size)\n", " drop_remainder: bool (to drop remaining batch_size, when its uneven)\n", " \"\"\"\n", " def parse(item):\n", " # Encoder inputs\n", " encoder_input_ids = tokenizer_layer({'text': [item['java']]})\n", " encoder_input_ids = encoder_input_ids.merge_dims(-2, 1)\n", " encoder_input_ids = encoder_input_ids[:max_seq_len-1]\n", " encoder_input_ids = tf.concat([encoder_input_ids, [tokenizer_layer.eos_token_id]], axis=0)\n", "\n", " # Decoder inputs\n", " decoder_input_ids = tokenizer_layer({'text': [item['cs']]})\n", " decoder_input_ids = decoder_input_ids.merge_dims(-2, 1)\n", " decoder_input_ids = decoder_input_ids[:max_seq_len-2]\n", " decoder_input_ids = tf.concat([[tokenizer_layer.pad_token_id] , decoder_input_ids, [tokenizer_layer.eos_token_id]], axis=0)\n", "\n", "\n", " encoder_input_mask = tf.ones_like(encoder_input_ids)\n", " labels = decoder_input_ids[1:]\n", " labels_mask = tf.ones_like(labels)\n", " decoder_input_ids = decoder_input_ids[:-1]\n", "\n", " result = {}\n", " result['encoder_input_ids'] = encoder_input_ids\n", " result['encoder_input_mask'] = encoder_input_mask\n", " result['decoder_input_ids'] = decoder_input_ids\n", "\n", " labels_dict = {}\n", " labels_dict['labels'] = labels\n", " labels_dict['labels_mask'] = labels_mask\n", " return result, labels_dict\n", "\n", " tfds_dict = dataset.to_dict()\n", " tfdataset = tf.data.Dataset.from_tensor_slices(tfds_dict).shuffle(100)\n", "\n", " tfdataset = tfdataset.map(parse, num_parallel_calls =tf.data.AUTOTUNE)\n", " tfdataset = tfdataset.padded_batch(batch_size, drop_remainder=drop_remainder)\n", "\n", " # Shard\n", " options = tf.data.Options()\n", " options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.AUTO\n", " tfdataset = tfdataset.with_options(options)\n", " \n", " return tfdataset" ] }, { "cell_type": "code", "execution_count": null, "id": "9c3d44f9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "57e74b03", "metadata": {}, "source": [ "### Prepare Dataset\n", "\n", "1. Set necessay hyperparameters.\n", "2. Prepare ```train dataset```, ```validation dataset```.\n", "3. Load ```model```, ```optimizer```, ```loss``` and ```trainer```." ] }, { "cell_type": "code", "execution_count": null, "id": "d8ad7315", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 47, "id": "b764c89a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:datasets.builder:Reusing dataset code_x_glue_cc_code_to_code_trans (/home/jovyan/.cache/huggingface/datasets/code_x_glue_cc_code_to_code_trans/default/0.0.0/86dd57d2b1e88c6e589646133b76f2fef9d56c82e933d7f276e8a5b60ab18c34)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bab78b75b8534d8f8e5ab85ecf18f10f", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00 float32\n", "INFO:absl:Strategy: ---> \n", "INFO:absl:Num GPU Devices: ---> 1\n", "INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__t5-small.main.699b12fe9601feda4892ca82c07e800f3c1da440/ckpt-1\n", "INFO:absl:Successful ✅: Loaded model from tftransformers/t5-small\n", "INFO:absl:Using Constant learning rate\n", "INFO:absl:Using Adamw optimizer\n", "INFO:absl:No ❌❌ checkpoint found in MODELS/t5_code_to_code\n", "Train: Epoch 1/11 --- Step 300/321 --- total examples 6400 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:17<00:00, 45.81s/batch , _runtime=365, _timestamp=1.65e+9, learning_rate=0.0001, loss=2.17]\n", "INFO:absl:Model saved at epoch 1 at MODELS/t5_code_to_code/ckpt-1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 2/11 --- Step 300/321 --- total examples 16000 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.96s/batch , _runtime=490, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.926]\n", "INFO:absl:Model saved at epoch 2 at MODELS/t5_code_to_code/ckpt-2\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 3/11 --- Step 300/321 --- total examples 25600 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.18s/batch , _runtime=616, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.691]\n", "INFO:absl:Model saved at epoch 3 at MODELS/t5_code_to_code/ckpt-3\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 4/11 --- Step 300/321 --- total examples 35200 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.97s/batch , _runtime=742, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.575]\n", "INFO:absl:Model saved at epoch 4 at MODELS/t5_code_to_code/ckpt-4\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 5/11 --- Step 300/321 --- total examples 44800 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.68s/batch , _runtime=866, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.51] \n", "INFO:absl:Model saved at epoch 5 at MODELS/t5_code_to_code/ckpt-5\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 6/11 --- Step 300/321 --- total examples 54400 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.23s/batch , _runtime=992, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.462]\n", "INFO:absl:Model saved at epoch 6 at MODELS/t5_code_to_code/ckpt-6\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 7/11 --- Step 300/321 --- total examples 64000 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.30s/batch , _runtime=1118, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.418]\n", "INFO:absl:Model saved at epoch 7 at MODELS/t5_code_to_code/ckpt-7\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 8/11 --- Step 300/321 --- total examples 73600 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.03s/batch , _runtime=1243, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.391]\n", "INFO:absl:Model saved at epoch 8 at MODELS/t5_code_to_code/ckpt-8\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 9/11 --- Step 300/321 --- total examples 83200 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:02<00:00, 40.91s/batch , _runtime=1368, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.368]\n", "INFO:absl:Model saved at epoch 9 at MODELS/t5_code_to_code/ckpt-9\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Train: Epoch 10/11 --- Step 300/321 --- total examples 92800 , trainable variables 132: 100%|\u001b[32m██████████\u001b[0m| 3/3 [02:03<00:00, 41.30s/batch , _runtime=1495, _timestamp=1.65e+9, learning_rate=0.0001, loss=0.348]\n", "INFO:absl:Model saved at epoch 10 at MODELS/t5_code_to_code/ckpt-10\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "history = trainer.run(\n", " model_fn=model_fn,\n", " optimizer_fn=optimizer_fn,\n", " train_dataset=train_dataset,\n", " train_loss_fn=loss_fn,\n", " epochs=epochs,\n", " steps_per_epoch=steps_per_epoch,\n", " model_checkpoint_dir=model_checkpoint_dir,\n", " batch_size=batch_size,\n", " wandb=wandb\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "dee1f7d1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4e960eb9", "metadata": {}, "source": [ "### Load and Serialize Model for Text Generation\n", "* 1. Load T5Model with ```use_auto_regressive=True```" ] }, { "cell_type": "code", "execution_count": null, "id": "e15f5916", "metadata": {}, "outputs": [], "source": [ "# Load T5 for Auto Regressive\n", "model = T5Model.from_pretrained(model_name, use_auto_regressive=True)\n", "# Load from checkpoint dir\n", "model.load_checkpoint(model_checkpoint_dir)\n", "# Save and serialize\n", "model.save_transformers_serialized('{}/saved_model'.format(model_checkpoint_dir), overwrite=True)\n", "# Load model\n", "loaded = tf.saved_model.load('{}/saved_model'.format(model_checkpoint_dir))" ] }, { "cell_type": "markdown", "id": "1fc2a12a", "metadata": {}, "source": [ "### Evaluate on Test ( BLEU ) score" ] }, { "cell_type": "code", "execution_count": 77, "id": "42bc4633", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 32/32 [02:19<00:00, 4.36s/it]\n" ] } ], "source": [ "# Load decoder\n", "decoder = TextDecoder(model=loaded)\n", "\n", "# greedy decoding\n", "predicted_text = []\n", "original_text = []\n", "for (batch_inputs, batch_labels) in tqdm.tqdm(validation_dataset):\n", " \n", " decoder_input_ids = batch_inputs['decoder_input_ids']\n", " \n", " # While decoding we do not need this, decoder_start_token_id will be automatically taken from saved model\n", " del batch_inputs['decoder_input_ids']\n", " \n", " predictions = decoder.decode(batch_inputs, \n", " mode='greedy', \n", " max_iterations=max_seq_len, \n", " eos_id=tokenizer_layer.eos_token_id)\n", " # Decode predictions\n", " predicted_text_batch = tokenizer_layer._tokenizer.detokenize(tf.cast(tf.squeeze(predictions['predicted_ids'], axis=1), tf.int32))\n", " predicted_text_batch = [text.numpy().decode() for text in predicted_text_batch]\n", " predicted_text.extend(predicted_text_batch)\n", " \n", " # Decoder original text\n", " original_text_batch = tokenizer_layer._tokenizer.detokenize(decoder_input_ids)\n", " original_text_batch = [text.numpy().decode() for text in original_text_batch]\n", " original_text.extend(original_text_batch)\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "27a328e9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 82, "id": "d692e726", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BLEU: 40.06566362514778\n" ] } ], "source": [ "import sacrebleu\n", "from sacremoses import MosesDetokenizer\n", "# Calculate and print the BLEU score\n", "bleu = sacrebleu.corpus_bleu(predicted_text, [original_text])\n", "print(\"BLEU: {}\".format(bleu.score))" ] }, { "cell_type": "code", "execution_count": null, "id": "aab43927", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "formats": "ipynb,md:myst" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }