Create Sentence Embedding Roberta Model + Zero-Shot from Scratch

This tutorial contains complete code to fine-tune Roberta into a meaningful sentence-embedding (sentence-transformer) model using the Quora dataset from HuggingFace. In addition to training the model, you will learn how to preprocess text into an appropriate format.

In this notebook, you will:

  • Load the Quora dataset from HuggingFace

  • Load Roberta Model using tf-transformers

  • Prepare train and validation features with the tokenizer from transformers and write them as TFRecords

  • Build your own model by combining Roberta with a custom wrapper layer (Sentence_Embedding_Model)

  • Train your own model, fine-tuning Roberta as part of that

  • Save your model and use it to extract sentence embeddings

  • Use the end-to-end model for inference in a production setup

If you’re new to working with the Quora dataset, please see QUORA for more details.

!pip install tf-transformers

!pip install transformers

!pip install wandb

!pip install datasets
import tensorflow as tf
import random
import collections
import wandb
import tempfile
import tqdm
import json

import os
import numpy as np

print("Tensorflow version", tf.__version__)
print("Devices", tf.config.list_physical_devices())

from tf_transformers.models import RobertaModel, Classification_Model
from tf_transformers.core import Trainer
from tf_transformers.optimization import create_optimizer
from tf_transformers.data import TFWriter, TFReader
from tf_transformers.losses import cross_entropy_loss_for_classification

from datasets import load_dataset


from transformers import RobertaTokenizer
Tensorflow version 2.7.0
Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
# Load Dataset
model_name = 'roberta-base'
dataset = load_dataset("quora")
tokenizer = RobertaTokenizer.from_pretrained(model_name)

# Load validation dataset
sts_b = load_dataset("stsb_multi_mt", 'en')

# Define length for examples
max_sequence_length = 128
batch_size = 128
Using custom data configuration default
Reusing dataset quora (/home/jovyan/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)
Reusing dataset stsb_multi_mt (/home/jovyan/.cache/huggingface/datasets/stsb_multi_mt/en/1.0.0/a5d260e4b7aa82d1ab7379523a005a366d9b124c76a5a5cf0c4c5365458b0ba9)

Prepare Training TFRecords using Quora

    1. Download the Quora dataset.

    2. We will take only the rows where is_duplicate=True. The model will be trained with an in-batch negative loss.

    3. Each example is a pair of sentences, e.g. sentence1 (left sentence): What is the best Android smartphone?, sentence2 (right sentence): What is the best Android smartphone ever?. A raw record is sketched below.
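
For orientation, here is a minimal sketch of what a raw Quora record looks like and how a duplicate pair could be pulled out. The field names follow the HuggingFace quora dataset; the concrete values are only illustrative.

# A record from load_dataset("quora")["train"] has roughly this shape:
# {'questions': {'id': [1, 2],
#                'text': ['What is the best Android smartphone?',
#                         'What is the best Android smartphone ever?']},
#  'is_duplicate': True}

# Illustrative only: print the first duplicate pair in the train split.
for record in dataset['train']:
    if record['is_duplicate']:
        left_sentence, right_sentence = record['questions']['text']
        print(left_sentence, '<-->', right_sentence)
        break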

def parse_train(dataset, tokenizer, max_passage_length, key):
    """Function to parse examples which are is_duplicate=1

    Args:
        dataset (:obj:`dataet`): HF dataset
        tokenizer (:obj:`tokenizer`): HF Tokenizer
        max_passage_length (:obj:`int`): Passage Length
        key (:obj:`str`): Key of dataset (`train`, `validation` etc)
    """    
    result = {}
    for f in dataset[key]:
       
        question_left , question_right = f['questions']['text']
        question_left_input_ids =  tokenizer(question_left, max_length=max_passage_length, truncation=True)['input_ids'] 
        question_right_input_ids  =  tokenizer(question_right, max_length=max_passage_length, truncation=True)['input_ids']
        
        result = {}
        result['input_ids_left'] = question_left_input_ids
        result['input_ids_right'] = question_right_input_ids
        
        yield result
        
# Write using TF Writer
schema = {
    "input_ids_left": ("var_len", "int"),
    "input_ids_right": ("var_len", "int")
    
}

tfrecord_train_dir = tempfile.mkdtemp()
tfrecord_filename = 'quora'

tfwriter = TFWriter(schema=schema, 
                    file_name=tfrecord_filename, 
                    model_dir=tfrecord_train_dir,
                    tag='train',
                    overwrite=True
                    )

# Train dataset
train_parser_fn = parse_train(dataset, tokenizer, max_sequence_length, key='train')
tfwriter.process(parse_fn=train_parser_fn)
INFO:absl:Total individual observations/examples written is 404290 in 276.39959359169006 seconds
INFO:absl:All writer objects closed

Prepare Validation TFRecords using STS-b

  1. Download the STS-B dataset.

  2. We will use this dataset to evaluate the sentence embeddings by measuring the correlation between embedding similarity and the human-annotated scores (score normalization sketched below).
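
Each STS-B record carries sentence1, sentence2 and a similarity_score in [0, 5]; parse_dev below min-max scales that score into [0, 1]. A minimal sketch, with an illustrative record:

# Illustrative STS-B record (fields as exposed by the stsb_multi_mt dataset):
example = {'sentence1': 'A man is playing a guitar.',
           'sentence2': 'A man plays the guitar.',
           'similarity_score': 4.0}
# Same normalization as in parse_dev: map [0, 5] -> [0, 1].
normalized_score = (example['similarity_score'] - 0.0) / (5.0 - 0.0)
print(normalized_score)  # 0.8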

def parse_dev(dataset, tokenizer, max_passage_length, key):
    """Function to parse examples which are is_duplicate=1

    Args:
        dataset (:obj:`dataet`): HF dataset
        tokenizer (:obj:`tokenizer`): HF Tokenizer
        max_passage_length (:obj:`int`): Passage Length
        key (:obj:`str`): Key of dataset (`train`, `validation` etc)
    """    
    result = {}
    max_score = 5.0
    min_score = 0.0
    for f in dataset[key]:
        
        question_left = f['sentence1']
        question_right = f['sentence2']
        question_left_input_ids =  tokenizer(question_left, max_length=max_passage_length, truncation=True)['input_ids'] 
        question_right_input_ids  =  tokenizer(question_right, max_length=max_passage_length, truncation=True)['input_ids']
        
        result = {}
        result['input_ids_left'] = question_left_input_ids
        result['input_ids_right'] = question_right_input_ids
        score = f['similarity_score']
        # Normalize scores
        result['score'] = (score - min_score) / (max_score - min_score)
        yield result
        
# Write using TF Writer
schema = {
    "input_ids_left": ("var_len", "int"),
    "input_ids_right": ("var_len", "int"),
    "score": ("var_len", "float")
    
}

tfrecord_validation_dir = tempfile.mkdtemp()
tfrecord_validation_filename = 'sts'

tfwriter = TFWriter(schema=schema, 
                    file_name=tfrecord_validation_filename, 
                    model_dir=tfrecord_validation_dir,
                    tag='eval',
                    overwrite=True
                    )

# Dev dataset
dev_parser_fn = parse_dev(sts_b, tokenizer, max_sequence_length, key='dev')
tfwriter.process(parse_fn=dev_parser_fn)
INFO:absl:Total individual observations/examples written is 1500 in 1.0107736587524414 seconds
INFO:absl:All writer objects closed

Prepare Training and Validation Dataset from TFRecords

# Read TFRecord

def add_mask_type_ids(item):
    
    item['input_mask_left'] = tf.ones_like(item['input_ids_left'])
    item['input_type_ids_left']= tf.zeros_like(item['input_ids_left'])
    item['input_mask_right'] = tf.ones_like(item['input_ids_right'])
    item['input_type_ids_right']= tf.zeros_like(item['input_ids_right'])
    
    labels = {}
    if 'score' in item:
        labels = {'score': item['score']}
        del item['score']
    
    return item, labels

# Train dataset
schema = json.load(open("{}/schema.json".format(tfrecord_train_dir)))
total_train_examples = json.load(open("{}/stats.json".format(tfrecord_train_dir)))['total_records']


all_files = tf.io.gfile.glob("{}/*.tfrecord".format(tfrecord_train_dir))
tf_reader = TFReader(schema=schema, 
                    tfrecord_files=all_files)

x_keys = ['input_ids_left', 'input_ids_right']
train_dataset = tf_reader.read_record(auto_batch=False, 
                                   keys=x_keys,
                                   batch_size=batch_size, 
                                   x_keys = x_keys, 
                                   shuffle=True
                                  )
train_dataset = train_dataset.map(add_mask_type_ids, num_parallel_calls=tf.data.AUTOTUNE).padded_batch(batch_size, drop_remainder=True)


# Validation dataset
val_schema = json.load(open("{}/schema.json".format(tfrecord_validation_dir)))
all_val_files = tf.io.gfile.glob("{}/*.tfrecord".format(tfrecord_validation_dir))
tf_reader_val = TFReader(schema=val_schema, 
                    tfrecord_files=all_val_files)

x_keys_val = ['input_ids_left', 'input_ids_right', 'score']
validation_dataset = tf_reader_val.read_record(auto_batch=False, 
                                   keys=x_keys_val,
                                   batch_size=batch_size, 
                                   x_keys = x_keys_val, 
                                   shuffle=True
                                  )

# Static shapes make things faster inside tf.function,
# especially for validation, as we pass whole batches to tf.function
padded_shapes = ({'input_ids_left': [max_sequence_length,], 
                 'input_mask_left':[max_sequence_length,],
                 'input_type_ids_left':[max_sequence_length,],
                 'input_ids_right': [max_sequence_length,],
                 'input_mask_right': [max_sequence_length,],
                 'input_type_ids_right': [max_sequence_length,]
                }, 
                 {'score': [None,]})
validation_dataset = validation_dataset.map(add_mask_type_ids,
                                            num_parallel_calls=tf.data.AUTOTUNE).padded_batch(batch_size,
                                                                                              drop_remainder=False,
                                                                                              padded_shapes=padded_shapes
                                                                                              )
2022-03-23 01:10:07.501286: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-23 01:10:08.934282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30945 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0
2022-03-23 01:10:08.938622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30945 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0

Build Sentence Transformer Model

import tensorflow as tf
from tf_transformers.core import LegacyLayer, LegacyModel


class Sentence_Embedding_Model(LegacyLayer):
    def __init__(
        self,
        model,
        is_training=False,
        use_dropout=False,
        **kwargs,
    ):
        r"""
        Simple Sentence Embedding using Keras Layer

        Args:
            model (:obj:`LegacyLayer/LegacyModel`):
                Model.
                Eg:`~tf_transformers.model.BertModel`.
            is_training (:obj:`bool`, `optional`, defaults to False): To train
            use_dropout (:obj:`bool`, `optional`, defaults to False): Use dropout
        """
        super(Sentence_Embedding_Model, self).__init__(
            is_training=is_training, use_dropout=use_dropout, name=model.name, **kwargs
        )

        self.model = model
        if isinstance(model, LegacyModel):
            self.model_config = model.model_config
        elif isinstance(model, tf.keras.layers.Layer):
            self.model_config = model._config_dict
        self._is_training = is_training
        self._use_dropout = use_dropout

        # Initialize model
        self.model_inputs, self.model_outputs = self.get_model(initialize_only=True)
        
    def get_mean_embeddings(self, token_embeddings, input_mask):
        """
        Mean embeddings
        """
        # mask PAD tokens
        token_emb_masked = token_embeddings * tf.cast(tf.expand_dims(input_mask, 2), tf.float32)
        total_non_padded_tokens_per_batch = tf.cast(tf.reduce_sum(input_mask, axis=1), tf.float32)
        # Convert to 2D
        total_non_padded_tokens_per_batch = tf.expand_dims(total_non_padded_tokens_per_batch, 1)
        mean_embeddings = tf.reduce_sum(token_emb_masked, axis=1)/ total_non_padded_tokens_per_batch
        return mean_embeddings

    def call(self, inputs):
        """Call"""
        
        # Extract left and right input pairs
        left_inputs = {k.replace('_left', ''):v for k,v in inputs.items() if 'left' in k}
        right_inputs = {k.replace('_right', ''):v for k,v in inputs.items() if 'right' in k}
        model_outputs_left = self.model(left_inputs)
        model_outputs_right = self.model(right_inputs)
        
        left_cls = model_outputs_left['cls_output']
        right_cls = model_outputs_right['cls_output']        

        left_mean_embeddings  = self.get_mean_embeddings(model_outputs_left['token_embeddings'], left_inputs['input_mask'])
        right_mean_embeddings  = self.get_mean_embeddings(model_outputs_right['token_embeddings'], right_inputs['input_mask'])
        
        cls_logits = tf.matmul(left_cls, right_cls, transpose_b=True)
        mean_logits = tf.matmul(left_mean_embeddings, right_mean_embeddings, transpose_b=True)
        
        
        results = {'left_cls_output': left_cls, 
                   'right_cls_output': right_cls, 
                   'left_mean_embeddings': left_mean_embeddings,
                   'right_mean_embeddings': right_mean_embeddings,
                   'cls_logits': cls_logits, 
                   'mean_logits': mean_logits}
        
        return results
        

    def get_model(self, initialize_only=False):
        """Get model"""
        inputs = self.model.input
        # Left and Right inputs
        main_inputs = {}
        for k, v in inputs.items():
            shape = v.shape
            main_inputs[k+'_left'] = tf.keras.layers.Input(
                            shape[1:], batch_size=v.shape[0], name=k+'_left', dtype=v.dtype
                        )
            
        for k, v in inputs.items():
            shape = v.shape
            main_inputs[k+'_right'] = tf.keras.layers.Input(
                            shape[1:], batch_size=v.shape[0], name=k+'_right', dtype=v.dtype
                        )        
        layer_outputs = self(main_inputs)
        if initialize_only:
            return main_inputs, layer_outputs
        model = LegacyModel(inputs=main_inputs, outputs=layer_outputs, name="sentence_embedding_model")
        model.model_config = self.model_config
        return model

Load Model, Optimizer, Trainer

Our Trainer expects model, optimizer and loss to be a function.

    1. We will use Roberta as the base model and pass it to the Sentence_Embedding_Model layer we built above.

    2. We will use in-batch negative loss as the loss function: for a batch of positive pairs, every diagonal entry of the similarity matrix is a positive and the rest of the row are negatives (see the sketch after this list).
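
To make the in-batch negative idea concrete, here is a minimal self-contained sketch using plain TensorFlow ops (not the tf-transformers loss used below): the N x N similarity matrix of a batch of N pairs is treated as N classification problems whose correct class is the diagonal entry.

# Toy batch of 3 left/right embeddings (batch_size x dim); values are illustrative.
toy_left = tf.random.normal((3, 8))
toy_right = tf.random.normal((3, 8))

toy_logits = tf.matmul(toy_left, toy_right, transpose_b=True)  # (3, 3) similarity matrix
toy_labels = tf.range(tf.shape(toy_logits)[0])                 # diagonal entries are the positives

# Softmax cross entropy over each row: row i should pick column i.
toy_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=toy_labels, logits=toy_logits)
)
print(toy_loss.numpy())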

# Load Model
def get_model(model_name, is_training, use_dropout):
    """Get Model"""
    def model_fn():
        model = RobertaModel.from_pretrained(model_name)
        sentence_transformers_model = Sentence_Embedding_Model(model)
        sentence_transformers_model = sentence_transformers_model.get_model()
        return sentence_transformers_model
    return model_fn

# Load Optimizer
def get_optimizer(learning_rate, examples, batch_size, epochs, use_constant_lr=False):
    """Get optimizer"""
    steps_per_epoch = int(examples / batch_size)
    num_train_steps = steps_per_epoch * epochs
    warmup_steps = int(0.1 * num_train_steps)

    def optimizer_fn():
        optimizer, learning_rate_fn = create_optimizer(learning_rate, num_train_steps, warmup_steps, use_constant_lr=use_constant_lr)
        return optimizer

    return optimizer_fn

# Load trainer
def get_trainer(distribution_strategy, num_gpus=0, tpu_address=None):
    """Get Trainer"""
    trainer = Trainer(distribution_strategy, num_gpus=num_gpus, tpu_address=tpu_address)
    return trainer

# Create loss
def in_batch_negative_loss():
    
    def loss_fn(y_true_dict, y_pred_dict):
        
        labels = tf.range(y_pred_dict['cls_logits'].shape[0])
        cls_loss  = cross_entropy_loss_for_classification(labels=labels, logits=y_pred_dict['cls_logits'])
        mean_loss = cross_entropy_loss_for_classification(labels=labels, logits=y_pred_dict['mean_logits'])
        
        result = {}
        result['cls_loss'] = cls_loss
        result['mean_loss'] = mean_loss
        result['loss'] = (cls_loss + mean_loss)/2.0
        return result
    
    return loss_fn

Wandb Configuration

project = "TUTORIALS"
display_name = "roberta_quora_sentence_embedding"
wandb.init(project=project, name=display_name)

Zero-Shot on STS before Training

    1. Let's evaluate how well Roberta captures sentence embeddings before any fine-tuning on Quora.

    2. This gives us an indication of whether the model learns anything useful during downstream fine-tuning.

    3. We use cls_output, the pooler output of the Roberta model, as the sentence embedding and evaluate it using Pearson and Spearman correlation.

from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
from scipy.stats import pearsonr, spearmanr

model = RobertaModel.from_pretrained(model_name)

sentence1_embeddings = []
sentence2_embeddings = []
sts_labels = []
for batch_inputs, batch_labels in tqdm.tqdm(validation_dataset):
    left_inputs = {k.replace('_left', ''):v for k,v in batch_inputs.items() if 'left' in k}
    right_inputs = {k.replace('_right', ''):v for k,v in batch_inputs.items() if 'right' in k}
    left_outputs = model(left_inputs)
    right_outputs = model(right_inputs)
    
    # sentence 1 embeddings
    sentence1_embeddings.append(left_outputs['cls_output'])
    # sentence 2 embeddings
    sentence2_embeddings.append(right_outputs['cls_output'])
    sts_labels.append(batch_labels['score'])
    
sts_labels = tf.squeeze(tf.concat(sts_labels, axis=0), axis=1)
sentence1_embeddings = tf.concat(sentence1_embeddings, axis=0)
sentence2_embeddings = tf.concat(sentence2_embeddings, axis=0)

cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
dot_products        = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]


eval_pearson_cosine, _    = pearsonr(sts_labels, cosine_scores)
eval_spearman_cosine, _   = spearmanr(sts_labels, cosine_scores)

eval_pearson_manhattan, _  = pearsonr(sts_labels, manhattan_distances)
eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)

eval_pearson_euclidean, _  = pearsonr(sts_labels, euclidean_distances)
eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)

eval_pearson_dot, _  = pearsonr(sts_labels, dot_products)
eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)


print("Cosine-Similarity :\tPearson: {:.4f}\tSpearman: {:.4f}".format(
    eval_pearson_cosine, eval_spearman_cosine))
print("Manhattan-Distance:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
    eval_pearson_manhattan, eval_spearman_manhattan))
print("Euclidean-Distance:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
    eval_pearson_euclidean, eval_spearman_euclidean))
print("Dot-Product-Similarity:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
    eval_pearson_dot, eval_spearman_dot))
INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__roberta-base-no-mlm.main.9e4aa91ba5936c6ac98586f85c152831e421d0ec/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/roberta-base-no-mlm
12it [00:12,  1.08s/it]
Cosine-Similarity :	Pearson: 0.4278	Spearman: 0.5293
Manhattan-Distance:	Pearson: 0.4329	Spearman: 0.5120
Euclidean-Distance:	Pearson: 0.4365	Spearman: 0.5125
Dot-Product-Similarity:	Pearson: -0.0079	Spearman: -0.0050
import tqdm
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
from scipy.stats import pearsonr, spearmanr

class STSEvaluationCallback:
    def __init__(self) -> None:
        pass

    def __call__(self, trainer_kwargs):

        validation_dataset_distributed = iter(
            trainer_kwargs["validation_dataset_distributed"]
        )
        model = trainer_kwargs["model"]
        wandb = trainer_kwargs["wandb"]
        step = trainer_kwargs["global_step"]
        strategy = trainer_kwargs["strategy"]
        epoch = trainer_kwargs["epoch"]
        epochs = trainer_kwargs["epochs"]
        validation_steps = trainer_kwargs["validation_steps"]

        if validation_dataset_distributed is None:
            raise ValueError(
                "No validation dataset has been provided either in the trainer class, \
                                 or when callback is initialized. Please provide a validation dataset"
            )

        @tf.function
        def validate_run(dist_inputs):
            batch_inputs, batch_labels = dist_inputs
            model_outputs = model(batch_inputs)
            s1_cls = model_outputs['left_cls_output']
            s2_cls = model_outputs['right_cls_output']
            
            s1_mean = model_outputs['left_mean_embeddings']
            s2_mean = model_outputs['right_mean_embeddings']
            return s1_cls, s2_cls, s1_mean, s2_mean, batch_labels['score']
        
        S1_cls = []
        S2_cls = []
        S1_mean = []
        S2_mean = []
        sts_labels = []
        # This is a hack to make tqdm print a coloured bar
        # TODO: fix it.
        pbar = tqdm.trange(validation_steps, colour="magenta", unit="batch")
        for step_counter in pbar:
            dist_inputs = next(validation_dataset_distributed)
            s1_cls, s2_cls, s1_mean, s2_mean, batch_scores = strategy.run(
                validate_run, args=(dist_inputs,)
            )
            s1_cls = tf.concat(
                trainer.distribution_strategy.experimental_local_results(s1_cls),
                axis=0,
            )
            s2_cls = tf.concat(
                trainer.distribution_strategy.experimental_local_results(s2_cls),
                axis=0,
            )
            s1_mean = tf.concat(
                trainer.distribution_strategy.experimental_local_results(s1_mean),
                axis=0,
            )
            s2_mean = tf.concat(
                trainer.distribution_strategy.experimental_local_results(s2_mean),
                axis=0,
            )
            
            scores = tf.concat(
                trainer.distribution_strategy.experimental_local_results(
                    batch_scores
                ),
                axis=0,
            )

            S1_cls.append(s1_cls)
            S2_cls.append(s2_cls)
            S1_mean.append(s1_mean)
            S2_mean.append(s2_mean)
            sts_labels.append(scores)
            pbar.set_description(
                "Callback: Epoch {}/{} --- Step {}/{} ".format(
                    epoch, epochs, step_counter, validation_steps
                )
            )
            
            
        sts_labels = tf.squeeze(tf.concat(sts_labels, axis=0), axis=1)
        sentence1_embeddings = tf.concat(S1_cls, axis=0)
        sentence2_embeddings = tf.concat(S2_cls, axis=0)

        cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
        manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
        euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
        dot_products        = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]


        eval_pearson_cosine, _    = pearsonr(sts_labels, cosine_scores)
        eval_spearman_cosine, _   = spearmanr(sts_labels, cosine_scores)

        eval_pearson_manhattan, _  = pearsonr(sts_labels, manhattan_distances)
        eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)

        eval_pearson_euclidean, _  = pearsonr(sts_labels, euclidean_distances)
        eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)

        eval_pearson_dot, _  = pearsonr(sts_labels, dot_products)
        eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)

        metrics_result = {'pearson_cosine_cls': eval_pearson_cosine,
                          'spearman_cosine_cls': eval_spearman_cosine,
                          'pearson_manhattan_cls': eval_pearson_manhattan, 
                          'spearman_manhattan_cls': eval_spearman_manhattan, 
                          'pearson_euclidean_cls': eval_pearson_euclidean, 
                          'spearman_euclidean_cls': eval_spearman_euclidean, 
                          'pearson_dot_cls': eval_pearson_dot, 
                          'spearman_dot_cls': eval_spearman_dot}
        
        sentence1_embeddings = tf.concat(S1_mean, axis=0)
        sentence2_embeddings = tf.concat(S2_mean, axis=0)

        cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
        manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
        euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
        dot_products        = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]


        eval_pearson_cosine, _    = pearsonr(sts_labels, cosine_scores)
        eval_spearman_cosine, _   = spearmanr(sts_labels, cosine_scores)

        eval_pearson_manhattan, _  = pearsonr(sts_labels, manhattan_distances)
        eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)

        eval_pearson_euclidean, _  = pearsonr(sts_labels, euclidean_distances)
        eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)

        eval_pearson_dot, _  = pearsonr(sts_labels, dot_products)
        eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)
        
        metrics_result_mean = {'pearson_cosine_mean': eval_pearson_cosine,
                          'spearman_cosine_mean': eval_spearman_cosine,
                          'pearson_manhattan_mean': eval_pearson_manhattan, 
                          'spearman_manhattan_mean': eval_spearman_manhattan, 
                          'pearson_euclidean_mean': eval_pearson_euclidean, 
                          'spearman_euclidean_mean': eval_spearman_euclidean, 
                          'pearson_dot_mean': eval_pearson_dot, 
                          'spearman_dot_mean': eval_spearman_dot}
        
        metrics_result.update(metrics_result_mean)
        pbar.set_postfix(**metrics_result)
        
        if wandb:
            wandb.log(metrics_result, step=step)

        return metrics_result

Set Hyperparameters and Configs

  1. Set necessary hyperparameters.

  2. Prepare train dataset, validation dataset.

  3. Load model, optimizer, loss and trainer.

# Model configs
learning_rate = 2e-5
epochs = 3
model_checkpoint_dir = 'MODELS/roberta_quora_embeddings'


# Total train examples
steps_per_epoch = total_train_examples // batch_size

# model
model_fn =  get_model(model_name, is_training=True, use_dropout=True)
# optimizer
optimizer_fn = get_optimizer(learning_rate, total_train_examples, batch_size, epochs)
# trainer (multi gpu strategy)
trainer = get_trainer(distribution_strategy='mirrored', num_gpus=2)
# loss
loss_fn = in_batch_negative_loss()
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

Train :-)

    1. The loss comes down within epoch 1 itself.

    2. Zero-shot evaluation after epoch 1 shows the Pearson and Spearman correlations rise to about 0.80, a significant improvement over the Roberta base model, where we got about 0.43.

    3. Without training on STS-B at all, we get a good zero-shot evaluation score on the STS-B dev set.

sts_callback = STSEvaluationCallback()
history = trainer.run(
    model_fn=model_fn,
    optimizer_fn=optimizer_fn,
    train_dataset=train_dataset,
    train_loss_fn=loss_fn,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    model_checkpoint_dir=model_checkpoint_dir,
    batch_size=batch_size,
    validation_dataset=validation_dataset,
    validation_loss_fn=loss_fn,
    training_loss_names = ['cls_loss', 'mean_loss'],
    validation_loss_names = ['cls_loss', 'mean_loss'],
    steps_per_call=10,
    callbacks=[sts_callback],
    wandb=wandb
)
INFO:absl:Make sure `steps_per_epoch` should be less than or equal to number of batches in dataset.
INFO:absl:Policy: ----> float32
INFO:absl:Strategy: ---> <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f0b5c0287d0>
INFO:absl:Num GPU Devices: ---> 2
INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__roberta-base-no-mlm.main.9e4aa91ba5936c6ac98586f85c152831e421d0ec/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/roberta-base-no-mlm
INFO:absl:Using linear optimization warmup
INFO:absl:Using Adamw optimizer
INFO:absl:No ❌❌ checkpoint found in MODELS/roberta_quora_embeddings
Train: Epoch 1/4 --- Step 10/3158 --- total examples 0 , trainable variables 199:   0%|          | 0/315 [00:00<?, ?batch /s]
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Train: Epoch 1/4 --- Step 3150/3158 --- total examples 401920 , trainable variables 199: 100%|██████████| 315/315 [27:15<00:00,  5.19s/batch , _runtime=1803, _timestamp=1.65e+9, cls_loss=0.504, learning_rate=1.34e-5, loss=0.493, mean_loss=0.482]
INFO:absl:Model saved at epoch 1 at MODELS/roberta_quora_embeddings/ckpt-1

  0%|          | 0/12 [00:00<?, ?batch /s]
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Validation: Epoch 1/4 --- Step 0/12 :   8%|| 1/12 [00:11<02:07, 11.57s/batch , cls_loss=2.63, loss=2.8, mean_loss=2.96]
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Validation: Epoch 1/4 --- Step 11/12 : 100%|██████████| 12/12 [00:24<00:00,  2.07s/batch , cls_loss=1.62, loss=1.65, mean_loss=1.68]
INFO:absl:Validation result at epcoh 1 and                 global step 3150 is {'cls_loss': 1.6162163, 'mean_loss': 1.6796235, 'loss': 1.6479198}
INFO:absl:Callbacks in progress at epoch end 1 . . . .

Callback: Epoch 1/3 --- Step 11/12 : 100%|██████████| 12/12 [00:21<00:00,  1.76s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8199941582015672, 'spearman_cosine_cls': 0.8220972132455343, 'pearson_manhattan_cls': 0.8184392097896854, 'spearman_manhattan_cls': 0.8164570492108482, 'pearson_euclidean_cls': 0.8190927383440411, 'spearman_euclidean_cls': 0.8172409394315345, 'pearson_dot_cls': 0.7828011052166998, 'spearman_dot_cls': 0.7807641366784325, 'pearson_cosine_mean': 0.8151801531088095, 'spearman_cosine_mean': 0.8162854946579012, 'pearson_manhattan_mean': 0.8145520799669964, 'spearman_manhattan_mean': 0.8123405339144811, 'pearson_euclidean_mean': 0.8148764479876582, 'spearman_euclidean_mean': 0.8132354135356057, 'pearson_dot_mean': 0.7381403305760383, 'spearman_dot_mean': 0.7337835384203213, '_timestamp': 1647999708, '_runtime': 1854} at epoch 1
wandb: WARNING Step must only increase in log calls.  Step 1 < 3150; dropping {'pearson_cosine_cls': 0.8199941582015672, 'spearman_cosine_cls': 0.8220972132455343, 'pearson_manhattan_cls': 0.8184392097896854, 'spearman_manhattan_cls': 0.8164570492108482, 'pearson_euclidean_cls': 0.8190927383440411, 'spearman_euclidean_cls': 0.8172409394315345, 'pearson_dot_cls': 0.7828011052166998, 'spearman_dot_cls': 0.7807641366784325, 'pearson_cosine_mean': 0.8151801531088095, 'spearman_cosine_mean': 0.8162854946579012, 'pearson_manhattan_mean': 0.8145520799669964, 'spearman_manhattan_mean': 0.8123405339144811, 'pearson_euclidean_mean': 0.8148764479876582, 'spearman_euclidean_mean': 0.8132354135356057, 'pearson_dot_mean': 0.7381403305760383, 'spearman_dot_mean': 0.7337835384203213, '_timestamp': 1647999708, '_runtime': 1854}.

Train: Epoch 2/4 --- Step 3150/3158 --- total examples 805120 , trainable variables 199: 100%|██████████| 315/315 [26:13<00:00,  5.00s/batch , _runtime=3428, _timestamp=1.65e+9, cls_loss=0.304, learning_rate=6.71e-6, loss=0.305, mean_loss=0.305]
INFO:absl:Model saved at epoch 2 at MODELS/roberta_quora_embeddings/ckpt-2

Validation: Epoch 2/4 --- Step 11/12 : 100%|██████████| 12/12 [00:05<00:00,  2.27batch /s, cls_loss=1.78, loss=1.82, mean_loss=1.85]
INFO:absl:Validation result at epcoh 2 and                 global step 6300 is {'cls_loss': 1.778288, 'mean_loss': 1.8532048, 'loss': 1.8157464}
INFO:absl:Callbacks in progress at epoch end 2 . . . .

Callback: Epoch 2/3 --- Step 11/12 : 100%|██████████| 12/12 [00:19<00:00,  1.61s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8082112012752523, 'spearman_cosine_cls': 0.8088788767212841, 'pearson_manhattan_cls': 0.7977193919551161, 'spearman_manhattan_cls': 0.79662337716043, 'pearson_euclidean_cls': 0.7982407615058535, 'spearman_euclidean_cls': 0.7970524557483568, 'pearson_dot_cls': 0.7645641878510724, 'spearman_dot_cls': 0.7678639160320804, 'pearson_cosine_mean': 0.8030011391493671, 'spearman_cosine_mean': 0.8044760711917577, 'pearson_manhattan_mean': 0.7959895612836713, 'spearman_manhattan_mean': 0.7952571982816723, 'pearson_euclidean_mean': 0.7974056893147314, 'spearman_euclidean_mean': 0.7970287600024667, 'pearson_dot_mean': 0.7324014153178778, 'spearman_dot_mean': 0.7335963354441554, '_timestamp': 1648001312, '_runtime': 3458} at epoch 2
wandb: WARNING Step must only increase in log calls.  Step 2 < 6300; dropping {'pearson_cosine_cls': 0.8082112012752523, 'spearman_cosine_cls': 0.8088788767212841, 'pearson_manhattan_cls': 0.7977193919551161, 'spearman_manhattan_cls': 0.79662337716043, 'pearson_euclidean_cls': 0.7982407615058535, 'spearman_euclidean_cls': 0.7970524557483568, 'pearson_dot_cls': 0.7645641878510724, 'spearman_dot_cls': 0.7678639160320804, 'pearson_cosine_mean': 0.8030011391493671, 'spearman_cosine_mean': 0.8044760711917577, 'pearson_manhattan_mean': 0.7959895612836713, 'spearman_manhattan_mean': 0.7952571982816723, 'pearson_euclidean_mean': 0.7974056893147314, 'spearman_euclidean_mean': 0.7970287600024667, 'pearson_dot_mean': 0.7324014153178778, 'spearman_dot_mean': 0.7335963354441554, '_timestamp': 1648001312, '_runtime': 3458}.

Train: Epoch 3/4 --- Step 3150/3158 --- total examples 1208320 , trainable variables 199: 100%|██████████| 315/315 [26:08<00:00,  4.98s/batch , _runtime=5027, _timestamp=1.65e+9, cls_loss=0.275, learning_rate=6.02e-8, loss=0.278, mean_loss=0.282]
INFO:absl:Model saved at epoch 3 at MODELS/roberta_quora_embeddings/ckpt-3

Validation: Epoch 3/4 --- Step 11/12 : 100%|██████████| 12/12 [00:05<00:00,  2.27batch /s, cls_loss=2.3, loss=2.29, mean_loss=2.28] 
INFO:absl:Validation result at epcoh 3 and                 global step 9450 is {'cls_loss': 2.2950742, 'mean_loss': 2.2835078, 'loss': 2.2892911}
INFO:absl:Callbacks in progress at epoch end 3 . . . .

Callback: Epoch 3/3 --- Step 11/12 : 100%|██████████| 12/12 [00:18<00:00,  1.56s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8118276970012877, 'spearman_cosine_cls': 0.8110754654257855, 'pearson_manhattan_cls': 0.7893045752002403, 'spearman_manhattan_cls': 0.7901086696302247, 'pearson_euclidean_cls': 0.789695365243242, 'spearman_euclidean_cls': 0.7900715861621009, 'pearson_dot_cls': 0.7764208929832053, 'spearman_dot_cls': 0.7831285760325771, 'pearson_cosine_mean': 0.8074910790403766, 'spearman_cosine_mean': 0.8084473888790257, 'pearson_manhattan_mean': 0.792546459118103, 'spearman_manhattan_mean': 0.794987013041834, 'pearson_euclidean_mean': 0.7943711503130662, 'spearman_euclidean_mean': 0.7970291923871069, 'pearson_dot_mean': 0.7619295041302732, 'spearman_dot_mean': 0.7644860560497375, '_timestamp': 1648002910, '_runtime': 5056} at epoch 3
wandb: WARNING Step must only increase in log calls.  Step 3 < 9450; dropping {'pearson_cosine_cls': 0.8118276970012877, 'spearman_cosine_cls': 0.8110754654257855, 'pearson_manhattan_cls': 0.7893045752002403, 'spearman_manhattan_cls': 0.7901086696302247, 'pearson_euclidean_cls': 0.789695365243242, 'spearman_euclidean_cls': 0.7900715861621009, 'pearson_dot_cls': 0.7764208929832053, 'spearman_dot_cls': 0.7831285760325771, 'pearson_cosine_mean': 0.8074910790403766, 'spearman_cosine_mean': 0.8084473888790257, 'pearson_manhattan_mean': 0.792546459118103, 'spearman_manhattan_mean': 0.794987013041834, 'pearson_euclidean_mean': 0.7943711503130662, 'spearman_euclidean_mean': 0.7970291923871069, 'pearson_dot_mean': 0.7619295041302732, 'spearman_dot_mean': 0.7644860560497375, '_timestamp': 1648002910, '_runtime': 5056}.

Visualize the TensorBoard Logs

%load_ext tensorboard

%tensorboard --logdir MODELS/roberta_quora_embeddings/logs

Load Trained Model for Testing and Save it as a Serialized Model

    1. To get good sentence embeddings, we only need the Roberta model, which has been used as the base of Sentence_Embedding_Model.

# Save a serialized version of the model

# Note: Ignore checkpoint warnings; they appear because we saved the optimizer with the checkpoint,
# while here we restore only the model weights.


model_fn =  get_model(model_name, is_training=False, use_dropout=False)
model = model_fn()
model.load_checkpoint(model_checkpoint_dir)

# Roberta base (model.layers[-1] is Sentence_Embedding_Model )
model = model.layers[-1].model
model.save_transformers_serialized('{}/saved_model/'.format(model_checkpoint_dir))
WARNING:absl:Found untraced functions such as word_embeddings_layer_call_fn, word_embeddings_layer_call_and_return_conditional_losses, type_embeddings_layer_call_fn, type_embeddings_layer_call_and_return_conditional_losses, positional_embeddings_layer_call_fn while saving (showing 5 of 870). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: MODELS/roberta_quora_embeddings/saved_model/assets
INFO:tensorflow:Assets written to: MODELS/roberta_quora_embeddings/saved_model/assets

Model Serialization (Production)

    1. Let's see how we can use this model to extract sentence embeddings.

    2. Print the top-K most similar sentences to a query, using our precomputed Quora embeddings.

# Load serialized model

loaded = tf.saved_model.load("{}/saved_model/".format(model_checkpoint_dir))
model = loaded.signatures['serving_default']
# Take 100000 sentences from Quora and compute their embeddings
quora_questions = []
for item in dataset['train']:
    quora_questions.extend(item['questions']['text'])
    
quora_questions = list(set(quora_questions))
quora_questions = quora_questions[:100000] # Take 100000
print("Total sentences {}".format(len(quora_questions)))

# Prepare Dataset
quora_dataset = tf.data.Dataset.from_tensor_slices({'questions': quora_questions})
quora_dataset = quora_dataset.batch(batch_size, drop_remainder=False)
Total sentences 100000

Quora Sentence Embeddings

quora_sentence_embeddings = []
for batch_questions in tqdm.tqdm(quora_dataset):
    batch_questions = batch_questions['questions'].numpy().tolist()
    batch_questions = [q.decode() for q in batch_questions]
    
    # Tokenize
    quora_inputs = tokenizer(batch_questions, max_length=max_sequence_length, padding=True, truncation=True, return_tensors='tf')
    quora_inputs['input_mask'] = quora_inputs['attention_mask']
    quora_inputs['input_type_ids'] = tf.zeros_like(quora_inputs['input_ids'])
    del quora_inputs['attention_mask'] # we don't need this key

    model_outputs = model(**quora_inputs)
    quora_sentence_embeddings.append(model_outputs['cls_output'])
    
# Pack and Normalize
quora_sentence_embeddings = tf.nn.l2_normalize(tf.concat(quora_sentence_embeddings, axis=0), axis=-1)
100%|██████████| 782/782 [03:30<00:00,  3.71it/s]

Most Similar Sentences

def most_similar(input_question, top_k=10):
    quora_inputs = tokenizer([input_question], max_length=max_sequence_length, padding=True, truncation=True, return_tensors='tf')
    quora_inputs['input_mask'] = quora_inputs['attention_mask']
    quora_inputs['input_type_ids'] = tf.zeros_like(quora_inputs['input_ids'])
    del quora_inputs['attention_mask'] # we don't need this key
    model_outputs = model(**quora_inputs)
    query_vector = model_outputs['cls_output']
    query_vector = tf.nn.l2_normalize(query_vector, axis=1)

    scores = tf.matmul(query_vector, quora_sentence_embeddings, transpose_b=True)
    top_k_values = tf.nn.top_k(scores, k=top_k)
    for i in range(top_k):
        best_index = top_k_values.indices.numpy()[0][i]
        best_prob = top_k_values.values.numpy()[0][i]
        print(quora_questions[best_index], '-->', best_prob)
input_question = 'What is the best way to propose a girl?'
most_similar(input_question)
How should I propose a girl? --> 0.9225553
How do I propose to a girl? --> 0.8723614
Which is the most romantic way to propose a girl? --> 0.8557819
How do I propose a girl for sex? --> 0.80544627
How did you propose your girlfriend? --> 0.69494146
What are some of the best and unique ways to propose marriage? --> 0.6611091
How can I propose to my crush? --> 0.64724606
If I want to propose to a girl should I give her hints in advance? --> 0.6309003
What doesit take for a man to propose? --> 0.6253518
What is the right time to propose someone ? --> 0.5932445
input_question = 'How can I start learning Deep Learning?'
most_similar(input_question)
What's the most effective way to get started with Deep Learning? --> 0.83670104
How do I learn deep learning in 1 month? --> 0.7423597
Why is deep learning so important in machine learning? --> 0.7327932
Does Quora use Deep Learning? --> 0.7260519
Should a machine learning beginner go straight for deep learning? --> 0.719143
Where should I start for machine learning? --> 0.71324116
How do i get started on machine learning? --> 0.7123989
I am New to Deep Learning. How do I start with Python? --> 0.710938
How do I start learning machine learning? --> 0.7106862
What is deep learning? How is related to AI and machine learning? --> 0.70124084
input_question = 'Best tourist destinations in India'
most_similar(input_question)
What are the must-visit and affordable tourist destinations in India? --> 0.8237094
What is the most overrated tourist destination in India? --> 0.7374817
What is the best sex tourism destination in India? --> 0.73366314
What are the most popular tourist destinations? --> 0.7160436
What are the best destination for a solo traveler in India? --> 0.7078299
What is the best holiday destination? --> 0.675949
Which places I should not visit in India as a Indian? --> 0.6656152
What are the best places to go as a tourist? --> 0.66551954
Which are some best places to visit in India? --> 0.66457677
Which is your best holiday destination? --> 0.6640895
input_question = 'Why classical music is so relaxing?'
most_similar(input_question)
What is your favourite piece of classical music and why? --> 0.75282526
What are the benefits of listening to classical music? --> 0.7361862
Why do some people only listen to classical music? --> 0.7289536
Which music is the best for relaxation? --> 0.6762159
Why is classical music better than most pop music? --> 0.6651089
What are some classical and operant conditioning in education? --> 0.64240026
Classical music in movies? --> 0.6344438
Which classic music is this? --> 0.59156764
Which ones are some of the most soothing tunes composed on a piano? --> 0.57644486
What are the differences between Hindustani classical music and Carnatic music? --> 0.57415533