Create Sentence Embedding Roberta Model + Zero-shot from Scratch¶
This tutorial contains complete code to fine-tune Roberta to build meaningful sentence embeddings using the Quora dataset from HuggingFace. In addition to training a model, you will learn how to preprocess text into an appropriate format.
In this notebook, you will:
Load the Quora dataset from HuggingFace
Load Roberta Model using tf-transformers
Prepare train and validation features using the tokenizer from transformers
Build your own model by combining Roberta with a custom wrapper (Sentence_Embedding_Model)
Train your own model, fine-tuning Roberta as part of that
Save your model and use it to extract sentence embeddings
Use the end-to-end (inference) model in a production setup
If you’re new to working with the Quora dataset, please see QUORA for more details.
!pip install tf-transformers
!pip install transformers
!pip install wandb
!pip install datasets
import tensorflow as tf
import random
import collections
import wandb
import tempfile
import tqdm
import json
import os
import numpy as np
print("Tensorflow version", tf.__version__)
print("Devices", tf.config.list_physical_devices())
from tf_transformers.models import RobertaModel, Classification_Model
from tf_transformers.core import Trainer
from tf_transformers.optimization import create_optimizer
from tf_transformers.data import TFWriter, TFReader
from tf_transformers.losses import cross_entropy_loss_for_classification
from datasets import load_dataset
from transformers import RobertaTokenizer
Tensorflow version 2.7.0
Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
# Load Dataset
model_name = 'roberta-base'
dataset = load_dataset("quora")
tokenizer = RobertaTokenizer.from_pretrained(model_name)
# Load validation dataset
sts_b = load_dataset("stsb_multi_mt", 'en')
# Define length for examples
max_sequence_length = 128
batch_size = 128
Using custom data configuration default
Reusing dataset quora (/home/jovyan/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)
Reusing dataset stsb_multi_mt (/home/jovyan/.cache/huggingface/datasets/stsb_multi_mt/en/1.0.0/a5d260e4b7aa82d1ab7379523a005a366d9b124c76a5a5cf0c4c5365458b0ba9)
Prepare Training TFRecords using Quora¶
Download Quora dataset.
We will take only those rows where is_duplicate=True. The model will be trained using in-batch negative loss.
An example is a pair of sentences:
sentence1 (left sentence): What is the best Android smartphone?
sentence2 (right sentence): What is the best Android smartphone ever?
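To make the data format concrete, here is a quick optional peek at one raw example; the field names (questions, text, is_duplicate) are those of the HuggingFace quora dataset loaded above.
# Optional: peek at one raw Quora example, the structure parse_train below relies on
sample = dataset['train'][0]
print(sample['questions']['text'])  # [question_left, question_right]
print(sample['is_duplicate'])       # we keep only pairs where this is True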
def parse_train(dataset, tokenizer, max_passage_length, key):
    """Function to parse examples where is_duplicate=True

    Args:
        dataset (:obj:`dataset`): HF dataset
        tokenizer (:obj:`tokenizer`): HF Tokenizer
        max_passage_length (:obj:`int`): Passage Length
        key (:obj:`str`): Key of dataset (`train`, `validation` etc)
    """
    for f in dataset[key]:
        # Keep only duplicate (positive) pairs, as described above
        if not f['is_duplicate']:
            continue
        question_left, question_right = f['questions']['text']
        question_left_input_ids = tokenizer(question_left, max_length=max_passage_length, truncation=True)['input_ids']
        question_right_input_ids = tokenizer(question_right, max_length=max_passage_length, truncation=True)['input_ids']
        result = {}
        result['input_ids_left'] = question_left_input_ids
        result['input_ids_right'] = question_right_input_ids
        yield result
# Write using TF Writer
schema = {
"input_ids_left": ("var_len", "int"),
"input_ids_right": ("var_len", "int")
}
tfrecord_train_dir = tempfile.mkdtemp()
tfrecord_filename = 'quora'
tfwriter = TFWriter(schema=schema,
file_name=tfrecord_filename,
model_dir=tfrecord_train_dir,
tag='train',
overwrite=True
)
# Train dataset
train_parser_fn = parse_train(dataset, tokenizer, max_sequence_length, key='train')
tfwriter.process(parse_fn=train_parser_fn)
INFO:absl:Total individual observations/examples written is 404290 in 276.39959359169006 seconds
INFO:absl:All writer objects closed
Prepare Validation TFRecords using STS-b¶
Download STS dataset.
We will use this dataset to evaluate the sentence embeddings by measuring the correlation between embedding similarity and the human-annotated similarity scores.
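For reference, each STS-B example provides two sentences and a human similarity score between 0 and 5; a quick optional peek (field names are those of the stsb_multi_mt dataset loaded above):
# Optional: peek at one STS-B dev example (fields used by parse_dev below)
sample = sts_b['dev'][0]
print(sample['sentence1'])
print(sample['sentence2'])
print(sample['similarity_score'])  # 0.0 - 5.0, normalized to [0, 1] in parse_dev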
def parse_dev(dataset, tokenizer, max_passage_length, key):
    """Function to parse STS-B examples and normalize their similarity scores

    Args:
        dataset (:obj:`dataset`): HF dataset
        tokenizer (:obj:`tokenizer`): HF Tokenizer
        max_passage_length (:obj:`int`): Passage Length
        key (:obj:`str`): Key of dataset (`train`, `dev` etc)
    """
    max_score = 5.0
    min_score = 0.0
    for f in dataset[key]:
        question_left = f['sentence1']
        question_right = f['sentence2']
        question_left_input_ids = tokenizer(question_left, max_length=max_passage_length, truncation=True)['input_ids']
        question_right_input_ids = tokenizer(question_right, max_length=max_passage_length, truncation=True)['input_ids']
        result = {}
        result['input_ids_left'] = question_left_input_ids
        result['input_ids_right'] = question_right_input_ids
        score = f['similarity_score']
        # Normalize scores to [0, 1]
        result['score'] = (score - min_score) / (max_score - min_score)
        yield result
# Write using TF Writer
schema = {
"input_ids_left": ("var_len", "int"),
"input_ids_right": ("var_len", "int"),
"score": ("var_len", "float")
}
tfrecord_validation_dir = tempfile.mkdtemp()
tfrecord_validation_filename = 'sts'
tfwriter = TFWriter(schema=schema,
file_name=tfrecord_validation_filename,
model_dir=tfrecord_validation_dir,
tag='eval',
overwrite=True
)
# Validation (dev) dataset
dev_parser_fn = parse_dev(sts_b, tokenizer, max_sequence_length, key='dev')
tfwriter.process(parse_fn=dev_parser_fn)
INFO:absl:Total individual observations/examples written is 1500 in 1.0107736587524414 seconds
INFO:absl:All writer objects closed
Prepare Training and Validation Dataset from TFRecords¶
# Read TFRecord
def add_mask_type_ids(item):
item['input_mask_left'] = tf.ones_like(item['input_ids_left'])
item['input_type_ids_left']= tf.zeros_like(item['input_ids_left'])
item['input_mask_right'] = tf.ones_like(item['input_ids_right'])
item['input_type_ids_right']= tf.zeros_like(item['input_ids_right'])
labels = {}
if 'score' in item:
labels = {'score': item['score']}
del item['score']
return item, labels
# Train dataset
schema = json.load(open("{}/schema.json".format(tfrecord_train_dir)))
total_train_examples = json.load(open("{}/stats.json".format(tfrecord_train_dir)))['total_records']
all_files = tf.io.gfile.glob("{}/*.tfrecord".format(tfrecord_train_dir))
tf_reader = TFReader(schema=schema,
tfrecord_files=all_files)
x_keys = ['input_ids_left', 'input_ids_right']
train_dataset = tf_reader.read_record(auto_batch=False,
keys=x_keys,
batch_size=batch_size,
x_keys = x_keys,
shuffle=True
)
train_dataset = train_dataset.map(add_mask_type_ids, num_parallel_calls=tf.data.AUTOTUNE).padded_batch(batch_size, drop_remainder=True)
# Validation dataset
val_schema = json.load(open("{}/schema.json".format(tfrecord_validation_dir)))
all_val_files = tf.io.gfile.glob("{}/*.tfrecord".format(tfrecord_validation_dir))
tf_reader_val = TFReader(schema=val_schema,
tfrecord_files=all_val_files)
x_keys_val = ['input_ids_left', 'input_ids_right', 'score']
validation_dataset = tf_reader_val.read_record(auto_batch=False,
keys=x_keys_val,
batch_size=batch_size,
x_keys = x_keys_val,
shuffle=True
)
# Static shapes make things faster inside tf.function,
# especially for validation, since we pass batched examples to tf.function
padded_shapes = ({'input_ids_left': [max_sequence_length,],
'input_mask_left':[max_sequence_length,],
'input_type_ids_left':[max_sequence_length,],
'input_ids_right': [max_sequence_length,],
'input_mask_right': [max_sequence_length,],
'input_type_ids_right': [max_sequence_length,]
},
{'score': [None,]})
validation_dataset = validation_dataset.map(add_mask_type_ids,
num_parallel_calls=tf.data.AUTOTUNE).padded_batch(batch_size,
drop_remainder=False,
padded_shapes=padded_shapes
)
2022-03-23 01:10:07.501286: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-23 01:10:08.934282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30945 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0
2022-03-23 01:10:08.938622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30945 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
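Before building the model, it can help to pull one padded batch from each pipeline and confirm the shapes; a minimal optional sketch, assuming the train_dataset and validation_dataset built above.
# Optional sanity check: inspect one padded batch from each pipeline
for batch_inputs, _ in train_dataset.take(1):
    print({k: v.shape for k, v in batch_inputs.items()})   # each -> (batch_size, padded_length)
for batch_inputs, batch_labels in validation_dataset.take(1):
    print({k: v.shape for k, v in batch_inputs.items()})   # each -> (batch_size, max_sequence_length)
    print(batch_labels['score'].shape)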
Build Sentence Transformer Model¶
import tensorflow as tf
from tf_transformers.core import LegacyLayer, LegacyModel
class Sentence_Embedding_Model(LegacyLayer):
def __init__(
self,
model,
is_training=False,
use_dropout=False,
**kwargs,
):
r"""
Simple Sentence Embedding using Keras Layer
Args:
model (:obj:`LegacyLayer/LegacyModel`):
Model.
Eg:`~tf_transformers.model.BertModel`.
is_training (:obj:`bool`, `optional`, defaults to False): To train
use_dropout (:obj:`bool`, `optional`, defaults to False): Use dropout
"""
super(Sentence_Embedding_Model, self).__init__(
is_training=is_training, use_dropout=use_dropout, name=model.name, **kwargs
)
self.model = model
if isinstance(model, LegacyModel):
self.model_config = model.model_config
elif isinstance(model, tf.keras.layers.Layer):
self.model_config = model._config_dict
self._is_training = is_training
self._use_dropout = use_dropout
# Initialize model
self.model_inputs, self.model_outputs = self.get_model(initialize_only=True)
def get_mean_embeddings(self, token_embeddings, input_mask):
"""
Mean embeddings
"""
# mask PAD tokens
token_emb_masked = token_embeddings * tf.cast(tf.expand_dims(input_mask, 2), tf.float32)
total_non_padded_tokens_per_batch = tf.cast(tf.reduce_sum(input_mask, axis=1), tf.float32)
# Convert to 2D
total_non_padded_tokens_per_batch = tf.expand_dims(total_non_padded_tokens_per_batch, 1)
mean_embeddings = tf.reduce_sum(token_emb_masked, axis=1)/ total_non_padded_tokens_per_batch
return mean_embeddings
def call(self, inputs):
"""Call"""
# Extract left and right input pairs
left_inputs = {k.replace('_left', ''):v for k,v in inputs.items() if 'left' in k}
right_inputs = {k.replace('_right', ''):v for k,v in inputs.items() if 'right' in k}
model_outputs_left = self.model(left_inputs)
model_outputs_right = self.model(right_inputs)
left_cls = model_outputs_left['cls_output']
right_cls = model_outputs_right['cls_output']
left_mean_embeddings = self.get_mean_embeddings(model_outputs_left['token_embeddings'], left_inputs['input_mask'])
right_mean_embeddings = self.get_mean_embeddings(model_outputs_right['token_embeddings'], right_inputs['input_mask'])
cls_logits = tf.matmul(left_cls, right_cls, transpose_b=True)
mean_logits = tf.matmul(left_mean_embeddings, right_mean_embeddings, transpose_b=True)
results = {'left_cls_output': left_cls,
'right_cls_output': right_cls,
'left_mean_embeddings': left_mean_embeddings,
'right_mean_embeddings': right_mean_embeddings,
'cls_logits': cls_logits,
'mean_logits': mean_logits}
return results
def get_model(self, initialize_only=False):
"""Get model"""
inputs = self.model.input
# Left and Right inputs
main_inputs = {}
for k, v in inputs.items():
shape = v.shape
main_inputs[k+'_left'] = tf.keras.layers.Input(
shape[1:], batch_size=v.shape[0], name=k+'_left', dtype=v.dtype
)
for k, v in inputs.items():
shape = v.shape
main_inputs[k+'_right'] = tf.keras.layers.Input(
shape[1:], batch_size=v.shape[0], name=k+'_right', dtype=v.dtype
)
layer_outputs = self(main_inputs)
if initialize_only:
return main_inputs, layer_outputs
model = LegacyModel(inputs=main_inputs, outputs=layer_outputs, name="sentence_embedding_model")
model.model_config = self.model_config
return model
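The masked mean pooling in get_mean_embeddings is easiest to see on a toy tensor. Below is a standalone sketch (made-up numbers, not tied to the model above) showing that padded positions are excluded from the average.
# Toy check of the masked mean pooling used in get_mean_embeddings
# 1 example, 3 tokens, hidden size 2; the last token is padding
toy_embeddings = tf.constant([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
toy_mask = tf.constant([[1, 1, 0]])
masked = toy_embeddings * tf.cast(tf.expand_dims(toy_mask, 2), tf.float32)
n_tokens = tf.expand_dims(tf.cast(tf.reduce_sum(toy_mask, axis=1), tf.float32), 1)
print(tf.reduce_sum(masked, axis=1) / n_tokens)  # [[2. 3.]] -> the padded token is ignored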
Load Model, Optimizer, Trainer¶
Our Trainer expects model, optimizer and loss to be functions.
We will use Roberta as the base model and pass it to Sentence_Embedding_Model, the layer we built above.
We will use in-batch negative loss as the loss function, where every diagonal entry of the logits matrix is a positive pair and all other entries are negatives.
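To see why the diagonal entries act as positives, here is a toy version of the loss on a 3x3 left-vs-right similarity matrix. It uses Keras' sparse cross-entropy purely as an illustrative stand-in for cross_entropy_loss_for_classification.
# Toy illustration of in-batch negatives: with a batch of 3 pairs,
# row i's positive is column i (its own right sentence); all other columns are negatives
toy_logits = tf.constant([[9.0, 1.0, 0.5],
                          [0.2, 8.0, 1.0],
                          [0.3, 0.7, 7.5]])
toy_labels = tf.range(toy_logits.shape[0])  # [0, 1, 2] -> the diagonal
toy_loss = tf.keras.losses.sparse_categorical_crossentropy(toy_labels, toy_logits, from_logits=True)
print(tf.reduce_mean(toy_loss))  # small, because the diagonal logits dominate each row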
# Load Model
def get_model(model_name, is_training, use_dropout):
"""Get Model"""
def model_fn():
model = RobertaModel.from_pretrained(model_name)
sentence_transformers_model = Sentence_Embedding_Model(model)
sentence_transformers_model = sentence_transformers_model.get_model()
return sentence_transformers_model
return model_fn
# Load Optimizer
def get_optimizer(learning_rate, examples, batch_size, epochs, use_constant_lr=False):
"""Get optimizer"""
steps_per_epoch = int(examples / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * num_train_steps)
def optimizer_fn():
optimizer, learning_rate_fn = create_optimizer(learning_rate, num_train_steps, warmup_steps, use_constant_lr=use_constant_lr)
return optimizer
return optimizer_fn
# Load trainer
def get_trainer(distribution_strategy, num_gpus=0, tpu_address=None):
"""Get Trainer"""
trainer = Trainer(distribution_strategy, num_gpus=num_gpus, tpu_address=tpu_address)
return trainer
# Create loss
def in_batch_negative_loss():
def loss_fn(y_true_dict, y_pred_dict):
labels = tf.range(y_pred_dict['cls_logits'].shape[0])
cls_loss = cross_entropy_loss_for_classification(labels=labels, logits=y_pred_dict['cls_logits'])
mean_loss = cross_entropy_loss_for_classification(labels=labels, logits=y_pred_dict['mean_logits'])
result = {}
result['cls_loss'] = cls_loss
result['mean_loss'] = mean_loss
result['loss'] = (cls_loss + mean_loss)/2.0
return result
return loss_fn
Wandb Configuration¶
project = "TUTORIALS"
display_name = "roberta_quora_sentence_embedding"
wandb.init(project=project, name=display_name)
Zero-Shot on STS before Training¶
Let's evaluate how well Roberta captures sentence embeddings before fine-tuning on Quora.
This gives us a baseline to judge whether the model is actually learning anything during downstream fine-tuning.
We use CLS_OUTPUT, the pooler output of the Roberta model, as the sentence embedding and evaluate using pearson and spearman correlation.
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
from scipy.stats import pearsonr, spearmanr
model = RobertaModel.from_pretrained(model_name)
sentence1_embeddings = []
sentence2_embeddings = []
sts_labels = []
for batch_inputs, batch_labels in tqdm.tqdm(validation_dataset):
left_inputs = {k.replace('_left', ''):v for k,v in batch_inputs.items() if 'left' in k}
right_inputs = {k.replace('_right', ''):v for k,v in batch_inputs.items() if 'right' in k}
left_outputs = model(left_inputs)
right_outputs = model(right_inputs)
# sentence 1 embeddings
sentence1_embeddings.append(left_outputs['cls_output'])
# sentence 2 embeddings
sentence2_embeddings.append(right_outputs['cls_output'])
sts_labels.append(batch_labels['score'])
sts_labels = tf.squeeze(tf.concat(sts_labels, axis=0), axis=1)
sentence1_embeddings = tf.concat(sentence1_embeddings, axis=0)
sentence2_embeddings = tf.concat(sentence2_embeddings, axis=0)
cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
dot_products = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]
eval_pearson_cosine, _ = pearsonr(sts_labels, cosine_scores)
eval_spearman_cosine, _ = spearmanr(sts_labels, cosine_scores)
eval_pearson_manhattan, _ = pearsonr(sts_labels, manhattan_distances)
eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)
eval_pearson_euclidean, _ = pearsonr(sts_labels, euclidean_distances)
eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)
eval_pearson_dot, _ = pearsonr(sts_labels, dot_products)
eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)
print("Cosine-Similarity :\tPearson: {:.4f}\tSpearman: {:.4f}".format(
eval_pearson_cosine, eval_spearman_cosine))
print("Manhattan-Distance:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
eval_pearson_manhattan, eval_spearman_manhattan))
print("Euclidean-Distance:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
eval_pearson_euclidean, eval_spearman_euclidean))
print("Dot-Product-Similarity:\tPearson: {:.4f}\tSpearman: {:.4f}".format(
eval_pearson_dot, eval_spearman_dot))
INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__roberta-base-no-mlm.main.9e4aa91ba5936c6ac98586f85c152831e421d0ec/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/roberta-base-no-mlm
12it [00:12, 1.08s/it]
Cosine-Similarity : Pearson: 0.4278 Spearman: 0.5293
Manhattan-Distance: Pearson: 0.4329 Spearman: 0.5120
Euclidean-Distance: Pearson: 0.4365 Spearman: 0.5125
Dot-Product-Similarity: Pearson: -0.0079 Spearman: -0.0050
import tqdm
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances
from scipy.stats import pearsonr, spearmanr
class STSEvaluationCallback:
def __init__(self) -> None:
pass
def __call__(self, trainer_kwargs):
validation_dataset_distributed = iter(
trainer_kwargs["validation_dataset_distributed"]
)
model = trainer_kwargs["model"]
wandb = trainer_kwargs["wandb"]
step = trainer_kwargs["global_step"]
strategy = trainer_kwargs["strategy"]
epoch = trainer_kwargs["epoch"]
epochs = trainer_kwargs["epochs"]
validation_steps = trainer_kwargs["validation_steps"]
if validation_dataset_distributed is None:
raise ValueError(
"No validation dataset has been provided either in the trainer class, \
or when callback is initialized. Please provide a validation dataset"
)
@tf.function
def validate_run(dist_inputs):
batch_inputs, batch_labels = dist_inputs
model_outputs = model(batch_inputs)
s1_cls = model_outputs['left_cls_output']
s2_cls = model_outputs['right_cls_output']
s1_mean = model_outputs['left_mean_embeddings']
s2_mean = model_outputs['right_mean_embeddings']
return s1_cls, s2_cls, s1_mean, s2_mean, batch_labels['score']
S1_cls = []
S2_cls = []
S1_mean = []
S2_mean = []
sts_labels = []
# This is a hack to make tqdm print a coloured progress bar
# TODO: fix it.
pbar = tqdm.trange(validation_steps, colour="magenta", unit="batch")
for step_counter in pbar:
dist_inputs = next(validation_dataset_distributed)
s1_cls, s2_cls, s1_mean, s2_mean, batch_scores = strategy.run(
validate_run, args=(dist_inputs,)
)
s1_cls = tf.concat(
trainer.distribution_strategy.experimental_local_results(s1_cls),
axis=0,
)
s2_cls = tf.concat(
trainer.distribution_strategy.experimental_local_results(s2_cls),
axis=0,
)
s1_mean = tf.concat(
trainer.distribution_strategy.experimental_local_results(s1_mean),
axis=0,
)
s2_mean = tf.concat(
trainer.distribution_strategy.experimental_local_results(s2_mean),
axis=0,
)
scores = tf.concat(
trainer.distribution_strategy.experimental_local_results(
batch_scores
),
axis=0,
)
S1_cls.append(s1_cls)
S2_cls.append(s2_cls)
S1_mean.append(s1_mean)
S2_mean.append(s2_mean)
sts_labels.append(scores)
pbar.set_description(
"Callback: Epoch {}/{} --- Step {}/{} ".format(
epoch, epochs, step_counter, validation_steps
)
)
sts_labels = tf.squeeze(tf.concat(sts_labels, axis=0), axis=1)
sentence1_embeddings = tf.concat(S1_cls, axis=0)
sentence2_embeddings = tf.concat(S2_cls, axis=0)
cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
dot_products = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]
eval_pearson_cosine, _ = pearsonr(sts_labels, cosine_scores)
eval_spearman_cosine, _ = spearmanr(sts_labels, cosine_scores)
eval_pearson_manhattan, _ = pearsonr(sts_labels, manhattan_distances)
eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)
eval_pearson_euclidean, _ = pearsonr(sts_labels, euclidean_distances)
eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)
eval_pearson_dot, _ = pearsonr(sts_labels, dot_products)
eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)
metrics_result = {'pearson_cosine_cls': eval_pearson_cosine,
'spearman_cosine_cls': eval_spearman_cosine,
'pearson_manhattan_cls': eval_pearson_manhattan,
'spearman_manhattan_cls': eval_spearman_manhattan,
'pearson_euclidean_cls': eval_pearson_euclidean,
'spearman_euclidean_cls': eval_spearman_euclidean,
'pearson_dot_cls': eval_pearson_dot,
'spearman_dot_cls': eval_spearman_dot}
sentence1_embeddings = tf.concat(S1_mean, axis=0)
sentence2_embeddings = tf.concat(S2_mean, axis=0)
cosine_scores = 1 - (paired_cosine_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy()))
manhattan_distances = -paired_manhattan_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
euclidean_distances = -paired_euclidean_distances(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())
dot_products = [np.dot(emb1, emb2) for emb1, emb2 in zip(sentence1_embeddings.numpy(), sentence2_embeddings.numpy())]
eval_pearson_cosine, _ = pearsonr(sts_labels, cosine_scores)
eval_spearman_cosine, _ = spearmanr(sts_labels, cosine_scores)
eval_pearson_manhattan, _ = pearsonr(sts_labels, manhattan_distances)
eval_spearman_manhattan, _ = spearmanr(sts_labels, manhattan_distances)
eval_pearson_euclidean, _ = pearsonr(sts_labels, euclidean_distances)
eval_spearman_euclidean, _ = spearmanr(sts_labels, euclidean_distances)
eval_pearson_dot, _ = pearsonr(sts_labels, dot_products)
eval_spearman_dot, _ = spearmanr(sts_labels, dot_products)
metrics_result_mean = {'pearson_cosine_mean': eval_pearson_cosine,
'spearman_cosine_mean': eval_spearman_cosine,
'pearson_manhattan_mean': eval_pearson_manhattan,
'spearman_manhattan_mean': eval_spearman_manhattan,
'pearson_euclidean_mean': eval_pearson_euclidean,
'spearman_euclidean_mean': eval_spearman_euclidean,
'pearson_dot_mean': eval_pearson_dot,
'spearman_dot_mean': eval_spearman_dot}
metrics_result.update(metrics_result_mean)
pbar.set_postfix(**metrics_result)
if wandb:
wandb.log(metrics_result, step=step)
return metrics_result
Set Hyperparameters and Configs¶
Set necessary hyperparameters.
Prepare the train dataset and validation dataset.
Load model, optimizer, loss and trainer.
# Model configs
learning_rate = 2e-5
epochs = 3
model_checkpoint_dir = 'MODELS/roberta_quora_embeddings'
# Total train examples
steps_per_epoch = total_train_examples // batch_size
# model
model_fn = get_model(model_name, is_training=True, use_dropout=True)
# optimizer
optimizer_fn = get_optimizer(learning_rate, total_train_examples, batch_size, epochs)
# trainer (multi gpu strategy)
trainer = get_trainer(distribution_strategy='mirrored', num_gpus=2)
# loss
loss_fn = in_batch_negative_loss()
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Train :-)¶
Loss comes down within epoch 1 itself.
Zero-shot evaluation after epoch 1 shows that pearson and spearman correlation increase to about 0.80, a significant improvement over the Roberta base model, where we got 0.43.
Without training on STS-B, we got a good evaluation score on the STS-B dev set using zero-shot evaluation.
sts_callback = STSEvaluationCallback()
history = trainer.run(
model_fn=model_fn,
optimizer_fn=optimizer_fn,
train_dataset=train_dataset,
train_loss_fn=loss_fn,
epochs=epochs,
steps_per_epoch=steps_per_epoch,
model_checkpoint_dir=model_checkpoint_dir,
batch_size=batch_size,
validation_dataset=validation_dataset,
validation_loss_fn=loss_fn,
training_loss_names = ['cls_loss', 'mean_loss'],
validation_loss_names = ['cls_loss', 'mean_loss'],
steps_per_call=10,
callbacks=[sts_callback],
wandb=wandb
)
INFO:absl:Make sure `steps_per_epoch` should be less than or equal to number of batches in dataset.
INFO:absl:Policy: ----> float32
INFO:absl:Strategy: ---> <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f0b5c0287d0>
INFO:absl:Num GPU Devices: ---> 2
INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /home/jovyan/.cache/huggingface/hub/tftransformers__roberta-base-no-mlm.main.9e4aa91ba5936c6ac98586f85c152831e421d0ec/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/roberta-base-no-mlm
INFO:absl:Using linear optimization warmup
INFO:absl:Using Adamw optimizer
INFO:absl:No ❌❌ checkpoint found in MODELS/roberta_quora_embeddings
Train: Epoch 1/4 --- Step 10/3158 --- total examples 0 , trainable variables 199: 0%| | 0/315 [00:00<?, ?batch /s]
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 198 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Train: Epoch 1/4 --- Step 3150/3158 --- total examples 401920 , trainable variables 199: 100%|██████████| 315/315 [27:15<00:00, 5.19s/batch , _runtime=1803, _timestamp=1.65e+9, cls_loss=0.504, learning_rate=1.34e-5, loss=0.493, mean_loss=0.482]
INFO:absl:Model saved at epoch 1 at MODELS/roberta_quora_embeddings/ckpt-1
0%| | 0/12 [00:00<?, ?batch /s]
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Validation: Epoch 1/4 --- Step 0/12 : 8%|▊ | 1/12 [00:11<02:07, 11.57s/batch , cls_loss=2.63, loss=2.8, mean_loss=2.96]
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
Validation: Epoch 1/4 --- Step 11/12 : 100%|██████████| 12/12 [00:24<00:00, 2.07s/batch , cls_loss=1.62, loss=1.65, mean_loss=1.68]
INFO:absl:Validation result at epcoh 1 and global step 3150 is {'cls_loss': 1.6162163, 'mean_loss': 1.6796235, 'loss': 1.6479198}
INFO:absl:Callbacks in progress at epoch end 1 . . . .
Callback: Epoch 1/3 --- Step 11/12 : 100%|██████████| 12/12 [00:21<00:00, 1.76s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8199941582015672, 'spearman_cosine_cls': 0.8220972132455343, 'pearson_manhattan_cls': 0.8184392097896854, 'spearman_manhattan_cls': 0.8164570492108482, 'pearson_euclidean_cls': 0.8190927383440411, 'spearman_euclidean_cls': 0.8172409394315345, 'pearson_dot_cls': 0.7828011052166998, 'spearman_dot_cls': 0.7807641366784325, 'pearson_cosine_mean': 0.8151801531088095, 'spearman_cosine_mean': 0.8162854946579012, 'pearson_manhattan_mean': 0.8145520799669964, 'spearman_manhattan_mean': 0.8123405339144811, 'pearson_euclidean_mean': 0.8148764479876582, 'spearman_euclidean_mean': 0.8132354135356057, 'pearson_dot_mean': 0.7381403305760383, 'spearman_dot_mean': 0.7337835384203213, '_timestamp': 1647999708, '_runtime': 1854} at epoch 1
wandb: WARNING Step must only increase in log calls. Step 1 < 3150; dropping {'pearson_cosine_cls': 0.8199941582015672, 'spearman_cosine_cls': 0.8220972132455343, 'pearson_manhattan_cls': 0.8184392097896854, 'spearman_manhattan_cls': 0.8164570492108482, 'pearson_euclidean_cls': 0.8190927383440411, 'spearman_euclidean_cls': 0.8172409394315345, 'pearson_dot_cls': 0.7828011052166998, 'spearman_dot_cls': 0.7807641366784325, 'pearson_cosine_mean': 0.8151801531088095, 'spearman_cosine_mean': 0.8162854946579012, 'pearson_manhattan_mean': 0.8145520799669964, 'spearman_manhattan_mean': 0.8123405339144811, 'pearson_euclidean_mean': 0.8148764479876582, 'spearman_euclidean_mean': 0.8132354135356057, 'pearson_dot_mean': 0.7381403305760383, 'spearman_dot_mean': 0.7337835384203213, '_timestamp': 1647999708, '_runtime': 1854}.
Train: Epoch 2/4 --- Step 3150/3158 --- total examples 805120 , trainable variables 199: 100%|██████████| 315/315 [26:13<00:00, 5.00s/batch , _runtime=3428, _timestamp=1.65e+9, cls_loss=0.304, learning_rate=6.71e-6, loss=0.305, mean_loss=0.305]
INFO:absl:Model saved at epoch 2 at MODELS/roberta_quora_embeddings/ckpt-2
Validation: Epoch 2/4 --- Step 11/12 : 100%|██████████| 12/12 [00:05<00:00, 2.27batch /s, cls_loss=1.78, loss=1.82, mean_loss=1.85]
INFO:absl:Validation result at epcoh 2 and global step 6300 is {'cls_loss': 1.778288, 'mean_loss': 1.8532048, 'loss': 1.8157464}
INFO:absl:Callbacks in progress at epoch end 2 . . . .
Callback: Epoch 2/3 --- Step 11/12 : 100%|██████████| 12/12 [00:19<00:00, 1.61s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8082112012752523, 'spearman_cosine_cls': 0.8088788767212841, 'pearson_manhattan_cls': 0.7977193919551161, 'spearman_manhattan_cls': 0.79662337716043, 'pearson_euclidean_cls': 0.7982407615058535, 'spearman_euclidean_cls': 0.7970524557483568, 'pearson_dot_cls': 0.7645641878510724, 'spearman_dot_cls': 0.7678639160320804, 'pearson_cosine_mean': 0.8030011391493671, 'spearman_cosine_mean': 0.8044760711917577, 'pearson_manhattan_mean': 0.7959895612836713, 'spearman_manhattan_mean': 0.7952571982816723, 'pearson_euclidean_mean': 0.7974056893147314, 'spearman_euclidean_mean': 0.7970287600024667, 'pearson_dot_mean': 0.7324014153178778, 'spearman_dot_mean': 0.7335963354441554, '_timestamp': 1648001312, '_runtime': 3458} at epoch 2
wandb: WARNING Step must only increase in log calls. Step 2 < 6300; dropping {'pearson_cosine_cls': 0.8082112012752523, 'spearman_cosine_cls': 0.8088788767212841, 'pearson_manhattan_cls': 0.7977193919551161, 'spearman_manhattan_cls': 0.79662337716043, 'pearson_euclidean_cls': 0.7982407615058535, 'spearman_euclidean_cls': 0.7970524557483568, 'pearson_dot_cls': 0.7645641878510724, 'spearman_dot_cls': 0.7678639160320804, 'pearson_cosine_mean': 0.8030011391493671, 'spearman_cosine_mean': 0.8044760711917577, 'pearson_manhattan_mean': 0.7959895612836713, 'spearman_manhattan_mean': 0.7952571982816723, 'pearson_euclidean_mean': 0.7974056893147314, 'spearman_euclidean_mean': 0.7970287600024667, 'pearson_dot_mean': 0.7324014153178778, 'spearman_dot_mean': 0.7335963354441554, '_timestamp': 1648001312, '_runtime': 3458}.
Train: Epoch 3/4 --- Step 3150/3158 --- total examples 1208320 , trainable variables 199: 100%|██████████| 315/315 [26:08<00:00, 4.98s/batch , _runtime=5027, _timestamp=1.65e+9, cls_loss=0.275, learning_rate=6.02e-8, loss=0.278, mean_loss=0.282]
INFO:absl:Model saved at epoch 3 at MODELS/roberta_quora_embeddings/ckpt-3
Validation: Epoch 3/4 --- Step 11/12 : 100%|██████████| 12/12 [00:05<00:00, 2.27batch /s, cls_loss=2.3, loss=2.29, mean_loss=2.28]
INFO:absl:Validation result at epcoh 3 and global step 9450 is {'cls_loss': 2.2950742, 'mean_loss': 2.2835078, 'loss': 2.2892911}
INFO:absl:Callbacks in progress at epoch end 3 . . . .
Callback: Epoch 3/3 --- Step 11/12 : 100%|██████████| 12/12 [00:18<00:00, 1.56s/batch]
INFO:absl:Callback score {'pearson_cosine_cls': 0.8118276970012877, 'spearman_cosine_cls': 0.8110754654257855, 'pearson_manhattan_cls': 0.7893045752002403, 'spearman_manhattan_cls': 0.7901086696302247, 'pearson_euclidean_cls': 0.789695365243242, 'spearman_euclidean_cls': 0.7900715861621009, 'pearson_dot_cls': 0.7764208929832053, 'spearman_dot_cls': 0.7831285760325771, 'pearson_cosine_mean': 0.8074910790403766, 'spearman_cosine_mean': 0.8084473888790257, 'pearson_manhattan_mean': 0.792546459118103, 'spearman_manhattan_mean': 0.794987013041834, 'pearson_euclidean_mean': 0.7943711503130662, 'spearman_euclidean_mean': 0.7970291923871069, 'pearson_dot_mean': 0.7619295041302732, 'spearman_dot_mean': 0.7644860560497375, '_timestamp': 1648002910, '_runtime': 5056} at epoch 3
wandb: WARNING Step must only increase in log calls. Step 3 < 9450; dropping {'pearson_cosine_cls': 0.8118276970012877, 'spearman_cosine_cls': 0.8110754654257855, 'pearson_manhattan_cls': 0.7893045752002403, 'spearman_manhattan_cls': 0.7901086696302247, 'pearson_euclidean_cls': 0.789695365243242, 'spearman_euclidean_cls': 0.7900715861621009, 'pearson_dot_cls': 0.7764208929832053, 'spearman_dot_cls': 0.7831285760325771, 'pearson_cosine_mean': 0.8074910790403766, 'spearman_cosine_mean': 0.8084473888790257, 'pearson_manhattan_mean': 0.792546459118103, 'spearman_manhattan_mean': 0.794987013041834, 'pearson_euclidean_mean': 0.7943711503130662, 'spearman_euclidean_mean': 0.7970291923871069, 'pearson_dot_mean': 0.7619295041302732, 'spearman_dot_mean': 0.7644860560497375, '_timestamp': 1648002910, '_runtime': 5056}.
Visualize the Tensorboard¶
%load_ext tensorboard
%tensorboard --logdir MODELS/roberta_quora_embeddings/logs
Load Trained Model for Testing and Save it as a Serialized Model¶
To get good sentence embeddings, we only need the Roberta model, which was used as the base for Sentence_Embedding_Model.
# Save serialized version of the model
# Note: Ignore checkpoint warnings; they appear because we save the optimizer along with the checkpoint,
# while restoring, we load only the model.
model_fn = get_model(model_name, is_training=False, use_dropout=False)
model = model_fn()
model.load_checkpoint(model_checkpoint_dir)
# Roberta base (model.layers[-1] is Sentence_Embedding_Model )
model = model.layers[-1].model
model.save_transformers_serialized('{}/saved_model/'.format(model_checkpoint_dir))
WARNING:absl:Found untraced functions such as word_embeddings_layer_call_fn, word_embeddings_layer_call_and_return_conditional_losses, type_embeddings_layer_call_fn, type_embeddings_layer_call_and_return_conditional_losses, positional_embeddings_layer_call_fn while saving (showing 5 of 870). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: MODELS/roberta_quora_embeddings/saved_model/assets
INFO:tensorflow:Assets written to: MODELS/roberta_quora_embeddings/saved_model/assets
Model Serialization (Production)¶
Let's see how we can use this model to extract sentence embeddings.
We then print the top-K most similar sentences from our Quora embeddings.
# Load serialized model
loaded = tf.saved_model.load("{}/saved_model/".format(model_checkpoint_dir))
model = loaded.signatures['serving_default']
# Take 100000 sentences from Quora and calculate their embeddings
quora_questions = []
for item in dataset['train']:
quora_questions.extend(item['questions']['text'])
quora_questions = list(set(quora_questions))
quora_questions = quora_questions[:100000] # Take 100000
print("Total sentences {}".format(len(quora_questions)))
# Prepare Dataset
quora_dataset = tf.data.Dataset.from_tensor_slices({'questions': quora_questions})
quora_dataset = quora_dataset.batch(batch_size, drop_remainder=False)
Total sentences 100000
Quora Sentence Embeddings¶
quora_sentence_embeddings = []
for batch_questions in tqdm.tqdm(quora_dataset):
batch_questions = batch_questions['questions'].numpy().tolist()
batch_questions = [q.decode() for q in batch_questions]
# Tokenize
quora_inputs = tokenizer(batch_questions, max_length=max_sequence_length, padding=True, truncation=True, return_tensors='tf')
quora_inputs['input_mask'] = quora_inputs['attention_mask']
quora_inputs['input_type_ids'] = tf.zeros_like(quora_inputs['input_ids'])
del quora_inputs['attention_mask'] # we dont want this
model_outputs = model(**quora_inputs)
quora_sentence_embeddings.append(model_outputs['cls_output'])
# Pack and Normalize
quora_sentence_embeddings = tf.nn.l2_normalize(tf.concat(quora_sentence_embeddings, axis=0), axis=-1)
100%|██████████| 782/782 [03:30<00:00, 3.71it/s]
Most Similar Sentences¶
def most_similar(input_question, top_k=10):
quora_inputs = tokenizer([input_question], max_length=max_sequence_length, padding=True, truncation=True, return_tensors='tf')
quora_inputs['input_mask'] = quora_inputs['attention_mask']
quora_inputs['input_type_ids'] = tf.zeros_like(quora_inputs['input_ids'])
del quora_inputs['attention_mask'] # we dont want this
model_outputs = model(**quora_inputs)
query_vector = model_outputs['cls_output']
query_vector = tf.nn.l2_normalize(query_vector, axis=1)
scores = tf.matmul(query_vector, quora_sentence_embeddings, transpose_b=True)
top_k_values = tf.nn.top_k(scores, k=top_k)
for i in range(top_k):
best_index = top_k_values.indices.numpy()[0][i]
best_prob = top_k_values.values.numpy()[0][i]
print(quora_questions[best_index], '-->', best_prob)
input_question = 'What is the best way to propose a girl?'
most_similar(input_question)
How should I propose a girl? --> 0.9225553
How do I propose to a girl? --> 0.8723614
Which is the most romantic way to propose a girl? --> 0.8557819
How do I propose a girl for sex? --> 0.80544627
How did you propose your girlfriend? --> 0.69494146
What are some of the best and unique ways to propose marriage? --> 0.6611091
How can I propose to my crush? --> 0.64724606
If I want to propose to a girl should I give her hints in advance? --> 0.6309003
What does it take for a man to propose? --> 0.6253518
What is the right time to propose someone ? --> 0.5932445
input_question = 'How can I start learning Deep Learning?'
most_similar(input_question)
What's the most effective way to get started with Deep Learning? --> 0.83670104
How do I learn deep learning in 1 month? --> 0.7423597
Why is deep learning so important in machine learning? --> 0.7327932
Does Quora use Deep Learning? --> 0.7260519
Should a machine learning beginner go straight for deep learning? --> 0.719143
Where should I start for machine learning? --> 0.71324116
How do i get started on machine learning? --> 0.7123989
I am New to Deep Learning. How do I start with Python? --> 0.710938
How do I start learning machine learning? --> 0.7106862
What is deep learning? How is related to AI and machine learning? --> 0.70124084
input_question = 'Best tourist destinations in India'
most_similar(input_question)
What are the must-visit and affordable tourist destinations in India? --> 0.8237094
What is the most overrated tourist destination in India? --> 0.7374817
What is the best sex tourism destination in India? --> 0.73366314
What are the most popular tourist destinations? --> 0.7160436
What are the best destination for a solo traveler in India? --> 0.7078299
What is the best holiday destination? --> 0.675949
Which places I should not visit in India as a Indian? --> 0.6656152
What are the best places to go as a tourist? --> 0.66551954
Which are some best places to visit in India? --> 0.66457677
Which is your best holiday destination? --> 0.6640895
input_question = 'Why classical music is so relaxing?'
most_similar(input_question)
What is your favourite piece of classical music and why? --> 0.75282526
What are the benefits of listening to classical music? --> 0.7361862
Why do some people only listen to classical music? --> 0.7289536
Which music is the best for relaxation? --> 0.6762159
Why is classical music better than most pop music? --> 0.6651089
What are some classical and operant conditioning in education? --> 0.64240026
Classical music in movies? --> 0.6344438
Which classic music is this? --> 0.59156764
Which ones are some of the most soothing tunes composed on a piano? --> 0.57644486
What are the differences between Hindustani classical music and Carnatic music? --> 0.57415533