---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.13.5
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Benchmark CLIP

* This notebook measures the performance of the CLIP image encoder on the ImageNetV2 dataset using ```PyTorch``` and ```TensorFlow``` (via ```tf-transformers```).

### PyTorch CLIP Model

* With ```batch_size=64```, the model takes ```54 seconds``` to process ```~10k``` images.

```{code-cell} ipython3
! pip install git+https://github.com/modestyachts/ImageNetV2_pytorch

import torch
import clip
from tqdm.notebook import tqdm
from imagenetv2_pytorch import ImageNetV2Dataset

# Load the CLIP model together with its image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")
device = torch.device('cuda')
model.to(device)
model.eval()

# ImageNetV2, preprocessed with CLIP's own transform
dataset = ImageNetV2Dataset(transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=2)

with torch.no_grad():
    for i, (images, target) in enumerate(tqdm(loader)):
        images = images.to(device)
        image_features = model.encode_image(images)
        # L2-normalize the embeddings
        image_features /= image_features.norm(dim=-1, keepdim=True)
```

### Load CLIP Model (tf-transformers)

```{code-cell} ipython3
from tf_transformers.models.clip import CLIPModel
import tensorflow as tf
import tqdm

# Load the model and pull out the text and image encoders
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", return_layer=True)
text_encoder = model.text_encoder
image_encoder = model.image_encoder
```

### TF with CLIP preprocess (PyTorch data to TF)

* With ```batch_size=64```, the model takes ```54 seconds``` to process ```~10k``` images.

```{code-cell} ipython3
dataset = ImageNetV2Dataset(transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=2)

for i, (images, target) in tqdm.tqdm(enumerate(loader)):
    # Convert each PyTorch batch to TF and move channels last (NCHW -> NHWC)
    images = {'input_pixels': tf.transpose(tf.convert_to_tensor(images.numpy()), [0, 2, 3, 1])}
    outputs = image_encoder(images)
```
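* The timings quoted in this notebook are wall-clock times for one full pass over the data. The cell below is a minimal sketch of how such a number can be measured; the notebook's original instrumentation is not shown, so the timing code here is an illustration.

```{code-cell} ipython3
import time

# Wall-clock timing of the PyTorch-data -> TF-encoder loop above (a sketch,
# not the notebook's original instrumentation). Accelerator execution can be
# asynchronous, so the final .numpy() call forces pending work to finish
# before the clock stops.
start = time.perf_counter()
for batch, _ in tqdm.tqdm(loader):
    features = image_encoder({'input_pixels': tf.transpose(
        tf.convert_to_tensor(batch.numpy()), [0, 2, 3, 1])})['cls_output']
_ = features.numpy()
print(f"elapsed: {time.perf_counter() - start:.1f} seconds")
```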
### TF with tf.io preprocess (preprocess on the fly)

* With ```batch_size=64```, the model takes ```17 seconds``` to process ```~10k``` images, roughly ```3x``` faster than the pipelines above.

```{code-cell} ipython3
img_height = 224
img_width = 224
rescaler = tf.keras.layers.Rescaling(scale=1.0/255.0)

# CLIP's per-channel normalization constants. Note that these are standard
# deviations, not variances: pixel values are divided by them directly.
mean = [0.48145466, 0.4578275, 0.40821073]
std = [0.26862954, 0.26130258, 0.27577711]

def standardize(image_data):
    image_data -= tf.constant(mean)
    image_data /= tf.constant(std)
    return image_data

def read_process_resize(image_path: str):
    """Read, decode, resize and normalize one image."""
    # NOTE: tf.image.resize defaults to bilinear interpolation, while CLIP's
    # reference preprocess uses a bicubic resize plus center crop, so pixel
    # values (and features) can differ slightly between the two pipelines.
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [img_height, img_width])
    result = {}
    result['image_path'] = image_path
    result['input_pixels'] = standardize(rescaler(img))
    result['label'] = tf.strings.split(image_path, '/')[2]  # class-id folder name (string)
    return result

image_files = tf.constant(tf.io.gfile.glob("Imagenet/imagenetv2-matched-frequency-format-val/*/*.jpeg"))
image_dataset = tf.data.Dataset.from_tensor_slices(image_files)
image_dataset = image_dataset.map(read_process_resize, num_parallel_calls=tf.data.AUTOTUNE)
batch_size = 64
image_dataset = image_dataset.batch(batch_size, drop_remainder=False)

for (index, item) in tqdm.tqdm(enumerate(image_dataset)):
    image_features = image_encoder(item)['cls_output']
    image_features = tf.nn.l2_normalize(image_features, axis=-1)
```
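* A standard follow-up optimization for the ```tf.data``` pipeline above, not reflected in the numbers quoted here, is prefetching: it overlaps JPEG decoding and resizing with the encoder's forward pass. A minimal sketch of the same pipeline with ```prefetch``` added:

```{code-cell} ipython3
# Identical pipeline with prefetching. `prefetch` keeps a buffer of ready
# batches while the encoder is busy; AUTOTUNE lets tf.data pick the buffer size.
image_dataset = (
    tf.data.Dataset.from_tensor_slices(image_files)
    .map(read_process_resize, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size, drop_remainder=False)
    .prefetch(tf.data.AUTOTUNE)
)
```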
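* Since the two frameworks are compared on speed, it is also worth sanity-checking that they produce (nearly) the same embeddings for the same input. The cell below is a sketch under two assumptions: the PyTorch CLIP model is kept in a separate variable (called ```model_pt``` here; in this notebook ```model``` is rebound to the tf-transformers model), and ```dataset``` is the CLIP-preprocessed ```ImageNetV2Dataset``` from the PyTorch section.

```{code-cell} ipython3
# Hypothetical parity check (not part of the original benchmark): encode one
# CLIP-preprocessed image with both encoders and compare the embeddings.
img, _ = dataset[0]  # (3, 224, 224) tensor from CLIP's preprocess
with torch.no_grad():
    feat_pt = model_pt.encode_image(img.unsqueeze(0).cuda())
    feat_pt = feat_pt / feat_pt.norm(dim=-1, keepdim=True)

feat_tf = image_encoder({'input_pixels': tf.transpose(
    tf.convert_to_tensor(img.unsqueeze(0).numpy()), [0, 2, 3, 1])})['cls_output']
feat_tf = tf.nn.l2_normalize(feat_tf, axis=-1)

# Both vectors are L2-normalized, so their dot product is the cosine similarity.
cos = float((feat_pt.cpu().numpy() * feat_tf.numpy()).sum())
print(f"cosine similarity: {cos:.4f}")
```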