
OpenAI CLIP: Connecting Text and Images



CLIP is a model that connects text and images. It was pre-trained on 400 million (image, text) pairs for the task of predicting which caption goes with which image. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
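For intuition, that pretraining task is a symmetric contrastive objective: within a batch of N (image, text) pairs, each image must pick out its own caption among the N candidates, and each caption must pick out its own image. Below is a minimal PyTorch sketch of that loss, not the authors' implementation; the paper learns the temperature jointly with the encoders, whereas here it is a fixed constant for brevity, and the random tensors stand in for encoder outputs.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired (image, text) embeddings.

    image_emb, text_emb: [batch, dim] outputs of the image and text encoders.
    Row i of image_emb and row i of text_emb come from the same pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot products below are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] matrix of temperature-scaled pairwise similarities.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits at column i, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))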
CLIP has been evaluated on more than 30 computer vision tasks, including OCR, action recognition in videos, and geo-localization. Zero-shot CLIP is often competitive with a fully supervised baseline; for example, on ImageNet it matches a ResNet-50 trained on the full 1.28M-example training set. Across the eight released models, accuracy improves smoothly with scale.
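As a concrete illustration of the zero-shot setup, here is a short sketch using the open-source clip package released alongside the paper (https://github.com/openai/CLIP). The image path and the candidate class names are placeholders; the "a photo of a ..." phrasing follows the prompt-template idea discussed in the paper.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate categories: zero-shot classification only
# needs the class names, no task-specific training data.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of a car"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate caption.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)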

In this video, I provide a brief overview of CLIP, its pretraining data, and its architecture. We will also look at its zero-shot performance, its robustness to distribution shifts, and how it compares to human performance.

Here is the agenda:

00:00:00 What is OpenAI CLIP?
00:02:09 What is contrastive pretraining? And why?
00:05:20 What dataset was used for contrastive pretraining?
00:06:30 What is the architecture of CLIP models?
00:08:38 How is CLIP used for zero-shot classification?
00:12:02 How does zero-shot CLIP perform compared to an equivalent supervised classifier?
00:17:36 How do CLIP representations perform compared to other ImageNet-trained representations?
00:19:46 CLIP’s robustness to Natural Distribution Shifts
00:21:23 Comparison to Human Performance
00:23:58 Bias
00:27:38 Image classification examples.

For more details, please look at https://arxiv.org/pdf/2103.00020.pdf and https://openai.com/blog/clip/

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. “Learning Transferable Visual Models from Natural Language Supervision.” In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Creating Business Solutions with OpenAI and Chat-GPT