CCLIP: Conditional Contrastive Language-Image Pre-Training

A Personal Research Project

Research
Deep Learning
Author

Andreas Stöffelbauer

Published

July 5, 2025

1 Introduction

This is a personal research project. So far it’s only an idea, essentially a research proposal.

CLIP is still a state-of-the-art approach for training vision-language encoders. Among the many related models and papers released this year, Meta’s “Perception Encoder”, for example (essentially a scaled-up vanilla CLIP model), shows that the features learned through contrastive alignment between vision and language can surpass those of other (vision-only) pre-training techniques such as DINOv2.

Yet CLIP has weaknesses too. In particular, it is known to have weak object-attribute binding and compositional understanding. I believe this is, to some degree, because CLIP misses a key part of reality by assuming that images and captions can only be similar or not, with nothing in between. In reality, of course, two images can be similar in one aspect despite being different in another (or overall). Let’s call this “conditional similarity”, referring to two images that are similar conditional on a concept.

Here is a simple example:

  • Caption of image A: “a car turning left”
  • Caption of image B: “she’s a left-handed writer”

Clearly, these are two very different images, but they nevertheless encode a shared concept, namely “left”, and so should not simply be treated as a negative pair. Put differently, when conditioned on “left”, they can actually be considered a positive pair. When conditioned on “car”, on the other hand, they would not be considered similar, so this would be a negative conditional pair. We would expect these nuances to be reflected in the image/language embeddings, but CLIP simply does not account for them (the two images would just be considered a negative pair).

As a side note, concepts do not just have to be words. More broadly and simply (for now), think of any n-gram as a potential concept (so even “by the door”, “yellow umbrella”, or “stand left of” could all be concepts, and perhaps you can already see how this might relate to addressing CLIP’s weaknesses).

To summarize, the idea for CCLIP is to learn not just from positive and negative image-caption pairs as in CLIP, but also from conditional pairs (also both positives and negatives). I believe this is a valuable signal and an additional “gradient force” pulling/pushing the embeddings in meaningful directions.

To the best of my knowledge, the ideas that I’m sharing in this proposal have not been tried or published before, so I wanted to write down and formalize my ideas. I think they are simple but elegant and I hope that by doing this, I can contribute to performance and efficiency improvements in the area of vision-language pre-training.

2 Training

(Pre-)training CCLIP would look very similar to CLIP: images are matched with their captions.

For example, consider SigLIP-like pre-training, which essentially reduces the problem to binary classification per pair. The difference in CCLIP is simply that we would have substantially more pairs to learn from per batch once the conditional pairs are added. CLIP alone often requires very large batch sizes to work, whereas I believe CCLIP could work with smaller but well curated (!) batches (more on both creating pairs and “conditioning” the model later).

One thing to emphasize is that, just like in CLIP, every pair still consists of an image and a caption; unlike in CLIP, however, an image and a “wrong” caption can still conditionally match. This is the additional “gradient force” that pulls image A towards caption B if they have a concept in common (e.g. “left”). Indirectly, through that, images A and B will also move closer together, but only along the direction of the shared concept.

The image below shows an example batch and how many (normal and conditional) pairs this allows us to create.

<IMAGE to be added>
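To complement the figure, here is a minimal sketch of how one small batch could be expanded into normal and conditional binary-classification pairs. All data structures and names are hypothetical; this only illustrates the bookkeeping, not the model.

```python
def enumerate_pairs(captions, cond_positives, cond_negatives):
    """Return (image_idx, caption_idx, concept_or_None, label) tuples.

    captions:       list of N captions, where caption k belongs to image k
    cond_positives: list of (img_idx, cap_idx, concept) -- conditionally similar
    cond_negatives: list of (img_idx, cap_idx, concept) -- conditionally dissimilar
    """
    n = len(captions)
    pairs = []
    # normal CLIP/SigLIP pairs: the diagonal is positive, everything else negative
    for i in range(n):
        for j in range(n):
            pairs.append((i, j, None, 1 if i == j else 0))
    # additional conditional pairs: same image/caption indices,
    # but judged only with respect to a concept
    pairs += [(i, j, c, 1) for i, j, c in cond_positives]
    pairs += [(i, j, c, 0) for i, j, c in cond_negatives]
    return pairs

# e.g. with the captions from the introduction:
batch = ["a car turning left", "she's a left-handed writer"]
pairs = enumerate_pairs(
    batch,
    cond_positives=[(0, 1, "left"), (1, 0, "left")],
    cond_negatives=[(0, 1, "car"), (1, 0, "car")],
)
# 4 normal pairs + 4 conditional pairs from a batch of only 2 images
```

Even a batch of two images already yields eight supervised pairs instead of four, and it is the conditional ones that carry the more fine-grained signal.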

3 Architecture & Gating

The overall architecture is the same as in CLIP, with two separate encoder towers, one for the image and one for the text. The difference is that we would also send our concepts through the text encoder (independently of the caption) and use them for conditioning.

The “conditioning” operation itself could be, for example, an element-wise (Hadamard) product between the image (and caption) embeddings and the concept embedding. This would act like a gating mechanism, reinforcing the features relevant to the concept and diminishing the irrelevant ones (this is the key idea here).

It would be computed like this: \[ \operatorname{cosine\_similarity}(x \odot c,\; t \odot c) \]

where \(x\) is the image embedding produced by the vision encoder, while \(c\) (the concept embedding) and \(t\) (the text/caption embedding) are both produced by the text encoder.
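As a concrete illustration, here is a minimal PyTorch sketch of this gated similarity (the function name and shapes are my own assumptions, not a fixed design):

```python
import torch
import torch.nn.functional as F

def conditional_similarity(x: torch.Tensor, t: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Gated cosine similarity between image and caption embeddings.

    x: image embeddings,   shape (B, D), from the vision encoder
    t: caption embeddings, shape (B, D), from the text encoder
    c: concept embeddings, shape (B, D), also from the text encoder
    The concept embedding acts as a gate: dimensions relevant to the concept
    are emphasized and irrelevant ones suppressed before the cosine similarity
    is computed.
    """
    return F.cosine_similarity(x * c, t * c, dim=-1)
```

Note that with \(c\) set to a vector of ones, this reduces to the vanilla CLIP similarity, so the normal (unconditional) pairs could be handled by the same function.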

Visually you can see the architecture below.

<IMAGE to be added>

4 Reframing as “masked captions”

I mentioned that concepts can be any n-grams shared (or not) between two captions. This later helped me realize that it’s actually possible (and I’d argue helpful) to reframe and generalize this to “masked captions”.

The main idea is to find any common structure between two captions, and simply mask the rest. This masked caption then describes what is shared between the two images (or rather an image and a caption since that is what we train on).

For example:

  • Caption of image A: “a car turning left”
  • Caption of image B: “she’s a left-handed writer”
  • Shared masked caption: <|M|> left <|M|>

The advantage is that this would also allow more complex shared structures:

  • Caption of image A: “a grey cat sitting in front of a black dog”
  • Caption of image B: “there is a grey wall in front of the black window”
  • Shared masked caption: <|M|> a grey <|M|> in front of <|M|> black <|M|>

From this mask we can learn that both images depict “something grey in front of something black”, which is (in my opinion) clearly a useful signal from a generalization point of view. At the same time, there is obviously a lot more to think through in detail here, especially when it comes to creating pairs.
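To make this a bit more tangible, here is one possible (purely illustrative) way to build such a shared masked caption: a token-level longest common subsequence, with every unshared stretch collapsed into a mask token. The tokenization and normalization are deliberately naive here and would need more care in practice (e.g. hyphens, stopwords, lemmatization).

```python
def shared_masked_caption(cap_a: str, cap_b: str, mask: str = "<|M|>") -> str:
    """Shared masked caption of two captions via a token-level LCS.

    Tokens on the longest common subsequence are kept; every maximal run
    of unshared tokens (on either side) is collapsed into a single mask.
    """
    a, b = cap_a.lower().split(), cap_b.lower().split()
    # standard LCS dynamic program over the two token sequences
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # walk through both captions, keeping shared tokens and masking the gaps
    out, i, j = [mask], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        else:
            if out[-1] != mask:
                out.append(mask)
            if dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
    if out[-1] != mask:
        out.append(mask)
    return " ".join(out)

print(shared_masked_caption(
    "a grey cat sitting in front of a black dog",
    "there is a grey wall in front of the black window",
))  # -> <|M|> a grey <|M|> in front of <|M|> black <|M|>
```

The first example above (“left” vs. “left-handed”) also shows why normalization matters: without splitting hyphenated words, this naive version would miss the shared concept.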

5 Hard negative mining

Creating positive pairs should be fairly straightforward and could require only the captions. It is also very scalable (which is an important aspect). For example, one approach could be to first use BM25 to find images with a large overlap of concepts/terms, and then create the shared masked caption for each pair (by the way, when I say image here, I sometimes implicitly mean the image’s caption).

It’s also possible to start with an anchor image, extract a concept, then find one (or more) other images with that concept, and finally compute the masked caption (and repeat).

There are probably many more ways, but my general intuition is that this should be done before training starts, i.e. a dataset of batches should be prepared offline. This is not only because curation may require quite a bit of computation, but in particular because we want batches with enough (and meaningful) overlap of concepts/masked captions. If that is not the case (e.g. choosing random images with little in common), you will mostly just train a vanilla CLIP model.
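As a rough sketch of this offline curation step (using the rank_bm25 package as just one possible retrieval choice, and reusing the shared_masked_caption function from the sketch above):

```python
import numpy as np
from rank_bm25 import BM25Okapi  # one of several available BM25 implementations

def build_conditional_pairs(captions: list[str], top_k: int = 5):
    """For each caption, retrieve its BM25 nearest neighbours and derive a
    shared masked caption for every (anchor, neighbour) pair."""
    tokenized = [c.lower().split() for c in captions]
    bm25 = BM25Okapi(tokenized)
    pairs = []
    for i, query in enumerate(tokenized):
        scores = np.asarray(bm25.get_scores(query))
        scores[i] = -np.inf                            # exclude the anchor itself
        for j in np.argsort(scores)[::-1][:top_k]:     # highest-overlap captions
            masked = shared_masked_caption(captions[i], captions[int(j)])
            if masked.replace("<|M|>", "").strip():    # keep only non-trivial overlaps
                pairs.append((i, int(j), masked))
    return pairs
```

These pairs (and their masked captions) could then be grouped into batches with high internal concept overlap, which is exactly the curation argued for above.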

Finding reliable (hard) negatives, on the other hand, is probably much more difficult, because captions usually do not describe their image exhaustively. Simply using a concept that is not shared (i.e. present in only one of the two captions) would be the easiest approach, but there is a high chance of false negatives (a naive version of this heuristic is sketched after the example below).

For example (going back to only concepts here rather than masked captions):

  • Caption of image A: “the ship is out on the sea”
  • Caption of image B: “a cruise ship”
  • Positive (ie shared) concept: “ship”
  • False negative concept: “sea” or “cruise” (each appears in only one caption, but may well be present in both images)
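Here is a deliberately naive sketch of that “unshared concept” heuristic; concept extraction is reduced to plain tokens, which is a simplification on top of a simplification.

```python
def naive_negative_concepts(cap_a: str, cap_b: str) -> set[str]:
    """Concepts appearing in exactly one of the two captions, used as
    candidate conditional negatives.

    Caveat: a concept missing from a caption is not necessarily missing
    from the image (captions are not exhaustive), so some of these will
    be false negatives (e.g. 'sea' or 'cruise' in the example above).
    """
    tokens_a = set(cap_a.lower().split())
    tokens_b = set(cap_b.lower().split())
    return tokens_a.symmetric_difference(tokens_b)
```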

How exactly to solve this I don’t know. I definitely need to think about this (critical) part more.

I do have some more elaborate (i.e. perhaps less scalable) ideas for creating both positive and, in particular, negative pairs with the help of a language model. I include them here just as a reference, since this approach goes against my idea of keeping CCLIP simple and scalable without involving another model (because it would then potentially become some form of distillation or student-teacher setup).

Anyway, more reliable negatives could, for example, be created as follows:

  • Caption: “a ship out on the sea”
    • Find images that contain “ship” (positive conditional) but also “in the port” or “under construction” (they are negative conditionals as they probably are not shared)
  • Caption: “a day on the beach under the sun”
    • Find images that have “beach” (positive conditional) but also “cloudy”, “rain”, or “night” (negative conditionals)

Another approach could be to generate pairs like this:

  • Start with a concept, say “Eiffel Tower”
    • Find other “iconic structures made of metal” (positive conditional), e.g. “The London Eye”, “Tokyo Tower” (positives)
  • Concept “playing football”
    • Find other “playing team sports” (positive conditional), e.g. “playing basketball”, “playing rugby” (positives)

I’m also wondering whether it is perhaps possible to learn only from positive pairs (I believe there are some research papers that have done that, but I probably just don’t understand how at the moment).

6 Conclusion

Imagine you didn’t know what a car is. Could you learn it from a single image? Probably not. If you see a red Ferrari, you might think all cars have that specific color and shape. Only after seeing multiple images of cars can you know what makes a car and what is just noise. That is how the gradient should learn from a batch, too.

Now imagine a single batch containing many shared concepts (found via BM25, say): “black car”, “red car”, “car”, “red apple”, “car turning left”, “arrow pointing left”, “left-handed writer”, “car in front of”, and so on. It will have many common and distinct concepts.

The idea of CCLIP is to learn from all of these partial/conditional similarities within each constructed batch, which should mean that the gradient can make good and nuanced updates (Meta’s paper acknowledged that within-batch learning is crucial, hence the large batch sizes usually required for CLIP). If this works, it could make vision-language pre-training much more efficient while potentially also fixing many of CLIP’s issues, such as object-attribute binding (e.g. “black car”) or compositional understanding (e.g. “in front of”), simply because this is exactly what the model is trained to understand and differentiate.

I think the idea is simple, very much like CLIP itself. It also preserves all of CLIP’s great advantages, like scalability and un/self-supervised learning. There is still a lot more to think through (especially, but not only, hard negative mining). So far this is only a rough idea, and I do not know whether it would work. But I hope I can work on implementing and empirically validating these ideas next.