
CLIP's Vision Transformer image encoder output




I was doing some experiments with CLIP's Vision Transformer image encoder output (clip-ViT-B-32). Since it is a semantic model, it should output almost the same image feature vector for the same scene or image, even under different conditions. But it appears to be very sensitive to illumination and lighting conditions, which makes me wonder: the similarity between the images below is much lower than I expected (surprisingly, it reports only 89.45% similarity).



Why is that? Are there any methods or models that are less sensitive to illumination changes and more semantics-based?



from sentence_transformers import SentenceTransformer, util
#...... (here image is a list of PIL.Image objects to compare)

# Load the CLIP ViT-B/32 image/text encoder
model = SentenceTransformer('clip-ViT-B-32')

# Encode all images into CLIP feature vectors
encoded_image = model.encode(image, batch_size=128, convert_to_tensor=True, show_progress_bar=True)

# Run paraphrase mining on the embeddings. This function compares every image
# against all other images and returns a list of the pairs with the highest
# cosine similarity scores
processed_images = util.paraphrase_mining_embeddings(encoded_image)
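
For reference, util.paraphrase_mining_embeddings returns a plain list of [score, index_a, index_b] entries sorted by decreasing cosine similarity, so the reported percentage comes straight from the top of that list. A minimal sketch of inspecting it (the choice of 5 pairs here is arbitrary):

# Each entry of processed_images is [cosine_score, idx_a, idx_b], ordered
# from the most similar pair of images to the least similar one
for score, idx_a, idx_b in processed_images[:5]:
    print(f"images {idx_a} and {idx_b}: {score * 100:.2f}% cosine similarity")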

[Two attached images of the same scene under different lighting conditions]
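
To sanity-check a single pair such as the two images above without going through paraphrase mining, the two embeddings can also be compared directly with util.cos_sim. A minimal sketch, where the file names scene_day.jpg and scene_dim.jpg are placeholders standing in for the two pictures, not the actual files:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Hypothetical file names; substitute the two images being compared
img_a = Image.open('scene_day.jpg')
img_b = Image.open('scene_dim.jpg')

# Encode both images and compute their cosine similarity directly
emb_a, emb_b = model.encode([img_a, img_b], convert_to_tensor=True)
print(f"cosine similarity: {util.cos_sim(emb_a, emb_b).item() * 100:.2f}%")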

