I was doing some experiments with CLIP's visual transformer encoder output (clip-ViT-B-32). Given the same scene or image, it should output nearly the same image feature vector, since it is a semantic model. But it turns out to be very sensitive to illumination and lighting conditions, which surprises me: the similarity between the images below is much lower than I expected (it reports only 89.45% similarity).
Why is that? Are there any methods/models that are less sensitive to illumination changes and more semantics-based?
from sentence_transformers import SentenceTransformer, util
#......
model = SentenceTransformer('clip-ViT-B-32')
# `image` is a list of PIL.Image objects loaded elsewhere
encoded_image = model.encode(image, batch_size=128, convert_to_tensor=True, show_progress_bar=True)
# Now we run the mining step. This function compares each image against
# all other images and returns a list of pairs with the highest
# cosine similarity scores.
processed_images = util.paraphrase_mining_embeddings(encoded_image)
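For reference, the "percentage" figure is just the cosine similarity of the two embedding vectors scaled to 100. A minimal sketch with dummy 512-dimensional vectors standing in for `model.encode` output (clip-ViT-B-32 embeddings are 512-dimensional; the perturbation here is only a stand-in for the effect of a lighting change, not a model of it):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy embeddings in place of real model.encode(image) output.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(512)
# A perturbed copy of emb_a, loosely standing in for the same scene re-encoded
# under different conditions.
emb_b = emb_a + 0.5 * rng.standard_normal(512)

print(f"identical:  {100 * cosine_similarity(emb_a, emb_a):.2f}%")
print(f"perturbed:  {100 * cosine_similarity(emb_a, emb_b):.2f}%")
```

Even a modest perturbation of the vector pulls the score noticeably below 100%, which is why two photos of the same scene can land near 89% rather than ~100%.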