
php - Libpuzzle: indexing millions of pictures?

Reposted. Author: IT老高. Updated: 2023-10-28 23:49:07

This is about the libpuzzle library for PHP by Mr. Frank Denis ( http://libpuzzle.pureftpd.org/project/libpuzzle ). I am trying to understand how to index and store the data in my MySQL database. Generating the vectors is no problem at all.

Example:

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
    echo "Pictures are looking similar\n";
} else {
    echo "Pictures are different, distance=$d\n";
}

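That much is clear to me, but how do I work with a large number of pictures (more than 1,000,000)? Do I compute the vector and store it in the database together with the file name? And how do I then find similar pictures? If I store every vector in MySQL, I have to open each record and compute the distance with the puzzle_vector_normalized_distance function. That takes a lot of time (open each database entry, pass it through the function, ...).

I read the README of the libpuzzle library and found the following: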

Will it work with a database that has millions of pictures?

A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.

Similar signatures share identical “words”, ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.

Indexing through words and positions also makes it easy to split the data into multiple tables and servers.

So yes, the Puzzle library is certainely not incompatible with projects that need to index millions of pictures.

I also found this description of indexing:

------------------------ INDEXING ------------------------

How to quickly find similar pictures, if they are millions of records?

The original paper has a simple, yet efficient answer.

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

Here's a very basic sample database schema:

+-----------------------------+
| signatures |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
| words |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+

I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambas=9) signatures are 544 bytes long. In order to save storage space, they can be compressed to 1/third of their original size through the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

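I think compression is the wrong approach, because I would have to uncompress every vector before comparing it.

So my question is: what is the right way to handle millions of pictures, and how do I compare them quickly and efficiently? I don't understand how "cutting the vector" helps with my problem.

Many thanks. Maybe I can find someone here who has worked with the libpuzzle library.

Cheers.

Best answer

So, let's look at the example they give and try to expand on it.

Suppose you have a table that stores the information related to each image (path, name, description, etc.). In that table you include a field for the compressed signature, which is computed and stored when you initially populate the database. Let's define that table like this: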

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you initially compute the signature, you also compute a number of words from it:

// This runs once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10;  // this is K from the example
$wordcnt = 100; // this is N from the example
$limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
for ($i = 0; $i < $limit; $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}
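
As a sanity check, the same loop can be wrapped in a function and run on the 26-letter vector from the README. This is only a sketch: signature_words() is a hypothetical helper name, not a libpuzzle function, and the alphabet string stands in for a real signature.

```php
<?php
// Hypothetical helper (not part of libpuzzle): split a signature string into
// at most $wordcnt overlapping words of $wordlen bytes, keyed by position.
function signature_words(string $cvec, int $wordlen = 10, int $wordcnt = 100): array
{
    $words = [];
    $limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
    for ($i = 0; $i < $limit; $i++) {
        $words[$i] = substr($cvec, $i, $wordlen);
    }
    return $words;
}

// The 26-letter vector from the README stands in for a real signature.
$words = signature_words('abcdefghijklmnopqrstuvwxyz');
echo count($words), "\n"; // 17 words: positions 0 through 16
echo $words[0], "\n";     // abcdefghij
echo $words[16], "\n";    // qrstuvwxyz
```

A 26-byte vector only yields 17 words of length 10, which is why the loop bound takes the minimum of N and the number of positions that actually fit.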

Now you can put those words into a table, defined like this:

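CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    sig_word TEXT NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    -- MySQL needs a prefix length to index a TEXT column; 32 covers "NN__" plus a 10-byte word
    INDEX (image_id, sig_word(32)),
    -- the match query looks rows up by sig_word, so index that side too
    INDEX (sig_word(32), image_id)
);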

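Now you insert into that table, prepending the index of the position where the word was found, so that when a word matches you know it matched at the same position in the signature: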

// The signature, along with all other data, has already been inserted into the
// images table, and $image_id holds the resulting primary key.
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    // $dbobj stands for a suitably defined db abstraction layer; escape or bind
    // both values in real code.
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id, '$sig_word')");
}

With your data initialized this way, you can grab images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches for
$dbobj->query("
    SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    ORDER BY strength DESC
");
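
To make it clearer what the strength column counts, here is a small pure-PHP sketch that computes the same number for a single pair of signatures: how many words they share at identical positions. match_strength() is a hypothetical name used only for illustration, and the sample strings are made up rather than real signatures.

```php
<?php
// Hypothetical helper: count the words two signatures share at the same
// position. This mirrors COUNT(isw.sig_word) for a single pair of images.
function match_strength(string $a, string $b, int $wordlen = 10, int $wordcnt = 100): int
{
    $limit = min($wordcnt, strlen($a) - $wordlen + 1, strlen($b) - $wordlen + 1);
    $strength = 0;
    for ($i = 0; $i < $limit; $i++) {
        if (substr($a, $i, $wordlen) === substr($b, $i, $wordlen)) {
            $strength++;
        }
    }
    return $strength;
}

$base  = 'abcdefghijklmnopqrstuvwxyz';
$close = 'abcdefghijklmnopqrstuvwxyZ'; // differs only in the last byte
$far   = 'zyxwvutsrqponmlkjihgfedcba';

echo match_strength($base, $close), "\n"; // 16: only the word at position 16 differs
echo match_strength($base, $far), "\n";   // 0
```

The database does the same comparison for every stored image at once, because equal (position + word) strings land on the same index entry.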

You could improve the query by adding a HAVING clause that requires a minimum strength, which reduces your matching set further.
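
A minimal sketch of that HAVING refinement follows. The function name similar_images_sql() and the named placeholders :image_id and :min_strength are assumptions made for this example; they are meant for a prepared statement rather than string interpolation.

```php
<?php
// Hypothetical helper: returns the match query with a minimum-strength filter.
// :image_id and :min_strength are named placeholders for a prepared statement.
function similar_images_sql(): string
{
    return <<<SQL
SELECT i.*, COUNT(isw.sig_word) AS strength
FROM images i
JOIN img_sig_words isw ON i.image_id = isw.image_id
JOIN img_sig_words isw_search
  ON isw.sig_word = isw_search.sig_word
 AND isw.image_id != isw_search.image_id
WHERE isw_search.image_id = :image_id
GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
HAVING COUNT(isw.sig_word) >= :min_strength
ORDER BY strength DESC
SQL;
}

echo similar_images_sql(), "\n";
```

With PDO, for instance, the string could be passed to prepare() and run with execute([':image_id' => $image_id, ':min_strength' => $min]). A sensible minimum depends on your data and would have to be tuned by experiment.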

I make no guarantee that this is the most efficient setup, but it should be roughly functional for what you are looking for.

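Basically, splitting and storing the words this way lets you do a rough distance check without having to run a specialized function on the signatures.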
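
If you do want an exact check on the few candidates that survive the word match, an optional second pass could look like the sketch below. refine_candidates() is a hypothetical helper, and the toy distance function exists only so the example runs without the libpuzzle extension installed.

```php
<?php
// Hypothetical helper: keep only the candidates whose exact distance to the
// base signature is below $threshold, sorted by ascending distance.
// $distance is any callable taking two signatures and returning a float.
function refine_candidates(string $base, array $candidates, callable $distance, float $threshold): array
{
    $matches = [];
    foreach ($candidates as $image_id => $signature) {
        $d = $distance($base, $signature);
        if ($d < $threshold) {
            $matches[$image_id] = $d;
        }
    }
    asort($matches); // closest first, keys (image ids) preserved
    return $matches;
}

// Stand-in distance for demonstration only: fraction of differing bytes.
$toy_distance = function (string $a, string $b): float {
    $diff = 0;
    $len = min(strlen($a), strlen($b));
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            $diff++;
        }
    }
    return $len > 0 ? $diff / $len : 1.0;
};

$candidates = [7 => 'aaaaaaaaab', 8 => 'bbbbbbbbbb', 9 => 'aaaaaaaaaa'];
$result = refine_candidates('aaaaaaaaaa', $candidates, $toy_distance, 0.5);
print_r($result); // image 9 (distance 0) first, then image 7; image 8 is dropped
```

With libpuzzle loaded, you would pass 'puzzle_vector_normalized_distance' as the callable and PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD as the threshold. If the signatures are stored compressed, only these few candidates need puzzle_uncompress_cvec(), not the whole table.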

Regarding "php - Libpuzzle: indexing millions of pictures?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9703762/
