php - Libpuzzle 索引数百万张图片？-6ren

php - Libpuzzle 索引数百万张图片？

转载作者：IT老高更新时间：2023-10-28 23:49:07

25

4

它是关于来自 Mr. Frank Denis 的 libpuzzle libray for php ( http://libpuzzle.pureftpd.org/project/libpuzzle )。我想了解如何在我的 mysql 数据库中索引和存储数据。 vector的生成是绝对没问题的。

例子:

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
  echo "Pictures are looking similar\n";
} else {
  echo "Pictures are different, distance=$d\n";
}

这对我来说很清楚 - 但现在当我有大量图片 >1.000.000 时我该如何工作？我计算向量并将其与文件名一起存储在数据库中？现在如何找到相似的图片？如果我将每个向量存储在 mysql 中，我必须打开每个记录并使用 puzzle_vector_normalized_distance 函数计算距离。该过程需要很多时间(打开每个数据库条目 - 将其抛出函数，...)

我阅读了 lib puzzle libaray 中的自述文件，发现了以下内容:

Will it work with a database that has millions of pictures?

A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.

Similar signatures share identical “words”, ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.

Indexing through words and positions also makes it easy to split the data into multiple tables and servers.

So yes, the Puzzle library is certainely not incompatible with projects that need to index millions of pictures.

我还找到了关于索引的描述:

------------------------ INDEXING ------------------------

How to quickly find similar pictures, if they are millions of records?

The original paper has a simple, yet efficient answer.

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0 [ b c d e f g h i j k ] found at position 1 [ c d e f g h i j k l ] found at position 2 etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

Here's a very basic sample database schema:

+-----------------------------+
| signatures |
+-----------------------------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
| words |
+--------------------------+
| pos_and_word | fk_sig_id |
+--------------+-----------+

I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambas=9) signatures are 544 bytes long. In order to save storage space, they can be compressed to 1/third of their original size through the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

我认为压缩是错误的方式，因为我必须在比较之前解压缩每个向量。

我现在的问题是 - 处理数百万张图片的方式是什么以及如何以快速有效的方式比较它们。我不明白“向量的切割”如何帮助我解决我的问题。

非常感谢 - 也许我可以在这里找到使用 libpuzzle libaray 的人。

干杯。

最佳答案

那么，让我们看看他们给出的例子并尝试扩展。

假设您有一个表，用于存储与每个图像相关的信息(路径、名称、描述等)。在该表中，您将包含一个用于压缩签名的字段，该字段在您最初填充数据库时计算并存储。让我们这样定义该表:

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

当您最初计算签名时，您还将计算签名中的一些单词:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

现在您可以将这些词放入表中，定义如下:

CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    sig_word TEXT NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    INDEX (image_id, sig_word)
);

现在您插入到该表中，在找到该词的位置索引之前添加，以便您知道何时匹配一个词，它在签名中的相同位置匹配:

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id,
        '$sig_word')"); // figure a suitably defined db abstraction layer...
}

这样你的数据就初始化好了，你可以相对容易地抓取带有匹配词的图像:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) as strength FROM images i JOIN img_sig_words
    isw ON i.image_id = isw.image_id JOIN img_sig_words isw_search ON isw.sig_word =
    isw_search.sig_word AND isw.image_id != isw_search.image_id WHERE
    isw_search.image_id = $image_id GROUP BY i.image_id, i.name, i.description,
    i.file_path, i.url_path, i.signature ORDER BY strength DESC");

您可以通过添加要求最小强度 的HAVING 子句来改进查询，从而进一步减少您的匹配集。

我不保证这是最有效的设置，但它应该大致可以实现您正在寻找的功能。

基本上，以这种方式拆分和存储单词可以让您进行粗略的距离检查，而无需对签名运行专门的函数。

关于php - Libpuzzle 索引数百万张图片？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/9703762/

25

4

0

文章推荐： mysql - 如何在 MySQL LEFT JOIN 中使用别名

文章推荐： php - 如何使用 MySQL 在 Yii2 迁移中实现 AUTO_INCREMENT？

文章推荐： mysql - 将数据从 MySQL 二进制日志流式传输到 Kinesis

文章推荐： mysql - 轮询数据库模式

html - Z 索引 - 滑动过渡重叠有没有办法创建动态 z 索引？
这几天我一直在努力。我一直在自学 CSS，所以对菜鸟好一点。我正在创建一个推荐 slider 。推荐以 3 个 block 显示。我希望前 2 个下降，第 3 个上升。但是当 slider 激活时，无
node.js - 索引.ejs VS 索引.html
我最近开始学习 Nodejs，现在我很困惑我的网络应用程序使用什么，html 还是 ejs (Express)。 Ejs 使用 Express 模块，而 .html 使用 HTML 模块。我的第一个问
sql - 跨两列/数组的 PostgreSQL 搜索/索引(GIN 索引？)
假设我们有一个 PostgreSQL 表contacts，每条记录都有一堆带标签的电子邮件地址(标签和电子邮件对)——其中一个是“主要”。存储方式如下: id 主键电子邮件文本 email_la
Tesseract 索引 >= 0 && 索引 < size_used_ :Error:Assert failed Error
我成功为一种新的tesseract语言编写了traineddata文件，但是当我完成时，我继续收到以下错误: index >= 0 && index = 0 && 索引 < size_used_ :E
python - .loc[索引, 列] 和 .loc[索引][列] 之间有什么区别？
这个问题已经有答案了: How to deal with SettingWithCopyWarning in Pandas (21 个回答) 已关闭 4 年前。假设我有一个像这样的数据框，第一列“密
Android - 从位置 A(索引)检查位置 B(索引)是否在 GridView 布局中与它成对角线，而不管是否接近
如果我有一个位置或行/列同时用于 A 和 B 位置，请检查 B 是否与 A 成对角线？ 1 2 3 4 5 6 7 8 9 例如，我如何检查 5 是否与 7 成对角线？此外，如果我检查 4 是
MongoDB：索引
MongoDB：索引一、创建索引默认情况下，集合中的_id字段就是索引，我们可以通过getIndexes()方法来查看一个集合中的索引 > db.user.getIndexes() [ { "v
MongoDB——索引
一、索引介绍索引是一种用来快速查询数据的数据结构。 B+Tree就是一种常用的数据库索引数据结构，MongoDB采用B+Tree 做索引，索引创建在colletions上。 MongoDB不使用索引
SQLite 索引
我无法决定索引。就像我有下面的查询需要太多时间来执行: select count(rn.NODE_ID) as Count, rnl.[ISO_COUNTRY_CODE] as Cou
MySQL查询优化——索引
我有这些表: CREATE TABLE `cstat` ( `id_cstat` bigint(20) NOT NULL, `lang_code` varchar(3) NOT NULL,
mysql表性能升级(索引
我正在尝试找到一种方法来提高包含 IP 范围的 mysql 表的性能(在高峰时段每秒最多有 500 个 SELECT 查询(!)，所以我有点担心)。我有一个这种结构的表: id smallint(
jquery 索引()
jquery index() 似乎无法识别元素之一，总是说“无法读取未定义的属性‘长度’”这是我的代码。mnumber 是导致问题的原因。我需要 number 和 mnumber 才能跟踪使用鼠标，并
MongoDB 索引
我们有一个包含近 4000 万条记录的 MongoDB 集合。该集合的当前大小为 5GB。此集合中存储的数据包含以下字段: _id: "MongoDB id" userid: "user id" (i
MongoDB 索引
文档说:如果你有多个字段的复合索引，你可以用它来查询字段的开始子集。所以如果你有一个索引一个，乙，丙你可以用它查询一种一个，乙a,b,c 我的问题是，如果我有一个像这样的复合索引一个，乙，丙我可以查询
jQuery .each() 索引？
我正在使用 $('#list option').each(function(){ //do stuff }); 循环列表中的选项。我想知道如何获取当前循环的索引？因为我不想让 var i = 0;循
快速了解MySQL 索引
MySQL索引的建立对于MySQL的高效运行是很重要的，索引可以大大提高MySQL的检索速度。打个比方，如果合理的设计且使用索引的MySQL是一辆兰博基尼的话，那么没有设计和使用索引的MySQL
18、SQLite 索引
SQLite 索引（Index）索引（Index）是一种特殊的查找表，数据库搜索引擎用来加快数据检索。简单地说，索引是一个指向表中数据的指针。一个数据库中的索引与一本书后边的索引是非常相似的。
RavenDB MultiMap 索引
我是 RavenDB 的新手。我正在尝试使用多 map 索引功能，但我不确定这是否是解决我的问题的最佳方法。所以我有三个文件:Unit、Car、People。汽车文件看起来像这样: { Id: "
基于标准的 Excel 索引
我有以下数据，我想根据范围在另一个表中建立索引我想要实现的是，例如，如果三星的销售额为 2500，则折扣为 2%，低于 3000 且高于 1000 我知道它可以通过索引来完成，与多个数组匹配，然后指
SQL 索引 - 这是重叠的吗？
我正在检查并删除 SQL 数据库中的重复和冗余索引。所以如果我有两个相同的索引，我会删除。例如，如果我删除了重叠的索引... 索引1:品牌、型号指标二:品牌、型号、价格我删除索引 1。相同顺

首页

博学

6Ren·AI

商城

php - Libpuzzle 索引数百万张图片？