sql - 数据库中的汉明距离/相似性搜索-6ren

sql - 数据库中的汉明距离/相似性搜索

转载作者：行者123 更新时间：2023-12-04 17:55:02

我有一个过程，类似于生成感知哈希的 tineye，这些是 32 位整数。

我打算将来将这些存储在 sql 数据库(可能是 nosql db)中

但是，我对如何根据哈希的相似性检索记录感到困惑。

有任何想法吗？

最佳答案

一种常见的方法(至少对我来说是常见的)是将您的哈希位字符串分成几个块并在这些块上查询以进行精确匹配。这是一个“预过滤”步骤。然后，您可以对返回的结果执行按位汉明距离计算，这应该只是整个数据集的一个较小子集。这可以使用数据文件或 SQL 表来完成。

简单来说:假设您在数据库中有一堆 32 位散列，并且您想找到在“查询”散列的 4 位汉明距离内的每个散列:

创建一个包含四列的表:每列将包含 32 位散列的 8 位(作为字符串或整数)切片，islice 1 到 4。

在 qslice 1 到 4 中以相同的方式对查询哈希进行切片。

查询此表，使得 qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4 中的任何一个.这为您提供了查询哈希的 3 位 ( 4 - 1 ) 内的每个数据库哈希。

对于每个返回的散列，与查询散列成对计算精确的汉明距离(从四个切片重建索引端散列)

步骤 4 中的操作次数应该远少于整个表的完整成对汉明计算。

这种方法首先由 Moses Charikar 描述 afaik在其“simhash”开创性 paper和相应的谷歌 patent :

APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...]

Given bit vectors consisting of d bits each, we choose N = O(n 1/(1+ ) ) random permutations of the bits. For each random permutation σ, we maintain a sorted order O σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O σ examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.

Monika Henziger在她的论文中对此进行了扩展 "Finding near-duplicate web pages: a large-scale evaluation of algorithms" :

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non- overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

这在论文 Detecting Near-Duplicates for Web Crawling 中也有解释。作者:Gurmeet Singh Manku、Arvind Jain 和 Anish Das Sarma:

THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f -bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint)

[...]

Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to “almost a counter” in the sense that (a) quite a few 2 d bit- combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are “almost random”.

Now choose d such that |d − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions. Since |d − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d least-significant bit-positions).

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.

PS:FWIW，这些优秀的大脑中的大多数都在某种程度上或某个时间与谷歌有关。

关于sql - 数据库中的汉明距离/相似性搜索，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/9606492/