gpt4 book ai didi

javascript - Node.js/javascript minhash 模块,为相似的文本输出相似的哈希字符串

转载 作者:搜寻专家 更新时间:2023-11-01 00:48:40 26 4
gpt4 key购买 nike

我正在寻找一个 node.js/Javascript 模块,它将 minhash 算法应用于字符串或更大的文本,并为我返回该文本的“标识”或“特征”字节串或十六进制字符串。如果我将该算法应用于另一个相似的文本字符串,则哈希字符串也应该相似。是否已经存在这样的模块?

到目前为止,我正在研究的模块只能直接比较文本,并直接计算与比较文本的某种数字 jaccard 相似度,但我想为每个文档存储某种哈希字符串,所以如果我有相似的文本,我可以稍后比较字符串的相似性......

本质上,我正在寻找的是这里的这段代码(Java):在 Javascript 中: https://github.com/codelibs/elasticsearch-minhash

例如,对于像这样的字符串: "The quick brown fox jumps over the lazy dog""The quick brown fox jumps over the lazy d"它会为第一句话创建一个散列,例如:

"KV5rsUfZpcZdVojpG8mHLA=="

第二个字符串是这样的:

KV5rsSfZpcGdVojpG8mGLA==

两个哈希字符串差别不大...这就是 minhash 算法的重点,但是,我不知道如何创建类似的哈希字符串..到目前为止我找到的所有库,只能直接比较2 文档并创建相似系数,但它们不会创建文档特有的哈希字符串...与所有算法的相似之处在于,它们为单词标记数组(或带状疱疹)创建散列的 crc32(或类似)散列值。但我仍然不知道他们如何将这些哈希相互比较......

最佳答案

需要 Douglas Duhaime 实现 minhash ,但计算散列值数组的任何其他实现都可以使用相同的方式。

const str1 = "The quick brown fox jumps over the lazy dog";
const str2 = "The quick brown fox jumps over the lazy d";
console.log(str1);
console.log(str2);
var s1 = str1.split(' ');
var s2 = str2.split(' ');

// create a hash for each set of words to compare
// default numPerm is 128 but that gives very long hash
// below 8, almost similar string will give exactly the same hash
var m1 = new Minhash({numPerm: 8});
var m2 = new Minhash({numPerm: 8});

// update each hash
s1.map(function(w) { m1.update(w) });
s2.map(function(w) { m2.update(w) });


// estimate the jaccard similarity between two minhashes
console.log('jaccard similarity:', m1.jaccard(m2));

// Now to convert hashvalues to a string we use a kind of base64
// encode but since hasvalues is an array of 32bits integer we
// have to explode it into a array of 8bits integers first

// for a given int32 returns 4 bytes
function int32ToBytes(num) {
// the hexadecimal representation of the largest 32bits unsigned integer is 0xFFFFFFFF
// the hexadecimal representation of the largest unsigned integer (8bits === a byte) is 0xFF
// so it is possible to think a 32bits uint (unsigned integer) as the concatenation of 4 8bits uint.
// the bitwise & operator is the bitwise AND
// its table of truth is 0 & 0 = 0, 0 & 1 = 0, 1 & 0 = 0 and 1 & 1 = 1
// for instance 8 & 1 <=> 0b111 & 0b001 <=> 0b001 <=> 1

// the same is possible with hex representation:
// 65535 & 255 <=> 0xFFFF & 0x00FF <=> 0x0FF <=> 255
// 65535 & 65280 <=> 0xFFFF & 0xFF00 <=> 0xFF00 <=> 65280
// 255 + 65535 = 65535

// now about the bitwise >> shift operator
// a >> n shift the number a by n bits to the right
// in hex FF is 8bits so `0xFF00 >> 8 = 0xFF`
// this operation is reversible `0xFF << 8 = 0xFF00`

// 0xFFFF needs 16 bits to be represented, as 0xFF00
// but 0xFF only needs 8 bits
// so its possible to split a 16 bits integer into two 8 bits integer this way:
// int16 = (int16 & 0xFF00) >> 8 + (int16 & 0x00FF) >> 0
// no information was lost because we're able to do the reverse operation

// the same principle is used below to encode a 32 bits integer into 4 bytes (8bits integers)
// max uint32 = 0xFFFFFFFF =
// 0xFF << 24 + 0xFF << 16 + 0xFF << 8 + 0xFF << 0



const arr = [
(num & 0xff000000) >> 24,
(num & 0x00ff0000) >> 16,
(num & 0x0000ff00) >> 8,
(num & 0x000000ff)
];
return arr;
}

// tolerant base64 encode of 4 bytes
function Uint8ToString(u8a){
var CHUNK_SZ = 0x8000;
var c = [];
for (var i=0; i < u8a.length; i+=CHUNK_SZ) {
c.push(String.fromCharCode.apply(null, u8a.subarray(i, i+CHUNK_SZ)));
}
return c.join("");
}

// tolerant base64 encode of int32 array
function base64EncodeInt32Array(intArray) {
let str = '';
intArray.forEach((i) => {
var u8 = new Uint8Array(int32ToBytes(i));
var b64encoded = btoa(Uint8ToString(u8));
str += b64encoded;
});

return str;

}

// replace non significant '==' to shorten hash
console.log(base64EncodeInt32Array(m1.hashvalues).replace(/==/g, ''));
console.log(base64EncodeInt32Array(m2.hashvalues).replace(/==/g, ''));
<script src='https://rawgit.com/duhaime/minhash/master/minhash.min.js'></script>

关于javascript - Node.js/javascript minhash 模块,为相似的文本输出相似的哈希字符串,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55251188/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com