gpt4 book ai didi

MySQL 单列 n-gram 拆分和计数

转载 作者:行者123 更新时间:2023-11-29 10:44:56 28 4
gpt4 key购买 nike

给定MySQL中的一列字符串(密码)并给定一个值N,我正在寻找一种sql方式来计算每个n-的频率gram(长度为 n 的子串)。

将代码保留在 MySQL 中很重要,因为在我拥有的其他环境中,它会导致内存溢出。

我同时发现的唯一有效方法是假设字符串的长度有限(合法假设),通过提取不同位置的子字符串分别选择union,然后group bycount,如下所示(对于 13 个字符中的 9 克):

Select 
nueve,
count(*) as density,
avg(location) as avgloc

From
(select
mid(pass, 1, 9) as nueve, 1 as location
from
passdata
where
length(pass) >= 9 and length(pass) <= 13 UNION ALL select
mid(pass, 2, 9), 2 as location
from
passdata
where
length(pass) >= 10 and length(pass) <= 13 UNION ALL select
mid(pass, 3, 9), 3 as location
from
passdata
where
length(pass) >= 11 and length(pass) <= 13 UNION ALL select
mid(pass, 4, 9), 4 as location
from
passdata
where
length(pass) >= 12 and length(pass) <= 13 UNION ALL select
mid(pass, 5, 9), 5 as location
from
passdata
where
length(pass) = 13) as nueves
group by nueve
order by density DESC

结果如下所示:

nueve     density avgloc
123456789 1387 2.4564
234567890 193 2.7306
987654321 141 2.0355
password1 111 1.7748
123123123 92 1.913
liverpool 89 1.618
111111111 86 2.2791

其中 nueve 是 9-gram,密度 是出现次数,avgloc 是字符串中的平均起始位置

有什么改进查询的建议吗?我也对其他 n-gram 做同样的事情。

谢谢!

最佳答案

创建一个表,其中包含从 1 到密码最大长度的所有数字。然后您可以与此连接来获取子字符串位置。

SELECT nueve, COUNT(*) AS density, AVG(location) as avgloc
FROM (
SELECT MID(p.pass, n.num, @N) AS nueve, n.num AS location
FROM passdata AS p
JOIN numbers_table AS n ON LENGTH(p.pass) >= (@N + n.num - 1)
) AS x
GROUP BY nueve
ORDER BY density DESC

关于MySQL 单列 n-gram 拆分和计数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44768057/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com