gpt4 book ai didi

sql - 在 SQL Server 中如何合并相似的字符串以用于计数目的

转载 作者:行者123 更新时间:2023-12-04 23:47:21 26 4
gpt4 key购买 nike

我构建了一个查询,用于查找列的最长公共(public)子串并按频率对它们进行排序。我遇到的问题是删除/分组相似的结果。

这是下面代码的 TOP 5 输出 - 请注意“I love mittens the cat”是最长、最频繁的字符串,但代码还会找到该字符串的所有子字符串,例如“I love mittens the ca”或“我喜欢连指手套 c”。

    I love Mittens the cat  3
I love Mittens the ca 3
love Mittens the cat 3
love Mittens the ca 3
I love Mittens the c 3

如果可能,我想删除任何与其他具有部分单词的子串相似的子串。第 3 行会很好,因为它都是完整的单词,但是第 4 行和第 5 行应该被删除,因为它们与第 1 行相似。

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE @MaxLength INT = 50 --Maximum Substring Length
DECLARE @Delimeter VARCHAR(5) = ' '
DECLARE @T TABLE
(
ID INT IDENTITY
, chvStrings VARCHAR(64)
)
INSERT INTO @T VALUES
('I like cats'),
('I like dogs'),
('cats are great'),
('look at that cat'),
('I love Mittens the cat'),
('I love Mittens the cat a lot'),
('I love Mittens the cat so much'),
('Dogs are okay, I guess...')

SELECT TOP 10000
SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
COUNT(M.number) AS [Count]
FROM
@T as T
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE
N.number <= LEN(t.chvStrings) - M.number + 1
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1)
GROUP BY
SUBSTRING(T.chvStrings, N.Number, M.Number)
ORDER BY
COUNT(T.chvStrings) DESC,
LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC

最佳答案

我添加了几个额外的过滤器来说明子字符串 N.Number-1 不能包含字母 [a-z0-9],同样子字符串 M.Number+1 不能是 [a-z0-9] ].

这是您需要的吗?修改后的代码:

DECLARE     @MinLength INT          = 5     --Minimum Substring Length
DECLARE @MaxLength INT = 50 --Maximum Substring Length
DECLARE @Delimeter VARCHAR(5) = ' '
DECLARE @T TABLE
(
ID INT IDENTITY
, chvStrings VARCHAR(64)
)
INSERT INTO @T VALUES
('I like cats'),
('I like dogs'),
('cats are great'),
('look at that cat'),
('I love Mittens the cat'),
('I love Mittens the cat a lot'),
('I love Mittens the cat so much'),
('Dogs are okay, I guess...')

SELECT TOP 10000
SUBSTRING(T.chvStrings, N.Number, M.Number) AS Word,
COUNT(M.number) AS [Count]
--SUBSTRING(T.chvStrings,M.Number+1,1)
FROM
@T as T
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN 1 AND LEN(T.chvStrings)) N
CROSS APPLY
(SELECT N.Number
FROM [master]..spt_values as N
WHERE N.type = 'P'
AND N.number BETWEEN @MinLength AND @MaxLength) M
WHERE
N.number <= LEN(t.chvStrings) - M.number + 1
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '% '
AND SUBSTRING(T.chvStrings, N.Number, M.Number) NOT LIKE '%[_]%'
AND (SUBSTRING(T.chvStrings, N.Number,1) = @Delimeter OR N.number = 1)
AND SUBSTRING(T.chvStrings,M.Number+1,1) NOT LIKE '%[a-z0-9]%'
AND SUBSTRING(T.chvStrings,N.Number-1,1) NOT LIKE '%[a-z0-9]%'
GROUP BY
SUBSTRING(T.chvStrings, N.Number, M.Number)
ORDER BY
COUNT(T.chvStrings) DESC,
LEN(SUBSTRING(T.chvStrings, N.Number, M.Number)) DESC

关于sql - 在 SQL Server 中如何合并相似的字符串以用于计数目的,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/37712945/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com