
c# - How can I implement a "phonetical"-like search

Reposted. Author: 太空狗. Updated: 2023-10-29 23:18:39

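I am currently trying to enhance my search algorithm.

For better understanding, here is the current logic behind it:
We have objects in the database with n keywords attached to each. In the database this is solved with two tables (Object and Keyword), where the Keyword table has an FK to Object. When I build my search tree, I create one line-value from all the keywords of an object (i.e. removing umlauts, converting to lowercase, ...). The same conversion routine (NormalizeSearchPattern()) is applied to the search pattern. I support AND searches only, and keywords must be at least 2 characters long!

The search algorithm is currently a variant of fast-reverse-search (this example is not optimized):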

bool IsMatch(string source, string searchPattern)
{
    // example:
    // source: "hello world"
    // searchPattern: "hello you freaky funky world"
    // patterns[]: { "hello", "you", "freaky", "funky", "world" }

    searchPattern = NormalizeSearchPattern(searchPattern);
    var patterns = MagicMethodToSplitPatternIntoPatterns(searchPattern);
    foreach (var pattern in patterns)
    {
        // AND search: every pattern has to occur somewhere in source.
        var success = false;
        var patternLength = pattern.Length;
        var firstChar = pattern[0];
        var secondChar = pattern[1]; // safe: keywords have at least 2 chars

        // Try every possible start position, from the last one backwards.
        var lengthDifference = source.Length - patternLength;
        while (lengthDifference >= 0)
        {
            var start = lengthDifference--;
            if (source[start] != firstChar)
            {
                continue;
            }
            if (source[start + 1] != secondChar)
            {
                continue;
            }

            // The first two chars match; compare the rest of the pattern.
            var l = start + 2;
            var m = 2;
            while (m < patternLength)
            {
                if (source[l] != pattern[m])
                {
                    break;
                }
                l++;
                m++;
            }

            if (m == patternLength)
            {
                success = true;
                break; // one occurrence is enough
            }
        }
        if (!success)
        {
            return false;
        }
    }
    return true;
}

The normalization is done as follows (this example is not optimized):

string RemoveTooShortKeywords(string keywords)
{
    // Repeat until no too-short keyword is left; one pass can miss adjacent matches.
    while (Regex.IsMatch(keywords, TooShortKeywordPattern, RegexOptions.Compiled | RegexOptions.Singleline))
    {
        keywords = Regex.Replace(keywords, TooShortKeywordPattern, " ", RegexOptions.Compiled | RegexOptions.Singleline);
    }

    return keywords;
}

string RemoveNonAlphaDigits(string value)
{
    value = value.ToLower();
    // Transliterate German umlauts and sharp s before stripping non-ASCII characters.
    value = value.Replace("ä", "ae");
    value = value.Replace("ö", "oe");
    value = value.Replace("ü", "ue");
    value = value.Replace("ß", "ss");

    // Everything except a-z, 0-9 and the space becomes a space.
    return Regex.Replace(value, "[^a-z 0-9]", " ", RegexOptions.Compiled | RegexOptions.Singleline);
}

string NormalizeSearchPattern(string searchPattern)
{
    var resultNonAlphaDigits = RemoveNonAlphaDigits(searchPattern);
    var resultTrimmed = RemoveTooShortKeywords(resultNonAlphaDigits);
    return resultTrimmed;
}

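So this is very straightforward, and it is therefore obvious that I can only handle the variations I take care of in NormalizeSearchPattern() (as mentioned above: umlauts, case differences, ...).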
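MagicMethodToSplitPatternIntoPatterns and TooShortKeywordPattern are not shown in the question. As a rough sketch of what the combined normalize-and-split step might look like, the following assumes keywords are separated by spaces and that "too short" means fewer than 2 characters. The class and method names are hypothetical, and it uses ToLowerInvariant() rather than ToLower() so the result does not depend on the current culture.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class NormalizeSketch
{
    // Normalizes a raw search pattern and splits it into keywords.
    public static string[] SplitIntoPatterns(string searchPattern)
    {
        // Same transliteration as RemoveNonAlphaDigits in the question.
        var value = searchPattern.ToLowerInvariant()
            .Replace("ä", "ae")
            .Replace("ö", "oe")
            .Replace("ü", "ue")
            .Replace("ß", "ss");

        // Everything except a-z, 0-9 and the space becomes a space.
        value = Regex.Replace(value, "[^a-z 0-9]", " ");

        // Assumption: keywords are space-separated and need at least 2 chars.
        return value
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(keyword => keyword.Length >= 2)
            .ToArray();
    }
}
```

With this sketch, SplitIntoPatterns("Häuser & Söhne, A 12") yields the keywords haeuser, soehne and 12; the single-character keyword "a" is dropped.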

But how can the search be made tolerant when it comes down to:

  • singular/plural
  • typing errors (e.g. "hauserr" <-> "hauser")
  • ...

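Just to give you some more information about the design:
The application is written in C#. It stores the search tree and the objects in a static variable (the database is queried only once, at initialization), and the performance has to be excellent (currently 500,000 line-values are queried in under 300 ms).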

Best answer

You may also be interested in the Trigram and Bigram search matching algorithm:

Trigram search is a powerful method of searching for text when the exact syntax or spelling of the target object is not precisely known. It finds objects which match the maximum number of three-character strings in the entered search terms, i.e. near matches. A threshold can be specified as a cutoff point, after which a result is no longer regarded as a match.
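To make the quoted idea concrete, here is a minimal sketch of trigram matching in C#. It is only one possible reading of the description above: the class and method names are hypothetical, the similarity measure (shared trigrams divided by all distinct trigrams of both strings) and the default threshold of 0.5 are assumptions, and real implementations often pad the keywords with spaces, which this sketch does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating trigram matching; not taken from the answer.
public static class TrigramSketch
{
    // Returns the set of distinct 3-character substrings of a keyword.
    // Keywords shorter than 3 characters produce an empty set.
    public static HashSet<string> Trigrams(string keyword)
    {
        var result = new HashSet<string>();
        for (var i = 0; i + 3 <= keyword.Length; i++)
        {
            result.Add(keyword.Substring(i, 3));
        }
        return result;
    }

    // Assumed measure: shared trigrams / all distinct trigrams (0.0 to 1.0).
    public static double Similarity(string a, string b)
    {
        var trigramsA = Trigrams(a);
        var trigramsB = Trigrams(b);
        var shared = trigramsA.Count(t => trigramsB.Contains(t));
        var total = trigramsA.Count + trigramsB.Count - shared;
        return total == 0 ? 0.0 : (double)shared / total;
    }

    // The threshold is the cutoff below which a keyword no longer counts as a match.
    public static bool IsNearMatch(string keyword, string pattern, double threshold = 0.5)
    {
        return Similarity(keyword, pattern) >= threshold;
    }
}
```

For the typo from the question, "hauser" and "hauserr" share 4 of 5 distinct trigrams (similarity 0.8); for a singular/plural pair such as "house" and "houses" it is 3 of 4 (0.75). Both pass the assumed threshold, whereas unrelated keywords share no trigrams and score 0. Both inputs should go through the same normalization as the line-values before being compared.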

Regarding "c# - How can I implement a "phonetical"-like search", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4314043/
