gpt4 book ai didi

c# - Damerau–Levenshtein 距离算法,禁用删除计数

转载 作者:塔克拉玛干 更新时间:2023-11-03 03:28:47 33 4
gpt4 key购买 nike

如何在 Damerau-Levenshtein 距离算法的实现中禁用删除计数,或者如果已经实现了其他算法,请指出。

示例(禁用删除计数):

string1:你好吗?

string2:怎么样?

距离: 1(对于转置,4 次删除不算)

算法如下:

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;

// Return trivial case - where one is empty
if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
return (string1 ?? "").Length + (string2 ?? "").Length;


// Ensure string2 (inner cycle) is longer_transpositionRow
if (string1.Length > string2.Length)
{
var tmp = string1;
string1 = string2;
string2 = tmp;
}

// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
return string2.Length - string1.Length;

var length1 = string1.Length;
var length2 = string2.Length;

var d = new int[length1 + 1, length2 + 1];

for (var i = 0; i <= d.GetUpperBound(0); i++)
d[i, 0] = i;

for (var i = 0; i <= d.GetUpperBound(1); i++)
d[0, i] = i;

for (var i = 1; i <= d.GetUpperBound(0); i++)
{
var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;

var del = d[im1, j] + 1;
var ins = d[i, jm1] + 1;
var sub = d[im1, jm1] + cost;

//Math.Min is slower than native code
//d[i, j] = Math.Min(del, Math.Min(ins, sub));
d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

if (d[i, j] < minDistance)
minDistance = d[i, j];
}

if (minDistance > threshold)
return int.MaxValue;
}

return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[d.GetUpperBound(0), d.GetUpperBound(1)];
}

最佳答案

 public static int DamerauLevenshteinDistance( string string1
, string string2
, int threshold)
{
// Return trivial case - where they are equal
if (string1.Equals(string2))
return 0;

// Return trivial case - where one is empty
// WRONG FOR YOUR NEEDS:
// if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
// return (string1 ?? "").Length + (string2 ?? "").Length;

//DO IT THIS WAY:
if (String.IsNullOrEmpty(string1))
// First string is empty, so every character of
// String2 has been inserted:
return (string2 ?? "").Length;
if (String.IsNullOrEmpty(string2))
// Second string is empty, so every character of string1
// has been deleted, but you dont count deletions:
return 0;

// DO NOT SWAP THE STRINGS IF YOU WANT TO DEAL WITH INSERTIONS
// IN A DIFFERENT MANNER THEN WITH DELETIONS:
// THE FOLLOWING IS WRONG FOR YOUR NEEDS:
// // Ensure string2 (inner cycle) is longer_transpositionRow
// if (string1.Length > string2.Length)
// {
// var tmp = string1;
// string1 = string2;
// string2 = tmp;
// }

// Return trivial case - where string1 is contained within string2
if (string2.Contains(string1))
//all changes are insertions
return string2.Length - string1.Length;

// REVERSE CASE: STRING2 IS CONTAINED WITHIN STRING1
if (string1.Contains(string2))
//all changes are deletions which you don't count:
return 0;

var length1 = string1.Length;
var length2 = string2.Length;


// PAY ATTENTION TO THIS CHANGE!
// length1+1 rows is way too much! You need only 3 rows (0, 1 and 2)
// read my explanation below the code!
// TOO MUCH ROWS: var d = new int[length1 + 1, length2 + 1];
var d = new int[2, length2 + 1];

// THIS INITIALIZATION COUNTS DELETIONS. YOU DONT WANT IT
// or (var i = 0; i <= d.GetUpperBound(0); i++)
// d[i, 0] = i;

// But you must initiate the first element of each row with 0:
for (var i = 0; i <= 2; i++)
d[i, 0] = 0;


// This initialization counts insertions. You need it, but for
// better consistency of code I call the variable j (not i):
for (var j = 0; j <= d.GetUpperBound(1); j++)
d[0, j] = j;


// Now do the job:
// for (var i = 1; i <= d.GetUpperBound(0); i++)
for (var i = 1; i <= length1; i++)
{
//Here in this for-loop: add "%3" to evey term
// that is used as first index of d!

var im1 = i - 1;
var im2 = i - 2;
var minDistance = threshold;
for (var j = 1; j <= d.GetUpperBound(1); j++)
{
var jm1 = j - 1;
var jm2 = j - 2;
var cost = string1[im1] == string2[jm1] ? 0 : 1;

// DON'T COUNT DELETIONS! var del = d[im1, j] + 1;
var ins = d[i % 3, jm1] + 1;
var sub = d[im1 % 3, jm1] + cost;

// Math.Min is slower than native code
// d[i, j] = Math.Min(del, Math.Min(ins, sub));
// DEL DOES NOT EXIST
// d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
d[i % 3, j] = ins <= sub ? ins : sub;

if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
d[i % 3, j] = Math.Min(d[i % 3, j], d[im2 % 3, jm2] + cost);

if (d[i % 3, j] < minDistance)
minDistance = d[i % 3, j];
}

if (minDistance > threshold)
return int.MaxValue;
}

return d[length1 % 3, d.GetUpperBound(1)] > threshold
? int.MaxValue
: d[length1 % 3, d.GetUpperBound(1)];
}

这是我为什么只需要 3 行的解释:

看这一行:

var d = new int[length1 + 1, length2 + 1];

如果一个字符串的长度为 n,另一个字符串的长度为 m,那么您的代码需要 (n+1)*(m+1) 个整数的空间。每个 Integer 需要 4 Byte。如果您的字符串很长,这会浪费内存。如果两个字符串的长度都是 35.000 字节,那么您将需要超过 4 GB 的内存!

在此代码中,您为 d[i,j] 计算并写入一个新值。为此,您需要从其上邻居 (d[i,jm1])、左邻居 (d[im1,j])、从其左上邻居 (d[im1,jm1]),最后来自其双上双左邻居 (d[im2,jm2])。因此,您只需要实际行和之前 2 行的值。

您永远不需要任何其他行的值。那么为什么要存储它们呢?三行就足够了,我的更改确保您可以使用这 3 行,而不会在任何时候读取任何错误的值。

关于c# - Damerau–Levenshtein 距离算法,禁用删除计数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12027324/

33 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com