gpt4 book ai didi

java - 查找所有 "character-equal"字符串的高效算法?

转载 作者:塔克拉玛干 更新时间:2023-11-03 05:47:55 24 4
gpt4 key购买 nike

我们如何编写一个高效的函数来输出输入字符串的“homoglyph equivalents”?

示例 1(伪代码):

homoglyphs_list = [
["o", "0"], // "o" and "0" are homoglyphs
["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
]

input_string = "someinput"
output = [
"someinput", "s0meinput", "somelnput",
"s0melnput", "some1nput", "s0me1nput"
]

示例 2:

homoglyphs_list = [
["rn", "m", "nn"],
]

input_string = "rnn"
output = ["rnn", "rm", "mn", "rrn", "nnn", "nm", "nrn"]

示例 3:

homoglyphs_list = [
["d", "ci", "a"], // "o" and "0" are homoglyphs
["i", "l", "1"] // "i" and "l" and "1" are homoglyphs
]
/*
notice that with the two rules above,
we can infer "d" = "ci" = "a" = "cl" = "c1"
*/

input_string = "pacerier"
output = [
"pacerier", "pacerler", "pacer1er", "pcicerier",
"pcicerler", "pcicer1er", "pclcerier", "pc1cerier",
"pclcerler", "pc1cerler", "pclcer1er", "pc1cer1er",
"pdcerier", "pdcerler", "pdcer1er"
]

注意:输出数组中成员的顺序并不重要,我们可以假设给定的同形映射被认为是正确的(输入不会给我们一个“无限的”循环”)。

我当前的算法有效,但它使用的是原始暴力破解,性能很糟糕。例如。输入带有同形字符 ["rn", "m", "nn"]"mmmmm" 需要 38 秒才能运行:

// This is php code (non-pseudo just so we could test the running time), 
// but the question remains language-agnostic

public function Func($in, Array $mappings){
$out_table = array();
$out_table[$in] = null;
while(true){
$number_of_entries_so_far = count($out_table);
foreach(array_keys($out_table) as $key){
foreach($mappings as $mapping){
foreach($mapping as $value){
for($a=0, $aa=strlen($key); $a<$aa; ++$a){
$pos = strpos($key, $value, $a);
if($pos === false){
continue;
}
foreach($mapping as $equal_value){
if($value === $equal_value){
continue;
}
$out_table[substr($key, 0, $pos) . $equal_value . substr($key, $pos + strlen($value))] = null;
}
}

}
}
}
if($number_of_entries_so_far === count($out_table)){
// it means we have tried bruteforcing but added no additional entries,
// so we can quit now
break;
}
}
return array_keys($out_table);
}

我们如何实现高效(快速)的同形字扩展算法?

最佳答案

您应该从递归实现中获得一些性能,但我没有过多考虑实际性能。这将避免多次循环输出字符串并计算每次迭代的输出。此外,我还添加了一些“缓存”以防止两次计算相同的同形文字,为简单起见,它不会检查映射,但您可以实现它,或者在每次映射更改的调用之前简单地清除缓存(但那是丑陋且容易引入错误)。

代码如下:

cache = {}
def homoglyph(input_string, mappings):
output = []
character = input_string[0]
if input_string in cache:
return cache[input_string]
elif not input_string:
return []
for charset in mappings:
if character in charset:
temp = input_string
sub_homoglyphs = homoglyph(input_string[1:], mappings)
for char in charset:
temp[0] = char
output.append(temp)
#adding the homoglyph character to the rest of the string
for subhomoglyph in sub_homoglyphs:
output.append(char+subhomoglyph)
cache[input_string] = output
return output

(这是用 python 写的,因为我不太精通 php 语法,我没有实际运行过,所以可能会有错误,但逻辑的要点就在那里)

关于java - 查找所有 "character-equal"字符串的高效算法?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/18060037/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com