gpt4 book ai didi

PHP - 相互比较多维子数组并合并相似度阈值

转载 作者:行者123 更新时间:2023-12-04 03:14:01 25 4
gpt4 key购买 nike

简介 - 此问题已于 2018 年 5 月 27 日更新:

我有 1 个 PHP 多维数组,包含 6 个子数组,每个子数组包含 20 个子子数组,每个子数组又包含 2 个子子数组,一个是字符串(header),另一个是未指定数量的关键字(keywords)。

我希望将 120 个子子数组中的每一个与其余 5 个子数组中包含的 100 个其他子子数组进行比较。所以 sub-sub-array1 in sub-array1sub-array 比较1sub-array20 并包含 sub-array2 到并包括子数组6,等等。

如果两个子子数组中的足够多的关键字被认为是相同的并且标题也是相同的,都使用 Levenshtein 距离,子子数组将被合并。


示例脚本

我已经编写了一个脚本来执行此操作,但使用两个单独的数组来演示我的目标:

<?php
// Variable deciding maximum Levenshtein distance between two words. Can be changed to lower / increase threshhold for whether two keywords are deemed identical.
$lev_point_value = 3;

// Variable deciding minimum amount of identical (passed the $lev_point_value variable) keywords needed to merge arrays. Can be changed to lower / increase threshhold for how many keywords two arrays must have in common to be merged.
$merge_tag_value = 4;

// Variable deciding minimum Levenshtein distance between two headers needed to merge arrays. Can be changed to lower / increase threshhold for whether two titles are deemed identical.
$merge_head_value = 22;

// Array1 - A story about a monkey, includes at header and keywords.
$array1 = array (
"header" => "This is a story about a monkey.",
'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing"
));

// Array1 - Another, but slightly different story about a monkey, includes at header and keywords.
$array2 = array (
"header" => "This is another, but different story, about a monkey.",
'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts"
));

// Function comparing keywords between two arrays. Uses levenshtein distance lesser than $lev_point_value. Each pass increases $merged_tag, which is then returned.
function sim_tag_index($array1, $array2, $lev_point_value) {
$merged_tag = 0;
foreach ($array1['keywords'] as $item1){
foreach ($array2["keywords"] as $item2){
if (levenshtein($item1, $item2) <= $lev_point_value) {
$merged_tag++;
};
}
};
return $merged_tag;
}

// Function comparing headers between two arrays using levenshtein distance, which is then returned as $merged_head.
function sim_header_index($array1, $array2) {
$merged_head = (levenshtein($array1['header'], $array2['header']));
return $merged_head;
}

// Function running sim_tag_index against $merge_tag_value, if it passes, then running sim_tag_index against $merge_head_value, if this passes aswell, merge arrays.
function merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value) {
$group = array();
if (sim_tag_index($array1, $array2, $lev_point_value) >= $merge_tag_value) {
if (sim_header_index($array1, $array2) >= $merge_head_value) {
$group = (array_unique(array_merge($array1["keywords"],$array2["keywords"])));
}
}
return $group;
}

// Printing function merge_on_sim.
print_r (merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value));
?>

问题:

我如何扩展或重写我的脚本以遍历多个子子数组,将它们与在其他子数组中找到的所有其他子子数组进行比较,然后合并被视为的子子数组足够相同吗?


多维数组结构

$array = array (
// Sub-array 1
array (
// Story 'Monkey 1' - Has identical sub-sub-arrays 'Monkey 2' and 'Monkey 3' and will be merged with them.
array (
"header" => "This is a story about a monkey.",
'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing")
),
// Story 'Cat 1' - Has identical sub-sub-array 'Cat 2' and will be merged with it.
array (
"header" => "Here's a catarific story about a cat",
'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon")
)
),
// Sub-array 2
array (
// Story 'Monkey 2' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 3' and will be merged with them.
array (
"header" => "This is another, but different story, about a monkey.",
'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts")
),
// Story 'Cat 2' - Has identical sub-sub-array 'Cat 1' and will be merged with it.
array (
"header" => "Here's a different story about a cat",
'keywords' => array( "meauwe", "ball", "cat", "kitten", "claws", "sleep", "fish", "purr")
)
),
// Sub-array 3
array (
// Story 'Monkey 3' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 2' and will be merged with them.
array (
"header" => "This is a third story about a monkey.",
'keywords' => array( "Jungle", "tree", "monkey", "Bonobo", "Fun", "Dance", "climbing", "Coconut", "pretty")
),
// Story 'Fireman 1' - Has no identical sub-sub-arrays and will not be merged.
array (
"header" => "This is a story about a fireman",
'keywords' => array( "fire", "explosion", "burning", "rescue", "happy", "help", "water", "car")
)
)
);

想要的多维数组

$array = array (
// Story 'Monkey 1', 'Monkey 2' and 'Monkey 3' merged.
array (
"header" => array( "This is a story about a monkey.", "This is another, but different story, about a monkey.", "This is a third story about a monkey."),
'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing", "Fun", "Dance", "Cow", "Coconuts", "Jungle", "tree", "pretty")
),
// Story 'Cat 1' and 'Cat 2' merged.
array (
"header" => array( "Here's a catarific story about a cat", "Here's a different story about a cat"),
'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon", "ball", "cat", "kitten", "sleep", "fish", "purr")
)
);

最佳答案

我会试一试的!

我使用 preg_grep 来查找其他子数组中相同的项目。然后我用count看看有多少匹配的关键字。
这就是阈值所在。目前我将它设置为2,这意味着两个匹配的关键字是匹配的。

//flatten array to make it simpler
$new =[];
foreach($array as $subarr){
$new = array_merge($new, $subarr);
}

$threshold = 2;
$merged=[];
foreach($new as $key => $story){
// create regex pattern to find similar items
$words = "/" . implode("|", $story["keywords"]) . "/i";
foreach($new as $key2 => $story2){
// only loop new items and items that has not been merged already
if($key != $key2 && $key2 > $key && !in_array($key2, $merged)){
// If the count of words from preg_grep is above threshold it's mergable
if(count(preg_grep($words, $story2["keywords"])) > $threshold){
// debug
//echo $key . " " . $key2 . "\n";
//echo $story["header"] . " = " . $story2["header"] ."\n\n";

// if the item does not exist create it first to remove notices
if(!isset($res[$key])) $res[$key] = ["header" => [], "keywords" =>[]];

// add headers
$res[$key]["header"][] = $story["header"];
$res[$key]["header"][] = $story2["header"];

// only keep unique
$res[$key]["header"] = array_unique($res[$key]["header"]);

// add keywords and remove duplicates
$res[$key]["keywords"] = array_merge($res[$key]["keywords"], $story["keywords"], $story2["keywords"]);
$res[$key]["keywords"] = array_unique($res[$key]["keywords"]);

// add key2 to merged so that we don't merge this again.
$merged[] = $key2;
}
}
}
}

var_dump($res);

https://3v4l.org/6cKRq

输出是你“想要”的问题。

关于PHP - 相互比较多维子数组并合并相似度阈值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42610181/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com