ruby-on-rails - Efficient way to match 2 hashes with maximum similarity

Reposted. Author: 行者123. Updated: 2023-12-04 15:06:00
[
{"id"=>11486, "opt_ids"=> [3545, 3546, 3548, 3550] },
{"id"=>12624, "opt_ids"=> [3545, 3396, 3548, 3550] },
{"id"=>14588, "opt_ids"=> [3394, 3396, 3397, 3399] },
{"id"=>14589, "opt_ids"=> [3394, 3545, 3398, 3548] },
{"id"=>14590, "opt_ids"=> [3394, 3396, 3397, 3399, 3545, 3547, 3548, 3551, 3653, 3655, 3657, 3660, 3772, 3775, 3777, 3778]},
...
]

Is there an efficient way of finding the 2 ids that have the maximum number of opt_ids in common?

The answer for the example above is

[[11486, 12624], [14588, 14590]]

What I have tried so far:

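  1. Take each hash and compare its opt_ids with the opt_ids of every other hash in the array.
  2. Pair the current hash's id with the id of the hash whose opt_ids match it best.
  3. So I am effectively iterating over the array once for every hash in it, which is O(n^2).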
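
The three steps above can be sketched in Ruby roughly as follows. This is a minimal illustration rather than the asker's actual code: the helper name `best_matches` is hypothetical, and the sample rows are the ones shown in the question.

```ruby
# Sample rows copied from the question; the real array is longer.
records = [
  { "id" => 11486, "opt_ids" => [3545, 3546, 3548, 3550] },
  { "id" => 12624, "opt_ids" => [3545, 3396, 3548, 3550] },
  { "id" => 14588, "opt_ids" => [3394, 3396, 3397, 3399] },
  { "id" => 14589, "opt_ids" => [3394, 3545, 3398, 3548] },
  { "id" => 14590, "opt_ids" => [3394, 3396, 3397, 3399, 3545, 3547, 3548, 3551,
                                 3653, 3655, 3657, 3660, 3772, 3775, 3777, 3778] }
]

# Hypothetical helper: for each record, find the other record that shares
# the most opt_ids with it. Every record is compared with every other
# record, so the number of comparisons grows as O(n^2).
def best_matches(records)
  records.map do |rec|
    others = records.reject { |other| other["id"] == rec["id"] }
    # Array#& is set intersection; its size is the number of shared opt_ids.
    best = others.max_by { |other| (rec["opt_ids"] & other["opt_ids"]).size }
    [rec["id"], best && best["id"]]
  end
end

matches = best_matches(records).to_h
```

Here `matches[11486]` is `12624` and `matches[14588]` is `14590`, which lines up with the pairs given in the question. When two candidates tie, which one `max_by` picks should not be relied on.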

Best answer

sqlfiddle I have put the table here with the same data as above. I have created a view from many different tables.
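
The queries in this answer read from a table named `data` with `emp_id` and `option_id` columns. Assuming that table holds one row per (emp_id, option_id) pair, the array from the question would map onto that layout roughly like this; the helper name `to_rows` is hypothetical and only illustrates the shape of the data.

```ruby
# Two sample rows copied from the question.
records = [
  { "id" => 11486, "opt_ids" => [3545, 3546, 3548, 3550] },
  { "id" => 12624, "opt_ids" => [3545, 3396, 3548, 3550] }
]

# Hypothetical helper: flatten each hash into one row per opt_id,
# assuming the `data` table stores (emp_id, option_id) pairs.
def to_rows(records)
  records.flat_map do |rec|
    rec["opt_ids"].map { |opt| { emp_id: rec["id"], option_id: opt } }
  end
end

rows = to_rows(records)
```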

Do it in SQL; this is what SQL is good at.

Use a self-join to get the number of overlapping options for each pair.

select
  a.emp_id emp_id1,
  b.emp_id emp_id2,
  count(a.option_id) as overlap
from data a
join data b on
  -- Ensure we count each pair only once
  a.emp_id < b.emp_id and
  a.option_id = b.option_id
group by a.emp_id, b.emp_id
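
For readers who want to see what the self-join computes, here is a rough in-memory Ruby analogue. It is not part of the answer, which recommends doing this in the database; the helper name `pair_overlaps` is hypothetical, and it assumes each id appears only once in the array.

```ruby
# Sample rows copied from the question.
records = [
  { "id" => 11486, "opt_ids" => [3545, 3546, 3548, 3550] },
  { "id" => 12624, "opt_ids" => [3545, 3396, 3548, 3550] },
  { "id" => 14588, "opt_ids" => [3394, 3396, 3397, 3399] },
  { "id" => 14589, "opt_ids" => [3394, 3545, 3398, 3548] },
  { "id" => 14590, "opt_ids" => [3394, 3396, 3397, 3399, 3545, 3547, 3548, 3551,
                                 3653, 3655, 3657, 3660, 3772, 3775, 3777, 3778] }
]

# Hypothetical helper mirroring the self-join: count shared opt_ids per id pair.
def pair_overlaps(records)
  # Invert the data: option id => ids that have it (the join on option_id).
  ids_by_option = Hash.new { |hash, key| hash[key] = [] }
  records.each do |rec|
    rec["opt_ids"].uniq.each { |opt| ids_by_option[opt] << rec["id"] }
  end

  # Each pair of ids listed under one option shares that option once.
  # Sorting keeps the smaller id first, like `a.emp_id < b.emp_id`.
  overlaps = Hash.new(0)
  ids_by_option.each_value do |ids|
    ids.sort.combination(2) { |a, b| overlaps[[a, b]] += 1 }
  end
  overlaps
end

overlaps = pair_overlaps(records)
```

As with the inner join, pairs that share no option never appear in the result.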

Then use that query as a CTE to select the pairs with the greatest overlap.

with overlaps as (
  select
    a.emp_id emp_id1,
    b.emp_id emp_id2,
    count(a.option_id) as overlap
  from data a
  join data b on
    a.emp_id < b.emp_id and
    a.option_id = b.option_id
  group by a.emp_id, b.emp_id
)
select *
from overlaps
where overlap = (
  select max(overlap)
  from overlaps
)
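
The final filter keeps every pair whose overlap equals the maximum, so ties are all returned. A small Ruby sketch of the same selection, using a hand-computed subset of the overlap counts for the question's sample rows (the helper name `top_pairs` is hypothetical):

```ruby
# A few overlap counts for the sample rows, keyed by [id1, id2].
# These were worked out by hand from the question's data.
overlaps = {
  [11486, 12624] => 3,
  [11486, 14589] => 2,
  [12624, 14590] => 3,
  [14588, 14590] => 4
}

# Hypothetical helper: return every pair whose count equals the maximum,
# like `where overlap = (select max(overlap) from overlaps)`.
def top_pairs(overlaps)
  return [] if overlaps.empty?

  max = overlaps.values.max
  overlaps.select { |_pair, count| count == max }.keys
end

best = top_pairs(overlaps)
```

For these counts, `best` contains only `[14588, 14590]`, the pair with 4 shared options.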

As long as the table is indexed, this should perform better than pulling all the data out into Ruby.

create index idx_option_emp_ids on data(option_id, emp_id);

Even without the index, it should perform much better than pulling everything into Ruby.

Regarding ruby-on-rails - Efficient way to match 2 hashes with maximum similarity, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66095948/
