What distance measure can I use that factors in order? [closed]

Reposted by: bug小助手 · Updated: 2023-10-24 21:08:57

I have a few lists that contain the same IDs, stored as strings. They are as follows:

list1 = ["1", "2", "3", "4", "5"]
list2 = ["1", "2", "3", "5", "4"]
list3 = ["1", "5", "4", "3", "2"]
list4 = ["4", "2", "5", "3", "1"]

What measure can I use to determine the lists that are closest to each other here in terms of order? Ideally list1 and list2 should be the closest here.

Does the spearman correlation make sense here?

Comments

What, exactly, do you mean by "closest to each other here in terms of order"? What makes one pair of lists closer than another pair? Should ["1", "5", "4", "3", "2"] and ["2", "1", "3", "4", "5"] be considered closer than ["1", "2", "3", "4", "5"] and ["1", "2", "4", "3", "5"]?

You've tagged this levenshtein-distance, but Levenshtein distance has nothing to do with order. Spearman's correlation coefficient isn't applicable either. You don't have two rank variables to correlate, and even if you did, Spearman's correlation coefficient is a correlation coefficient, not a distance metric.

And is list4 supposed to have 2 "2"s and no "1"? What space are these samples supposed to be drawn from?

Thanks @user2357112 for the observations. As mentioned, list1 = ["1", "2", "3", "4", "5"] and list2 = ["1", "2", "3", "5", "4"] have elements 1, 2 and 3 in the same positions, so three of the five elements match by position. That's the order I'm referring to, so they should be deemed very close compared to list3 and list4. Sorry, I wrongly tagged Levenshtein distance. In your opinion, how can I measure how similar the lists are based on element positions, as explained earlier?

So rather than lists being close to each other in the lexicographic order of all lists in whatever the sample space is, you're looking for something that captures some notion of one list's internal order being similar to another list's internal order.

Recommended answers

Edit distance seems to be a good candidate for such a metric.

from typing import List


def calcEditDistance(lhs: List[str], rhs: List[str]) -> int:
    '''
    Dynamic programming
    dp[i][j] = minimum number of operations to convert lhs[0:i] to rhs[0:j]
    '''
    m = len(lhs)
    n = len(rhs)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        dp[i][0] = i

    for j in range(1, n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if lhs[i - 1] == rhs[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j - 1],
                               dp[i - 1][j],
                               dp[i][j - 1]) + 1

    return dp[m][n]


list1 = ["1", "2", "3", "4", "5"]
list2 = ["1", "2", "3", "5", "4"]
list3 = ["1", "5", "4", "3", "2"]
list4 = ["4", "2", "5", "3", "2"]

res = calcEditDistance(list1, list2)
print(f"dis[1, 2] = {res}")

res = calcEditDistance(list1, list3)
print(f"dis[1, 3] = {res}")

res = calcEditDistance(list1, list4)
print(f"dis[1, 4] = {res}")

res = calcEditDistance(list2, list3)
print(f"dis[2, 3] = {res}")

res = calcEditDistance(list2, list4)
print(f"dis[2, 4] = {res}")

res = calcEditDistance(list3, list4)
print(f"dis[3, 4] = {res}")

prints

dis[1, 2] = 2
dis[1, 3] = 4
dis[1, 4] = 4
dis[2, 3] = 4
dis[2, 4] = 4
dis[3, 4] = 3

which matches your intuition.

Note that in the Python code I use the Levenshtein distance where insert, delete, and replace operations are allowed. You can, of course, use other types of edit distance.
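
As one illustration of "other types of edit distance", here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which additionally counts a swap of two adjacent elements as a single operation. The function name osa_distance is hypothetical and not part of the answer's code; whether a swap should cost one operation is an assumption about what "close in order" means for your data.

```python
from typing import List


def osa_distance(lhs: List[str], rhs: List[str]) -> int:
    """Edit distance allowing insert, delete, replace, and adjacent swap.

    dp[i][j] = minimum number of operations to convert lhs[0:i] to rhs[0:j]
    """
    m, n = len(lhs), len(rhs)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Converting to or from an empty prefix takes one operation per element.
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if lhs[i - 1] == rhs[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete
                dp[i][j - 1] + 1,         # insert
                dp[i - 1][j - 1] + cost,  # keep or replace
            )
            # Adjacent swap: the last two elements appear in reversed order.
            if (i > 1 and j > 1
                    and lhs[i - 1] == rhs[j - 2]
                    and lhs[i - 2] == rhs[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)

    return dp[m][n]


list1 = ["1", "2", "3", "4", "5"]
list2 = ["1", "2", "3", "5", "4"]
print(f"osa[1, 2] = {osa_distance(list1, list2)}")
```

Under this variant list1 and list2 are a single swap apart, so their distance drops from 2 to 1.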

The comments have clarified that you're looking for something where two lists are closer together the more elements they have in the same positions. In that case, just count how many elements they have in different positions:

def distance(l1, l2):
    return sum(1 for i, j in zip(l1, l2) if i != j)
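
A possible usage sketch, applying that function to the four lists as given in the question and picking the closest pair. The names lists, pairwise and closest are illustrative, and the lists are assumed to have equal length, since zip() silently stops at the shorter one.

```python
from itertools import combinations


def distance(l1, l2):
    # Count the positions at which the two lists hold different elements.
    # zip() stops at the shorter list, so equal lengths are assumed.
    return sum(1 for i, j in zip(l1, l2) if i != j)


lists = {
    "list1": ["1", "2", "3", "4", "5"],
    "list2": ["1", "2", "3", "5", "4"],
    "list3": ["1", "5", "4", "3", "2"],
    "list4": ["4", "2", "5", "3", "1"],
}

# Distance for every unordered pair of lists.
pairwise = {
    (a, b): distance(lists[a], lists[b])
    for a, b in combinations(lists, 2)
}

for (a, b), d in pairwise.items():
    print(f"{a} vs {b}: {d}")

# The pair with the fewest mismatched positions.
closest = min(pairwise, key=pairwise.get)
print("closest pair:", closest)
```

With these inputs list1 and list2 differ in two positions and every other pair differs in four, so list1 and list2 come out as the closest pair.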
