R - Different results from gower.dist and daisy(...,metric ="gower")

Reposted by: 行者123. Updated: 2023-11-30 08:43:55

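I want to compute the distances (dissimilarities) between the rows of two data frames in order to find the closest cluster for each observation. Because I have both factor and numeric variables, I am using the Gower distance. Since I want to compare two data frames (rather than the rows of a single matrix with each other), gower.dist should be the function I need. However, when I implemented it, I noticed that its results differ from what I get with daisy's Gower metric when I bind the rows together and look at the relevant part of the dissimilarity matrix.

I only provide a sample of my data here, but when I compute the dissimilarities on the full data, gower.dist often returns a dissimilarity of zero even though the corresponding rows are not equal. Why? What could be causing the different results? As far as I can tell, daisy's Gower works correctly and gower.dist does not (in this example).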

library(cluster)
library(StatMatch)

# Calculate distance using daisy's gower 
daisyDist <- daisy(rbind(df,cent),metric="gower")
daisyDist <- as.matrix(daisyDist)
daisyDist <- daisyDist[(nrow(df)+1):nrow(daisyDist),1:nrow(df)] #only look at part where rows from df are compared to (rows of) cent

# Calculate distance using gower.dist
gowerDist <- gower.dist(cent,df)

with the following data:

df <- structure(list(searchType = structure(c(NA, 1L, 1L, 1L, 1L), .Label = c("1", "2"), class = "factor"), roomMin = structure(c(4L, 1L, 1L, 6L, 6L), .Label = c("10", "100", "150", "20", "255", "30", "40", "50", "60", "70", "Missing[NoInput]"), class = "factor"), roomMax = structure(c(8L, 8L, NA, 10L, 9L), .Label = c("10", "100", "120", "150", "160", "20", "255", "30", "40", "50", "60", "70", "80", "90", "Missing[NoInput]"), class = "factor"), priceMin = c(NA, 73, 60, 29, 11), priceMax = c(35, 11, 1, 62, 23), sizeMin = structure(c(5L, 5L, 5L, 6L, 6L), .Label = c("100", "125", "150", "250", "50", "75", "Missing[NoInput]"), class = "factor"), sizeMax = structure(c(1L, 6L, 5L, 3L, 1L), .Label = c("100", "125", "150", "250", "50", "75", "Missing[NoInput]"), class = "factor"), longitude = c(6.6306, 7.47195, 8.5562, NA, 8.569), latitude = c(46.52425, 46.9512, 47.37515, NA, 47.3929), specificSearch = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), objectType = structure(c(NA, 2L, 2L, 2L, 2L), .Label = c("1", "2", "3", "Missing[]"), class = "factor")), .Names = c("searchType", "roomMin", "roomMax", "priceMin", "priceMax", "sizeMin", "sizeMax", "longitude", "latitude", "specificSearch", "objectType"), row.names = c(112457L,  94601L, 78273L, 59172L, 117425L), class = "data.frame")                                                                                                                                                                
cent <- structure(list(searchType = structure(c(1L, 1L, 1L), .Label = c("1", "2"), class = "factor"), roomMin = structure(c(1L, 4L, 4L), .Label = c("10", "100", "150", "20", "255", "30", "40", "50", "60", "70", "Missing[NoInput]"), class = "factor"), roomMax = structure(c(6L, 9L, 8L), .Label = c("10", "100", "120", "150", "160", "20", "255", "30", "40", "50", "60", "70", "80", "90", "Missing[NoInput]"), class = "factor"), priceMin = c(60, 33, 73), priceMax = c(103, 46, 23), sizeMin = structure(c(1L, 5L, 5L), .Label = c("100", "125", "150", "250", "50", "75", "Missing[NoInput]"), class = "factor"), sizeMax = structure(c(1L, 2L, 1L), .Label = c("100", "125", "150", "250", "50", "75", "Missing[NoInput]"), class = "factor"), longitude = c(8.3015, 7.42765, 7.6104), latitude = c(47.05485, 46.9469, 46.75125), specificSearch = structure(c(1L, 1L, 1L), .Label = c("0", "1"), class = "factor"), objectType = structure(c(2L, 2L, 2L), .Label = c("1", "2", "3", "Missing[]"), class = "factor")), .Names = c("searchType", "roomMin", "roomMax", "priceMin", "priceMax", "sizeMin", "sizeMax", "longitude", "latitude", "specificSearch", "objectType"), row.names = c(60656L, 66897L, 130650L), class = "data.frame")

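Thanks!

Edit: It seems the error/difference arises because there are NAs in the numeric columns, and the two functions appear to treat them differently. How can I make gower.dist handle NAs the way daisy does?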

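Best answer

This is caused by the NA values in the numeric columns of the data frames. Consider the following code to see how differently the two functions behave for a numeric column that contains NA values (daisy is more robust than gower.dist):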

df1 <- rbind(df,cent)
head(df1)
       searchType roomMin roomMax priceMin priceMax sizeMin sizeMax longitude latitude specificSearch objectType
112457       <NA>      20      30       NA       35      50     100   6.63060 46.52425              0       <NA>
94601           1      10      30       73       11      50      75   7.47195 46.95120              0          2
78273           1      10    <NA>       60        1      50      50   8.55620 47.37515              0          2
59172           1      30      50       29       62      75     150        NA       NA              0          2
117425          1      30      40       11       23      75     100   8.56900 47.39290              0          2
60656           1      10      20       60      103     100     100   8.30150 47.05485              0          2

# only use the numeric column priceMin (4th column) to compute the distance
class(df1[,4])
# [1] "numeric"
df2 <- df1[4]

# daisy output
as.matrix(daisy(df2,metric="gower")) 
        112457     94601     78273      59172    117425     60656      66897    130650
112457      0        NA        NA         NA        NA        NA         NA        NA
94601      NA 0.0000000 0.2096774 0.70967742 1.0000000 0.2096774 0.64516129 0.0000000
78273      NA 0.2096774 0.0000000 0.50000000 0.7903226 0.0000000 0.43548387 0.2096774
59172      NA 0.7096774 0.5000000 0.00000000 0.2903226 0.5000000 0.06451613 0.7096774
117425     NA 1.0000000 0.7903226 0.29032258 0.0000000 0.7903226 0.35483871 1.0000000
60656      NA 0.2096774 0.0000000 0.50000000 0.7903226 0.0000000 0.43548387 0.2096774
66897      NA 0.6451613 0.4354839 0.06451613 0.3548387 0.4354839 0.00000000 0.6451613
130650     NA 0.0000000 0.2096774 0.70967742 1.0000000 0.2096774 0.64516129 0.0000000

# gower.dist output
gower.dist(df2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
[2,]  NaN    0    0    0    0    0    0    0
[3,]  NaN    0    0    0    0    0    0    0
[4,]  NaN    0    0    0    0    0    0    0
[5,]  NaN    0    0    0    0    0    0    0
[6,]  NaN    0    0    0    0    0    0    0
[7,]  NaN    0    0    0    0    0    0    0
[8,]  NaN    0    0    0    0    0    0    0

Use the rngs argument of gower.dist to fix this:

gower.dist(df2, rngs=max(df2, na.rm=TRUE) - min(df2, na.rm=TRUE))
     [,1]      [,2]      [,3]       [,4]      [,5]      [,6]       [,7]      [,8]
[1,]  NaN       NaN       NaN        NaN       NaN       NaN        NaN       NaN
[2,]  NaN 0.0000000 0.2096774 0.70967742 1.0000000 0.2096774 0.64516129 0.0000000
[3,]  NaN 0.2096774 0.0000000 0.50000000 0.7903226 0.0000000 0.43548387 0.2096774
[4,]  NaN 0.7096774 0.5000000 0.00000000 0.2903226 0.5000000 0.06451613 0.7096774
[5,]  NaN 1.0000000 0.7903226 0.29032258 0.0000000 0.7903226 0.35483871 1.0000000
[6,]  NaN 0.2096774 0.0000000 0.50000000 0.7903226 0.0000000 0.43548387 0.2096774
[7,]  NaN 0.6451613 0.4354839 0.06451613 0.3548387 0.4354839 0.00000000 0.6451613
[8,]  NaN 0.0000000 0.2096774 0.70967742 1.0000000 0.2096774 0.64516129 0.0000000
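
The numbers in the two matrices above can be checked by hand in base R. The sketch below does not inspect the internals of gower.dist; it only shows that max() and min() return NA when the column contains an NA and na.rm is not set, and that scaling absolute differences by the NA-aware range reproduces the printed dissimilarities. The vector priceMin is copied from df and cent; the names naive_rng, rng and manual are illustrative.

```r
# priceMin for the eight rows of rbind(df, cent), copied from the data above
priceMin <- c(NA, 73, 60, 29, 11, 60, 33, 73)

# Without na.rm the range is NA, so it cannot be used to scale differences
naive_rng <- max(priceMin) - min(priceMin)
is.na(naive_rng)

# NA-aware range: 73 - 11 = 62
rng <- max(priceMin, na.rm = TRUE) - min(priceMin, na.rm = TRUE)

# Gower dissimilarity for a single numeric variable: |x_i - x_j| / range
manual <- abs(outer(priceMin, priceMin, "-")) / rng
round(manual[2:4, 2:4], 7)
```

For example, rows 2 and 3 give |73 - 60| / 62 = 0.2096774, the value shown in both the daisy output and the corrected gower.dist output. Unlike as.matrix(daisy(...)), this hand computation leaves the diagonal entry of the all-NA row as NA instead of 0.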

So a way to make gower.dist behave like daisy when the numeric variables contain NAs could look like this:

df1 <- rbind(df,cent)

# compute the ranges of the numeric variables correctly
cols <- which(sapply(df1, is.numeric))
rngs <- rep(1, ncol(df1))
rngs[cols] <- sapply(df1[cols], function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)) 

daisyDist <- as.matrix(daisy(df1,metric="gower"))
gowerDist <- gower.dist(df1, rngs=rngs)  # pass the NA-aware ranges computed above

daisyDist
          112457     94601     78273     59172    117425     60656     66897    130650
112457 0.0000000 0.3951059 0.6151851 0.7107843 0.6397059 0.6424374 0.3756990 0.1105551
94601  0.3951059 0.0000000 0.2355126 0.5788530 0.5629176 0.4235379 0.3651002 0.2199324
78273  0.6151851 0.2355126 0.0000000 0.5122549 0.4033046 0.3500130 0.3951874 0.3631533
59172  0.7107843 0.5788530 0.5122549 0.0000000 0.2969639 0.5446623 0.4690421 0.5657812
117425 0.6397059 0.5629176 0.4033046 0.2969639 0.0000000 0.4638003 0.4256891 0.4757460
60656  0.6424374 0.4235379 0.3500130 0.5446623 0.4638003 0.0000000 0.5063082 0.4272755
66897  0.3756990 0.3651002 0.3951874 0.4690421 0.4256891 0.5063082 0.0000000 0.2900150
130650 0.1105551 0.2199324 0.3631533 0.5657812 0.4757460 0.4272755 0.2900150 0.0000000

gowerDist
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]
[1,] 0.0000000 0.3951059 0.6151851 0.7107843 0.6397059 0.6424374 0.3756990 0.1105551
[2,] 0.3951059 0.0000000 0.2355126 0.5788530 0.5629176 0.4235379 0.3651002 0.2199324
[3,] 0.6151851 0.2355126 0.0000000 0.5122549 0.4033046 0.3500130 0.3951874 0.3631533
[4,] 0.7107843 0.5788530 0.5122549 0.0000000 0.2969639 0.5446623 0.4690421 0.5657812
[5,] 0.6397059 0.5629176 0.4033046 0.2969639 0.0000000 0.4638003 0.4256891 0.4757460
[6,] 0.6424374 0.4235379 0.3500130 0.5446623 0.4638003 0.0000000 0.5063082 0.4272755
[7,] 0.3756990 0.3651002 0.3951874 0.4690421 0.4256891 0.5063082 0.0000000 0.2900150
[8,] 0.1105551 0.2199324 0.3631533 0.5657812 0.4757460 0.4272755 0.2900150 0.0000000
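
The question was about comparing the rows of one data frame against the rows of another. As a rough illustration of the NA handling that makes the matrices above agree, here is a hand-rolled base-R sketch. The helper gower_between is hypothetical and is not part of cluster or StatMatch. It assumes every non-numeric column is nominal, pools both data frames to get NA-aware ranges, and skips any variable that is NA in either row. The toy data frames x and y are invented for the example.

```r
# Hypothetical helper (not from cluster or StatMatch): Gower dissimilarity
# between every row of x and every row of y, skipping NA comparisons.
gower_between <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  is_num <- vapply(x, is.numeric, logical(1))
  pooled <- rbind(x, y)
  # NA-aware range of each numeric column, taken over both data frames
  rng <- vapply(pooled[is_num],
                function(v) diff(range(v, na.rm = TRUE)), numeric(1))
  out <- matrix(NA_real_, nrow(x), nrow(y))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(nrow(y))) {
      d <- numeric(0)
      for (k in names(x)) {
        a <- x[[k]][i]
        b <- y[[k]][j]
        if (is.na(a) || is.na(b)) next  # variable contributes nothing
        if (is_num[[k]]) {
          # scaled absolute difference; a constant column contributes 0
          d <- c(d, if (rng[[k]] > 0) abs(a - b) / rng[[k]] else 0)
        } else {
          # nominal variable: 0 if the values match, 1 otherwise
          d <- c(d, as.numeric(as.character(a) != as.character(b)))
        }
      }
      if (length(d) > 0) out[i, j] <- mean(d)  # stays NA if nothing comparable
    }
  }
  out
}

# Toy data, invented for illustration
x <- data.frame(type = factor(c("a", "b")), price = c(10, NA))
y <- data.frame(type = factor(c("a", NA)), price = c(30, 20))
res <- gower_between(x, y)
res
```

Applying the same formula by hand to rows 112457 and 130650 of the data above gives about 0.11056 (eight comparable variables, with priceMax, longitude and latitude contributing 12/102, 0.9798/1.9384 and 0.227/0.86865), in line with the value printed in both matrices. With StatMatch itself, the analogous call would presumably be gower.dist(cent, df, rngs=rngs), with rngs computed on rbind(df, cent) as in the answer.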

Regarding "R - Different results from gower.dist and daisy(...,metric ="gower")", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40264815/
