python - Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?

Reposted by: 太空宇宙  Updated: 2023-11-04 08:28:02

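I am trying to produce KDE-based PDF estimates for a series of distributions that may not be normally distributed.

I like the way ggplot's stat_density in R seems to pick up every incremental bump in the frequencies, but I cannot replicate this with Python's scipy.stats.gaussian_kde, which seems to oversmooth.

I have set up my R code as follows: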

ggplot(test, aes(x = Val, color = as.factor(Class), group = as.factor(Class))) +
    stat_density(geom = 'line', kernel = 'gaussian',
                 bw = 'nrd0',  # nrd0 = 'Silverman'
                 size = 1, position = 'identity')

My Python code is:

from scipy import stats

# data is the array of observations; flatten it to 1-D before fitting
kde = stats.gaussian_kde(data.ravel())
kde.set_bandwidth(bw_method='silverman')

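The stats documentation (here) says that nrd0 is Silverman's method for choosing the bandwidth.

Based on the code above, I am using the same kernel (Gaussian) and the same bandwidth method (Silverman).

Can anyone explain why the results are so different?

Best answer

There seems to be disagreement about what Silverman's rule is. TL;DR - scipy uses a worse version of the rule that only works well on unimodal, normally distributed data. R uses a better version that is "the best of both worlds" and works "for a wide range of densities".

The scipy docs say Silverman's rule is implemented as: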

def silverman_factor(self):
    return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))

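where d is the number of dimensions (1 in your case) and neff is the effective sample size (the number of points, assuming no weights). So the scipy bandwidth is (n * 3/4) ^ (-1/5), times the standard deviation (which scipy computes separately, from the data covariance).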
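The relationship above can be checked numerically. The following is a minimal sketch, assuming unweighted 1-D data; the sample is synthetic, and the seed and sample size are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic sample (assumption: any unweighted 1-D data would do here)
rng = np.random.default_rng(0)
data = rng.normal(size=200)

kde = stats.gaussian_kde(data, bw_method="silverman")

n, d = data.size, 1
# Same expression as silverman_factor, with neff = n for unweighted data
factor = (n * (d + 2.0) / 4.0) ** (-1.0 / (d + 4))  # (n * 3/4) ** (-1/5)
# The kernel standard deviation is the factor times the sample standard deviation
bandwidth = factor * data.std(ddof=1)

print(kde.factor, factor)
print(np.sqrt(kde.covariance[0, 0]), bandwidth)
```

Both printed pairs should agree, which suggests that the number comparable to R's bw is the factor multiplied by the sample standard deviation, not kde.factor on its own.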

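In contrast, R's stats package docs describe Silverman's method as "0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power". This can also be verified in the R code; typing bw.nrd0 in the console gives: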

function (x) 
{
    if (length(x) < 2L) 
        stop("need at least 2 data points")
    hi <- sd(x)
    if (!(lo <- min(hi, IQR(x)/1.34))) 
        (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    0.9 * lo * length(x)^(-0.2)
}
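For comparison on the Python side, the R function above can be ported fairly directly. This is a sketch, not part of scipy: the name bw_nrd0 is borrowed from R for illustration, and the port assumes that NumPy's default percentile interpolation matches R's default quantile type used by IQR, and that ddof=1 matches sd.

```python
import numpy as np

def bw_nrd0(x):
    """Hypothetical Python port of R's bw.nrd0 (Silverman eqn 3.31)."""
    x = np.asarray(x, dtype=float).ravel()
    if x.size < 2:
        raise ValueError("need at least 2 data points")
    hi = np.std(x, ddof=1)                  # sample standard deviation, like sd(x)
    q75, q25 = np.percentile(x, [75, 25])   # linear interpolation by default
    lo = min(hi, (q75 - q25) / 1.34)
    if lo == 0:
        # Same fallback chain as the R code: sd, then |x[1]|, then 1
        lo = hi or abs(x[0]) or 1.0
    return 0.9 * lo * x.size ** -0.2

print(bw_nrd0([1, 2, 3, 4, 5]))
```

The fallback branch only matters for degenerate inputs such as constant data, where both the standard deviation and the interquartile range are zero.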

Wikipedia, on the other hand, gives "Silverman's rule of thumb" as one of many possible names for the estimator:

1.06 * sigma * n ^ (-1 / 5)

The Wikipedia version is equivalent to the scipy version.
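The equivalence follows from setting d = 1 and neff = n in the scipy factor. A short worked derivation, writing \sigma for the sample standard deviation and h for the resulting bandwidth (both symbols are notation chosen here):

```latex
\[
h_{\text{scipy}}
  = \sigma \left( \frac{n\,(d+2)}{4} \right)^{-\frac{1}{d+4}} \Bigg|_{d=1}
  = \sigma \left( \frac{3n}{4} \right)^{-1/5}
  = \left( \frac{4}{3} \right)^{1/5} \sigma \, n^{-1/5}
  \approx 1.06 \, \sigma \, n^{-1/5}
\]
```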

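All three sources (the scipy docs, Wikipedia, and the R docs) cite the same original reference: Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 48. ISBN 978-0-412-24620-3. Wikipedia and R specifically cite page 48, whereas scipy's docs do not mention a page number. (I have submitted an edit to Wikipedia to update its page reference to p. 45, see below.)

Reading Silverman's text, page 45, equation 3.28 is what the Wikipedia article uses: (4/3) ^ (1/5) * sigma * n ^ (-1/5) ~= 1.06 * sigma * n ^ (-1/5). Scipy uses the same method, rewriting (4/3) ^ (1/5) as the equivalent (3/4) ^ (-1/5). Silverman describes this method: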

While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

The scipy docs reference this weakness, stating:

It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

However, Silverman's text goes on to improve on the method used by scipy, arriving at the method used by R and Stata. On page 48, we get equation 3.31:

h = 0.9 * A * n ^ (-1 / 5)
# A defined on previous page, eqn 3.30
A = min(standard deviation, interquartile range / 1.34)

Silverman describes this method as:

The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

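So Wikipedia and Scipy seem to use a simple version of the estimator proposed by Silverman, one with a known weakness. R and Stata use the better version.

Regarding python - Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55366188/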
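One possible way to get scipy closer to R's result is to compute the eqn 3.31 bandwidth by hand and pass it to gaussian_kde as a scalar bw_method. The sketch below is an illustration under stated assumptions, not the approach of the original answer: the bimodal sample is synthetic, and the inline bandwidth omits bw.nrd0's zero-spread fallback. Since a scalar bw_method is used directly as kde.factor, which then multiplies the sample standard deviation, the target bandwidth has to be divided by that standard deviation first.

```python
import numpy as np
from scipy import stats

# Synthetic bimodal sample (assumption, for illustration only)
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])

n = data.size
sd = data.std(ddof=1)
q75, q25 = np.percentile(data, [75, 25])

h_r = 0.9 * min(sd, (q75 - q25) / 1.34) * n ** -0.2  # eqn 3.31, as in bw.nrd0
h_scipy = (0.75 * n) ** -0.2 * sd                    # eqn 3.28, scipy's 'silverman'

# A scalar bw_method becomes kde.factor, so rescale the bandwidth by sd
kde_r_like = stats.gaussian_kde(data, bw_method=h_r / sd)
kde_silverman = stats.gaussian_kde(data, bw_method="silverman")

print(h_r, h_scipy, h_r / h_scipy)
```

Because min(sd, IQR / 1.34) can never exceed sd, the eqn 3.31 bandwidth is at most 0.9 / (4/3)^(1/5), roughly 0.85, times the eqn 3.28 bandwidth, so scipy's 'silverman' setting always smooths more. Curves from R may still differ slightly from kde_r_like, since stat_density evaluates the density on a grid, so this should be read as a close match and not an exact one.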
