Sometimes when I create a histogram, using say seaborn's distplot function with norm_hist=True, the y-axis stays below 1, as I would expect for a PDF. Other times it takes on values greater than one.
For example, if I run:
import numpy as np
import seaborn as sns
sns.set()
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
Recommended answer
The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can be quite large even though their areas sum to one. The height of a bar times its width is the probability that a value falls in that range. For the height to equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
np.random.seed(2023)
a = np.random.normal(0, 0.01, 100000)
sns.histplot(a, bins=np.arange(-0.04, 0.04, 0.001), stat='density', ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[1][40].set_color('r')
a *= 1000
sns.histplot(a, bins=np.arange(-40, 40, 1), stat='density', ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[1][40].set_color('r')
plt.show()
The plot on the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot on the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because the bin width is 1.
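The unit change can also be checked numerically without plotting. The following is a minimal sketch, assuming NumPy's `np.histogram` with `density=True` as a stand-in for the plotted histograms; the variable names and the use of `np.linspace` for the bin edges are choices made for this illustration, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(2023)
a_m = rng.normal(0, 0.01, 100_000)       # values in meters
a_mm = a_m * 1000                         # the same values in millimeters

# Same bins in both units: 80 bins covering -0.04 m .. 0.04 m
edges_m = np.linspace(-0.04, 0.04, 81)    # each bin 0.001 m wide
edges_mm = edges_m * 1000                 # each bin 1 mm wide

# density=True scales the heights so that the total bar area is 1
dens_m, _ = np.histogram(a_m, bins=edges_m, density=True)
dens_mm, _ = np.histogram(a_mm, bins=edges_mm, density=True)

# Probability of landing in a bin = height * width
prob_m = dens_m * np.diff(edges_m)
prob_mm = dens_mm * np.diff(edges_mm)

print("max density (m, mm):    ", dens_m.max(), dens_mm.max())
print("max probability (m, mm):", prob_m.max(), prob_mm.max())
print("total area (m, mm):     ", prob_m.sum(), prob_mm.sum())
```

The density heights differ by a factor of 1000 between the two unit systems, while the per-bin probabilities and the total area stay the same.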
As an example of a distribution whose probability density function has regions larger than 1, see the Pareto distribution with α = 3.
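To make the Pareto case concrete, here is a small sketch that evaluates the density by hand. It assumes a scale (minimum value) of x_m = 1, which is not stated above; the function name `pareto_pdf` and the integration grid are hypothetical choices for this illustration.

```python
import numpy as np

def pareto_pdf(x, alpha=3.0, x_m=1.0):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0.

    x_m = 1 is an assumption made for this sketch.
    """
    x = np.asarray(x, dtype=float)
    return np.where(x >= x_m, alpha * x_m**alpha / x**(alpha + 1), 0.0)

# Density at the minimum value x_m: alpha / x_m, which is 3 here
peak = float(pareto_pdf(1.0))

# Approximate the area under the curve on [1, 200] with the trapezoid rule
x = np.linspace(1.0, 200.0, 200_001)
y = pareto_pdf(x)
area = float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

print(f"pdf at x_m: {peak}")
print(f"area on [1, 200]: {area}")
```

The density starts at 3, well above 1, yet the area under it is still (very nearly) 1 because the region where the density exceeds 1 is narrow.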
By directly using plt.hist, which returns the bin heights and edges, the area can easily be calculated.
np.random.seed(2023)
a = np.random.normal(0, 0.01, 100000)
v = plt.hist(a, bins=np.arange(-0.04, 0.04, 0.001), density=True, ec='k')
left = v[1][:-1]
right = v[1][1:]
area = (v[0] * (right-left)).sum()
print(f'Area: {area}')
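As for showing probability on the y-axis: one option is to weight every observation by 1/N, so that each bar's height is the fraction of observations in that bin. The sketch below uses `np.histogram` so it runs without a plotting backend; the same `weights` argument can be passed to `plt.hist`, and, as far as I know, `sns.histplot` offers `stat='probability'` for the same purpose. The variable names are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2023)
a = rng.normal(0, 0.01, 100_000)
edges = np.arange(-0.04, 0.04, 0.001)

# Each observation contributes 1/N, so a bar's height is the
# fraction of all observations that fall into that bin.
weights = np.full(a.shape, 1 / a.size)
prob, _ = np.histogram(a, bins=edges, weights=weights)

# For comparison: density heights, whose areas (not heights) sum to 1
density, _ = np.histogram(a, bins=edges, density=True)

print(f"max probability: {prob.max()}")
print(f"sum of probabilities: {prob.sum()}")   # below 1 if some values fall outside the bins
print(f"max density: {density.max()}")
```

With this weighting no bar can exceed 1, and the heights sum to the fraction of the data covered by the bins.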
sns.distplot is deprecated. For reference, the same example written with it:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()