
python - Function to make a histogram with a normal curve for a passed pandas DataFrame column

Reposted. Author: 行者123. Updated: 2023-12-01 09:17:47

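I want to create a function that takes a df and a col and returns a histogram with a normal curve and some labels, something I can reuse and customize for future data (any suggestions for making it more customizable would be appreciated). It was written for the Kaggle Titanic training set, which can be downloaded from Kaggle if needed.

The function works fine for columns without NaN values. The Age column contains NaN, which I think is what raises the error. I tried to ignore the NaN values following "Error when plotting DataFrame containing NaN with Pandas 0.12.0 and Matplotlib 1.3.1 on Python 3.3.2": one of the solutions there suggests using subplot, but it did not work for me, and the accepted solution is to downgrade matplotlib (my version is '2.1.2', Python is 3.6.4). "pylab histogram get rid of nan" uses an interesting approach, but I could not apply it to my case.

How can I drop the NaN values? Can this function be made more customizable? Not the main question: is there a neat way to do things like rounding the mean/std, or adding more information?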

import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
mydf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

def df_col_hist(df, col, n_bins):

    fig, ax = plt.subplots()
    n, bins, patches = ax.hist(df[col], n_bins, normed=1)

    y = mlab.normpdf(bins, df[col].mean(), df[col].std())
    ax.plot(bins, y, '--')

    ax.set_xlabel(df[col].name)
    ax.set_ylabel('Probability density')
    ax.set_title(f'Histogram of {df[col].name}: $\mu={df[col].mean()}$, $\sigma={df[col].std()}$')

    fig.tight_layout()
    plt.show()

df_col_hist (train_data, 'Fare', 100)
#Works Fine, Tidy little histogram.

df_col_hist (train_data, 'Age', 100)
#ValueError: max must be larger than min in range parameter.

..\Anaconda3\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims)
..\Anaconda3\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-75-c81b76c1f28e> in <module>()
----> 1 df_col_hist (train_data, 'Age', 100)

<ipython-input-70-1cf1645db595> in df_col_hist(df, col, n_bins)
2
3 fig, ax = plt.subplots()
----> 4 n, bins, patches = ax.hist(df[col], n_bins, normed=1)
5
6 y = mlab.normpdf(bins, df[col].mean(), df[col].std())

~\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1715 warnings.warn(msg % (label_namer, func.__name__),
1716 RuntimeWarning, stacklevel=2)
-> 1717 return func(ax, *args, **kwargs)
1718 pre_doc = inner.__doc__
1719 if pre_doc is None:

~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(***failed resolving arguments***)
6163 # this will automatically overwrite bins,
6164 # so that each histogram uses the same bins
-> 6165 m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
6166 m = m.astype(float) # causes problems later if it's an int
6167 if mlast is None:

~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in histogram(a, bins, range, normed, weights, density)
665 if first_edge > last_edge:
666 raise ValueError(
--> 667 'max must be larger than min in range parameter.')
668 if not np.all(np.isfinite([first_edge, last_edge])):
669 raise ValueError(

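Best answer

Your call to normpdf is wrong, because it expects an array of x values as its first argument, not the number of bins. But mlab.normpdf is deprecated anyway.

That said, I would recommend using norm.pdf from scipy: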

import numpy as np
from scipy.stats import norm

s = np.std(df[col])  # standard deviation of the column
m = df[col].mean()
x = np.linspace(m - 3*s, m + 3*s, 51)  # x values spanning three standard deviations either side of the mean
y = norm.pdf(x, loc=m, scale=s)  # `loc` is the mean and `scale` the standard deviation of the curve

ax.plot(x, y, '--', label='probability density function')
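For reference, the curve that a normal pdf evaluates can also be written directly with NumPy. The following is only a minimal sketch for when SciPy is not available; the helper name `normal_pdf` is made up for this illustration and is not part of NumPy, Matplotlib, or SciPy.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard
    deviation sigma (assumed > 0), evaluated at each value of x.
    Hypothetical helper, not a library function."""
    x = np.asarray(x, dtype=float)
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Evaluate the standard normal density on a small grid.
xs = np.linspace(-3.0, 3.0, 7)
print(normal_pdf(xs, mu=0.0, sigma=1.0))
```

Dividing by `sigma` is what keeps the area under the curve equal to one, which is why the curve only lines up with a density-normalized histogram when the standard deviation is passed in.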

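PS: To drop the NaN values from a pandas column, there is

df[col].dropna()

i.e.:

# density=True replaces normed=1, which newer Matplotlib releases no longer accept
n, bins, patches = ax.hist(df[col].dropna(), n_bins, density=True)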
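Putting the pieces together, one possible version of the whole function might look like the sketch below. It is not the answerer's code: the `decimals` parameter, the default of 30 bins, and the synthetic `demo` frame (standing in for the Titanic data) are assumptions made for illustration. Rounding is done with a format spec in the title, which addresses the side question about tidying the mean and standard deviation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm

def df_col_hist(df, col, n_bins=30, decimals=2):
    """Plot a density histogram of df[col] with a normal curve overlaid.
    NaN values are dropped first. Returns (fig, ax) so the caller can
    customize the plot further."""
    values = df[col].dropna()
    mean, std = values.mean(), values.std()

    fig, ax = plt.subplots()
    ax.hist(values, bins=n_bins, density=True, alpha=0.6)

    # Normal curve with the sample mean and standard deviation.
    x = np.linspace(mean - 3 * std, mean + 3 * std, 200)
    ax.plot(x, norm.pdf(x, loc=mean, scale=std), '--', label='normal pdf')

    ax.set_xlabel(col)
    ax.set_ylabel('Probability density')
    # Raw f-string so the backslashes reach mathtext untouched.
    ax.set_title(rf'Histogram of {col}: $\mu={mean:.{decimals}f}$, '
                 rf'$\sigma={std:.{decimals}f}$')
    ax.legend()
    fig.tight_layout()
    return fig, ax

# Synthetic stand-in for a column with missing values (an assumption,
# not the Titanic data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Age': rng.normal(30, 10, size=200)})
demo.loc[::10, 'Age'] = np.nan

fig, ax = df_col_hist(demo, 'Age', n_bins=20)
```

Call `plt.show()` afterwards to display the figure. Returning `fig` and `ax` instead of showing the plot inside the function leaves room for further tweaks, such as changing limits or saving to a file.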

Regarding "python - Function to make a histogram with a normal curve for a passed pandas DataFrame column", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51082483/
