
pandas groupby agg function column/dtype error

Reposted · Author: bug小助手 · Updated: 2023-10-24 23:45:13



I'm working through Python for Data Analysis, and I'm having problems with part of the Ch. 9 (Data Aggregation and Group Operations) section on "Grouping with Functions."




Specifically, if I use the GroupBy object methods or, e.g., Numpy-defined functions, everything works fine. In particular, it ignores columns with strings and only operates on the (appropriate) numeric columns. However, if I try to define my own function to calculate some numeric output, it does not ignore the columns with strings, and it returns an Attribute Error.




Here's the example I'm having trouble with:




import numpy as np
from pandas import DataFrame

df = DataFrame({'data1': np.random.randn(5),
                'data2': np.random.randn(5),
                'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one']})


It works fine if I type either of these (I have numpy imported as np):




df.groupby('key1').mean()


or



grouped = df.groupby('key1')

grouped.agg(np.mean)
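(A side note for readers on a newer pandas: the silent dropping of string columns relied on here was later deprecated and removed, so on a recent install the equivalent calls name the numeric columns explicitly or pass `numeric_only=True`. A minimal sketch, assuming a pandas recent enough to have that parameter:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

# Name the numeric columns explicitly...
means = df.groupby('key1')[['data1', 'data2']].mean()

# ...or let mean() drop non-numeric columns itself.
also_means = df.groupby('key1').mean(numeric_only=True)
```

Both produce one row per key ('a' and 'b') with the two data columns averaged.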


But if I try these, I get errors ('peak_to_peak' is from the book):




def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

grouped.agg(lambda x: np.mean(x))


Trying 'peak_to_peak' gives me a big, long error that ends with:




TypeError: unsupported operand type(s) for -: 'str' and 'str'


Trying the lambda function with np.mean() gives me a big, long error that ends with:




TypeError: Could not convert onetwoone to numeric


Trying other user-defined functions produces similar errors. In all these cases, it's pretty clearly trying to apply peak_to_peak() or np.mean() (or whatever) to the (subsets of the) 'key2' column from df, whereas for the built-in methods and predefined functions, it (correctly) ignores the 'key2' column subsets.




Any insights would be appreciated.




Update: It turns out if I pass 'peak_to_peak' or the lambda function as lists (e.g., grouped.agg([peak_to_peak])), it works fine. Note that this is not how it's presented in the book, nor are lists required for predefined functions. So, it's still confusing, but at least it's functional, I guess.

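(On a current pandas the list form alone no longer rescues the string column, but combined with a column selection it shows what the list syntax changes: one sub-column per function, in a two-level column index. A sketch, with df and peak_to_peak as defined above:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped = df.groupby('key1')

# A list of functions produces a MultiIndex on the result's columns:
# the outer level is the data column, the inner level the function name.
result = grouped[['data1', 'data2']].agg([peak_to_peak])
```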


More comments

What version of pandas are you using? On the latest master for .agg(lambda x: np.mean(x)) I get NaNs back in the key2 column. The documentation on agg doesn't mention this at all, and it should. Care to open an issue on github about this?


I've got pandas 0.13.1 (and numpy 1.7.1 and python 2.7.6, for what those are worth). I didn't see any NaNs in mine... I'll look into opening an issue on github. Thanks for the response.


This was a regression from prior to 0.13, not sure exactly when (the book is based on about 0.10, IIRC); fixed here: github.com/pydata/pandas/pull/6338. It should essentially ignore that column (it was just not catching the error).


Recommended answer

In the approach you use, agg passes each column to the function one at a time, as a Series of all its values. Since the key2 column contains non-numeric values, the subtraction cannot be performed between two strings.

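The failure can be reproduced outside of agg entirely, since each column is handed to the function as a Series; a minimal sketch using the question's df and peak_to_peak:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

def peak_to_peak(arr):
    return arr.max() - arr.min()

# On a numeric Series, max() - min() is well defined.
spread = peak_to_peak(df['data1'])

# On the string column, max() and min() still work (lexicographic order),
# but subtracting the two strings raises the TypeError from the question.
try:
    peak_to_peak(df['key2'])
except TypeError as exc:
    print('TypeError:', exc)
```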


You can solve your problem as follows:



grouped[["data1", "data2"]].agg(peak_to_peak)

grouped[["data1", "data2"]].agg(lambda x: np.mean(x))
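A slightly more general version of the same fix, assuming only that every numeric column should be aggregated, selects the columns by dtype instead of by name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Keep every numeric column, whatever it is called, then group as before.
numeric_cols = df.select_dtypes(include='number').columns
result = df.groupby('key1')[list(numeric_cols)].agg(peak_to_peak)
```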

