I'm working through Python for Data Analysis, and I'm having problems with part of the Ch. 9 (Data Aggregation and Group Operations) section on "Grouping with Functions."
Specifically, if I use the GroupBy object methods or, e.g., NumPy-defined functions, everything works fine. In particular, it ignores columns containing strings and operates only on the appropriate numeric columns. However, if I try to define my own function to compute some numeric output, it does not ignore the string columns, and it raises a TypeError.
Here's the example I'm having trouble with:
import numpy as np
from pandas import DataFrame

df = DataFrame({'data1': np.random.randn(5),
                'data2': np.random.randn(5),
                'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one']})
It works fine if I type either of these (I have numpy imported as np):
df.groupby('key1').mean()
or
grouped = df.groupby('key1')
grouped.agg(np.mean)
But if I try these, I get errors ('peak_to_peak' is from the book):
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
grouped.agg(lambda x: np.mean(x))
Trying 'peak_to_peak' gives me a long traceback that ends with:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Trying the lambda function with np.mean() gives me a long traceback that ends with:
TypeError: Could not convert onetwoone to numeric
Trying other user-defined functions produces similar errors. In all these cases, it's pretty clearly trying to apply peak_to_peak() or np.mean() (or whatever) to the (subsets of the) 'key2' column from df, whereas for the built-in methods and predefined functions, it (correctly) ignores the 'key2' column subsets.
Any insights would be appreciated.
Update: It turns out if I pass 'peak_to_peak' or the lambda function as lists (e.g., grouped.agg([peak_to_peak])), it works fine. Note that this is not how it's presented in the book, nor are lists required for predefined functions. So, it's still confusing, but at least it's functional, I guess.
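Another workaround (not from the book, just a sketch that sidesteps the problem): explicitly select the numeric columns from the GroupBy object before aggregating, so the custom function never sees the string column at all. The `peak_to_peak` function and the DataFrame below mirror the example above:

```python
import numpy as np
import pandas as pd

def peak_to_peak(arr):
    # Range (max minus min) of each group's values.
    return arr.max() - arr.min()

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

# Selecting only the numeric columns keeps 'key2' away from the
# custom function entirely, so the plain (non-list) form works.
result = df.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)
```

This also makes the intent explicit in the code, rather than relying on pandas to silently drop the string column.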
What version of pandas are you using? On the latest master, for .agg(lambda x: np.mean(x)) I get NaNs back in the key2 column. The documentation on agg doesn't mention this at all, and it should. Care to open an issue on GitHub about this?
I've got pandas 0.13.1 (and NumPy 1.7.1 and Python 2.7.6, for what those are worth). I didn't see any NaNs in mine... I'll look into opening an issue on GitHub. Thanks for the response.
This was a regression from prior to 0.13, not sure exactly when (the book is based on about 0.10, IIRC); fixed here: github.com/pydata/pandas/pull/6338. It should essentially ignore that column (and was just not catching the error).
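For anyone who wants a version-independent way to avoid the issue rather than depending on pandas silently ignoring the column, one hedged sketch is to filter to numeric dtypes first with `select_dtypes` and group by the original key Series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one']})

# select_dtypes keeps only the numeric columns; grouping by the
# original df['key1'] Series still lines up row-for-row, even
# though 'key1' itself was dropped from the filtered frame.
numeric = df.select_dtypes(include=[np.number])
result = numeric.groupby(df['key1']).agg(lambda x: np.mean(x))
```

This way the custom function is only ever handed numeric data, so no error-swallowing behavior in .agg() is needed.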