
python - Why is DataFrame.loc[[1]] 1,800 times slower than df.ix[[1]] and 3,500 times slower than df.loc[1]?

Reposted. Author: 太空狗. Updated: 2023-10-29 18:26:28

Try it yourself:

import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop
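
The session above relies on IPython's %timeit magic and Python 2's xrange. As a rough sketch, the same comparison can be run as a plain Python 3 script with the standard timeit module. The helper name best_time and the smaller series size are assumptions made for illustration, and the numbers on a current pandas install will differ from the 0.15.1 figures quoted here.

```python
import timeit

import pandas as pd


def best_time(fn, number=100, repeat=3):
    """Hypothetical helper: best per-call time in seconds, similar in
    spirit to %timeit's "best of 3"."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


# Smaller than the 5,000,000 rows in the original so the script runs quickly.
s = pd.Series(range(1_000_000))

scalar_time = best_time(lambda: s.loc[0])    # scalar label
list_time = best_time(lambda: s.loc[[0]])    # single-element list

print("pandas", pd.__version__)
print("s.loc[0]   : %.1f us per call" % (scalar_time * 1e6))
print("s.loc[[0]] : %.1f us per call" % (list_time * 1e6))
print("ratio      : %.1fx" % (list_time / scalar_time))
```

Printing pd.__version__ alongside the timings makes it easier to tell whether a given install is affected.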

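Update: this is a legitimate bug in pandas, probably introduced in 0.15.1 around August 2014. Workarounds: keep using an older version of pandas until a new release comes out; get a cutting-edge development version from GitHub; manually make a one-line modification in your installed pandas release; or temporarily use .ix instead of .loc.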

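I have a DataFrame with 4.8 million rows, and selecting a single row with .loc[[id]] (a single-element list) takes 489 ms, almost half a second. That is 1,800 times slower than the equivalent .ix[[id]] and 3,500 times slower than .loc[id] (passing the id as a value rather than as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is a thousand times faster and produces an identical result. My understanding was that .ix is supposed to be slower, isn't it?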

I am using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somehow more general, and presumably slower, than .loc and .iloc. Specifically, it says

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
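
To make the label versus position distinction in that quote concrete, here is a minimal sketch on a toy frame. The frame, its labels, and the variable names are all hypothetical. It uses only .loc and .iloc, since .ix is no longer available in current pandas releases.

```python
import pandas as pd

# Hypothetical toy frame: integer labels that do not match row positions.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df.loc[[20]]      # label 20 -> the row holding "b"
by_position = df.iloc[[2]]   # position 2 -> the row holding "c"

scalar_row = df.loc[20]      # a scalar label returns a Series, not a DataFrame
print(type(scalar_row).__name__, type(by_label).__name__)

try:
    df.iloc[[20]]            # only 3 rows, so position 20 is out of bounds
except IndexError as exc:
    print("IndexError:", exc)
```

The out-of-bounds case mirrors the check in the benchmark session, where the id is a valid label but not a valid position.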

Here is an IPython session with benchmarks:

print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
print 'df.index begins with ', df.index[:20]
print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

# First extract one element directly. Expected result, no issues here.
id=5965356
print 'Extract one element with id %d' % id
%timeit df.loc[id]
%timeit df.ix[id]
print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

# Now extract this one element as a list.
%timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
%timeit df.ix[[id]]
print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]])) # this one should be True
# Let's double-check that in this case .ix is the same as .loc, not .iloc,
# as this would explain the difference.
try:
    print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
except:
    print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

# Finally, for the sake of completeness, let's take a look at iloc
%timeit df.iloc[3456789] # this is still 100+ times faster than the next version
%timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop

Best answer

Pandas indexing was very slow, so I switched to numpy indexing:

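import numpy as np
import pandas as pd

df = pd.DataFrame(some_content)  # some_content is a placeholder for your data

# takes forever!!
for iPer in np.arange(-df.shape[0], 0, 1):
    x = df.iloc[iPer, :].values
    y = df.iloc[-1, :].values

# fast! Convert to a numpy matrix once, then index that instead
vals = np.matrix(df.values)
for iPer in np.arange(-vals.shape[0], 0, 1):
    x = vals[iPer, :]
    y = vals[-1, :]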
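
As a self-contained sketch of the same idea, the snippet below converts the frame once and then indexes the resulting array inside the loop. The toy data stands in for the answer's some_content and is purely hypothetical. It also swaps np.matrix for a plain ndarray obtained with DataFrame.to_numpy(), which is a substitution made here and not the answer's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the answer's `some_content`.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])

vals = df.to_numpy()  # convert once, outside the loop

row_sums = []
for i in range(-vals.shape[0], 0):
    x = vals[i, :]            # plain ndarray row, no pandas indexing involved
    row_sums.append(float(x.sum()))

# Cross-check against pandas positional indexing on the same rows.
expected = [float(df.iloc[i, :].to_numpy().sum()) for i in range(-len(df), 0)]
assert row_sums == expected
print(row_sums)
```

Note that this only pays off when the columns share a dtype. A frame with mixed column types converts to an object array, which gives up much of numpy's speed.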

Regarding "python - Why is DataFrame.loc[[1]] 1,800 times slower than df.ix[[1]] and 3,500 times slower than df.loc[1]?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27596832/
