
python - How to efficiently count rows in a single pass over a dataframe

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:06:02

I have a dataframe made up of strings like this:

ID_0 ID_1
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h

For each pair of strings, I can count how many rows contain either string, as follows.

import pandas as pd
import itertools

# `prefix=` was removed from read_csv in pandas 2.0, so name the columns explicitly.
df = pd.read_csv("test.csv", header=None, names=["ID_0", "ID_1"], usecols=[0, 1])

alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
# This just makes a set of all the strings in the dataframe.
alphabet = alphabet_1 | alphabet_2
# This iterates over all pairs and counts how many rows have either string in either column.
for (x, y) in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))

This gives:

a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]

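The problem is that my dataframe is very large and the alphabet has 200 symbols, and this method makes a separate pass over the whole dataframe for every pair of letters.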

Is there some way to get the same output with a single pass over the dataframe?


Timings

I created some data:

import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])

# Testing Parfait's answer: for each row, rescan the whole frame for either of its two strings.
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0']) |
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return ser

%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop

I would like to be able to do this for n = 10**8. Can this be sped up?

Best answer

You can avoid the row-level sub-iteration entirely by doing the counting with a bit of combinatorics and set theory:

# Count of individual characters and pairs.
# A row such as (b, b) must count once for b, so ID_1 is only added where it differs from ID_0.
# (Series.append was removed in pandas 2.0; pd.concat is the replacement.)
char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Get the counts.
df['count'] = [char_count[x] if x == y else char_count[x] + char_count[y] - (pair_count[x,y] + pair_count.get((y,x),0)) for x,y in df[['ID_0', 'ID_1']].values]

Resulting output:

  ID_0 ID_1  count
0 g k 1
1 a h 4
2 c i 4
3 j e 1
4 d i 6
5 i h 6
6 b b 1
7 d d 3
8 i a 5
9 d h 5

I compared the output of my method against the row-level iteration method on a dataset with 5000 rows, and all of the counts matched.
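A comparison of this kind could look roughly like the sketch below. It is not the answer's original test: the function names `fast_counts` and `slow_counts`, the seed, and the smaller size `n = 500` are all assumptions chosen for illustration, and `pd.concat` is used because `Series.append` no longer exists in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase


def fast_counts(frame):
    """Per-row counts via inclusion-exclusion (hypothetical helper name)."""
    distinct = frame["ID_0"] != frame["ID_1"]
    # Rows containing each letter; a row like (b, b) contributes once.
    char_count = pd.concat(
        [frame["ID_0"], frame.loc[distinct, "ID_1"]]
    ).value_counts().to_dict()
    # Rows holding each ordered pair (ID_0, ID_1).
    pair_count = frame.groupby(["ID_0", "ID_1"]).size().to_dict()
    return [
        char_count[x] if x == y
        else char_count[x] + char_count[y]
        - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in frame[["ID_0", "ID_1"]].to_numpy()
    ]


def slow_counts(frame):
    """Row-level reference: rescan the whole frame once per row."""
    out = []
    for x, y in frame[["ID_0", "ID_1"]].to_numpy():
        mask = frame["ID_0"].isin([x, y]) | frame["ID_1"].isin([x, y])
        out.append(int(mask.sum()))
    return out


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 500  # assumption: smaller than the 5000 rows mentioned, to keep the check quick
data = rng.choice(list(ascii_lowercase), size=(n, 2))
df = pd.DataFrame(data, columns=["ID_0", "ID_1"])

fast = fast_counts(df)
slow = slow_counts(df)
all_match = fast == slow
print(all_match)
```

The slow reference is quadratic in the number of rows, which is why the sketch keeps `n` small.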

Why does this work? It basically just relies on the formula for the cardinality of the union of two sets: $|A \cup B| = |A| + |B| - |A \cap B|$

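The cardinality for a given element (the number of rows that contain it) is just char_count. When the two elements are distinct, the cardinality of the intersection is just the count of rows holding that pair of elements, in either order. Note that when the two elements are identical, the formula reduces to just char_count.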
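As a small worked check of that reasoning, the sketch below takes the ten sample rows from the question and compares the directly counted union with the inclusion-exclusion value for the pair (d, i). The helper `rows_containing` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# The ten sample rows from the question, one two-letter string per row.
rows = ["gk", "ah", "ci", "je", "di", "ih", "bb", "dd", "ia", "dh"]
df = pd.DataFrame([list(r) for r in rows], columns=["ID_0", "ID_1"])


def rows_containing(frame, letter):
    """Return the set of row labels where `letter` appears in either column."""
    mask = (frame["ID_0"] == letter) | (frame["ID_1"] == letter)
    return set(frame.index[mask.to_numpy()])


A = rows_containing(df, "d")  # rows 4, 7, 9
B = rows_containing(df, "i")  # rows 2, 4, 5, 8

union_direct = len(A | B)
union_formula = len(A) + len(B) - len(A & B)  # 3 + 4 - 1
print(len(A), len(B), len(A & B), union_direct, union_formula)  # prints: 3 4 1 6 6
```

Both values equal 6, which is the count shown for row 4 (d, i) in the output table.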

Timings

Using the timing setup from the question, and the following function for my answer:

def root(df):
    # Series.append was removed in pandas 2.0; pd.concat is the replacement.
    char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [char_count[x] if x == y else char_count[x] + char_count[y] - (pair_count[x,y] + pair_count.get((y,x),0)) for x,y in df[['ID_0', 'ID_1']].values]
    return df

The timings for n=10**4 are:

%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop

%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop

The timings for n=10**6 are:

%timeit root(df.copy())
10 loops best of 3: 2.22 s per loop

It looks like my solution scales approximately linearly.
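The question originally asked for a count for every pair of letters in the alphabet rather than one per row. The same two dictionaries can plausibly serve that purpose too. The sketch below is an adaptation, not part of the original answer; `pair_union_counts` is a hypothetical name, and it is run on the ten sample rows from the question.

```python
import itertools
import pandas as pd

# The ten sample rows from the question, one two-letter string per row.
rows = ["gk", "ah", "ci", "je", "di", "ih", "bb", "dd", "ia", "dh"]
df = pd.DataFrame([list(r) for r in rows], columns=["ID_0", "ID_1"])


def pair_union_counts(frame):
    """Map each unordered letter pair to the number of rows containing either letter."""
    distinct = frame["ID_0"] != frame["ID_1"]
    # Rows containing each letter; a row like (b, b) contributes once.
    char_count = pd.concat(
        [frame["ID_0"], frame.loc[distinct, "ID_1"]]
    ).value_counts().to_dict()
    # Rows holding each ordered pair (ID_0, ID_1).
    pair_count = frame.groupby(["ID_0", "ID_1"]).size().to_dict()

    result = {}
    for x, y in itertools.combinations(sorted(char_count), 2):
        # Rows containing both letters, in either column order.
        both = pair_count.get((x, y), 0) + pair_count.get((y, x), 0)
        result[x, y] = char_count[x] + char_count[y] - both
    return result


counts = pair_union_counts(df)
for (x, y), c in list(counts.items())[:5]:
    print(x, y, c)
```

The dataframe is scanned only while building the two dictionaries; the loop over pairs afterwards does dictionary lookups only, so its cost depends on the alphabet size rather than on the number of rows.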

This post is about python - How to efficiently count rows in a single pass over a dataframe; a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39437504/
