
python - How to group a DataFrame by a column when another column contains lists of tuples

Reposted. Author: 行者123. Updated: 2023-11-28 20:56:01

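I am trying to group my DataFrame by the values in one of its columns, "category". However, one of the other columns, "prob", holds a list of tuples in every row. When I group by "category", the "prob" column disappears.

My current df: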

category  other  prob
one       val    [(hi, hello), (jimbob, joe)]
one       val2   [(this, not), (is, work), (now, any)]
two       val2   [(bob, jones), (work, here)]
three     val3   [(milk, coffee), (tea, bread)]
two       val3   [(money, here), (job, money)]

Expected output:

category  other       prob
one       val, val2   [(hi, hello), (jimbob, joe), (this, not), (is, work), (now, any)]
two       val2, val3  [(bob, jones), (work, here), (money, here), (job, money)]
three     val3        [(milk, coffee), (tea, bread)]

What is the best way to do this? Apologies if I have worded the question badly; let me know if anything is unclear. Thanks!

Best answer

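You can aggregate with GroupBy.agg, using join for the string column and flattening the lists of tuples for the "prob" column. Three flattening solutions are given, plus sum, which should be used only when the data is small and performance does not matter: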

import functools
import operator

from itertools import chain

# flatten a Series of lists with a nested list comprehension
f = lambda x: [z for y in x for z in y]
# faster alternative
# f = lambda x: list(chain.from_iterable(x))
# faster alternative 2
# f = lambda x: functools.reduce(operator.iadd, x, [])
# slow alternative
# f = lambda x: x.sum()
df = df.groupby('category', as_index=False).agg({'other': ', '.join, 'prob': f})

print(df)
  category       other                                               prob
0      one   val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1    three        val3                     [(milk, coffee), (tea, bread)]
2      two  val2, val3  [(bob, jones), (work, here), (money, here), (j...
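To try the aggregation end to end, the sketch below rebuilds the question's sample frame and applies the chain.from_iterable variant. The tuple elements are assumed to be plain strings (the question shows them unquoted), the helper name flatten is only illustrative, and sort=False is an added choice so that groups come out in order of first appearance, matching the expected output in the question rather than the sorted output printed above.

```python
from itertools import chain

import pandas as pd

# Sample frame rebuilt from the question; tuple elements are assumed to be strings.
df = pd.DataFrame({
    "category": ["one", "one", "two", "three", "two"],
    "other": ["val", "val2", "val2", "val3", "val3"],
    "prob": [
        [("hi", "hello"), ("jimbob", "joe")],
        [("this", "not"), ("is", "work"), ("now", "any")],
        [("bob", "jones"), ("work", "here")],
        [("milk", "coffee"), ("tea", "bread")],
        [("money", "here"), ("job", "money")],
    ],
})


def flatten(lists):
    """Concatenate the lists of one group into a single flat list."""
    return list(chain.from_iterable(lists))


# sort=False keeps the groups in order of first appearance: one, two, three.
result = (
    df.groupby("category", as_index=False, sort=False)
      .agg({"other": ", ".join, "prob": flatten})
)

print(result)
```

Dropping sort=False returns the groups in sorted key order (one, three, two), as in the printed output of the answer.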

Performance:

[Figure: perfplot timing comparison of the solutions, log-log axes, x-axis len(df)]

Test code:

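import functools
import operator
import string
from itertools import chain

import numpy as np
import pandas as pd
import perfplot

# seed after the imports so that np is defined
np.random.seed(2019)

default_value = 10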

def iadd(df1):
    f = lambda x: functools.reduce(operator.iadd, x, [])
    d = {'other': ', '.join, 'prob': f}
    return df1.groupby('category', as_index=False).agg(d)

def listcomp(df1):
    f = lambda x: [z for y in x for z in y]
    d = {'other': ', '.join, 'prob': f}
    return df1.groupby('category', as_index=False).agg(d)

def from_iterable(df1):
    f = lambda x: list(chain.from_iterable(x))
    d = {'other': ', '.join, 'prob': f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_series(df1):
    f = lambda x: x.sum()
    d = {'other': ', '.join, 'prob': f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_cat(df1):
    d = {'other': lambda x: x.str.cat(sep=', '), 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_join(df1):
    d = {'other': ', '.join, 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)


def make_df(n):
    # integer division keeps the upper bound of randint an int
    a = np.random.randint(0, n // 10, n)
    b = np.random.choice(list('abcdef'), len(a))
    c = [tuple(np.random.choice(list(string.ascii_letters), 2)) for _ in a]
    df = pd.DataFrame({"category": a, "other": b, "prob": c})
    # collect the tuples into one list per (category, other) pair
    df1 = df.groupby(['category', 'other'])['prob'].apply(list).reset_index()
    return df1

perfplot.show(
    setup=make_df,
    kernels=[iadd, listcomp, from_iterable, sum_series, sum_groupby_cat, sum_groupby_join],
    n_range=[10**k for k in range(1, 8)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
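If perfplot is not installed, a rough comparison of the flattening helpers can be made on plain Python lists with the standard timeit module. This is only a sketch: the input size and repeat count are arbitrary assumptions, and the built-in sum(x, []) stands in for Series.sum(), on the assumption that both repeatedly concatenate lists with +, which is what makes that variant slow.

```python
import functools
import operator
import timeit
from itertools import chain

# Hypothetical input: 2,000 short lists of tuples (size chosen arbitrarily).
data = [[(i, j) for j in range(3)] for i in range(2000)]

flatteners = {
    "listcomp": lambda x: [z for y in x for z in y],
    "from_iterable": lambda x: list(chain.from_iterable(x)),
    # the fresh [] initializer is extended in place, so the inputs stay untouched
    "iadd": lambda x: functools.reduce(operator.iadd, x, []),
    # builtin sum as a stand-in for Series.sum(): copies the accumulator each step
    "sum": lambda x: sum(x, []),
}

# All variants should build the same flat list.
results = {name: f(data) for name, f in flatteners.items()}

# Total time for 5 calls of each variant.
timings = {
    name: timeit.timeit(lambda f=f: f(data), number=5)
    for name, f in flatteners.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: {seconds:.4f} s")
```

The timings vary by machine and exclude the groupby overhead, so they only indicate the relative cost of the flattening step.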

Regarding "python - How to group a DataFrame by a column when another column contains lists of tuples", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55357754/
