
python - Finding subsets of keywords in a Pandas Series (Python)

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:05:14

I am working with a Series that looks very much like this:

import pandas as pd

l0 = ['smartphone', 'battery', 'case', 'grey', '10071852']
l1 = ['phone', 'new', 'charging', 'case', 'white']
l2 = ['tablet', 'phone', 'pin', 'adapter', 'ex766']
l3 = ['phone', 'silicon', 'case', 'brown']

mySeries = pd.Series([l0,l1,l2,l3])

print(mySeries)

0    [smartphone, battery, case, grey, 10071852]
1            [phone, new, charging, case, white]
2           [tablet, phone, pin, adapter, ex766]
3                  [phone, silicon, case, brown]

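I am trying to search each row of this Series (each list) for keywords and sets of keywords that it may contain. More specifically, suppose I want to find out whether a row of the Series contains any of the following keywords: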

simple_keywords = {'case', 'adapter'}

I also want to find out whether the Series contains any of the following keyword pairs:

double_keywords = {'battery case', 'charging case'}

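Looking for the simple_keywords seems easy enough. However, I also want to look for the pairs, and when a row contains a pair such as 'battery case', I want the keyword pair to be returned rather than just 'case'.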
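The priority rule just described (prefer a pair, fall back to a single keyword) can be illustrated on one row with plain set operations. This is only a sketch: treating a pair as matched when both of its words occur anywhere in the row, and the names `matched_pairs`, `matched_singles` and `best`, are assumptions made for illustration.

```python
tokens = ['smartphone', 'battery', 'case', 'grey', '10071852']
simple_keywords = {'case', 'adapter'}
double_keywords = {'battery case', 'charging case'}

present = set(tokens)

# A pair counts as matched when all of its words are in the row (subset test).
matched_pairs = sorted(p for p in double_keywords if set(p.split()) <= present)

# Single keywords match by plain set intersection.
matched_singles = sorted(present & simple_keywords)

# Prefer a pair when one exists, otherwise fall back to a single keyword.
best = (matched_pairs or matched_singles or ['none'])[0]
print(matched_pairs, matched_singles, best)
```

For this row both 'battery case' and 'case' match, and the pair is the one that gets reported.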

In addition, I have a DataFrame that looks like this:

d = {'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'], 'Product': ['none', 'none','none','none'],'Frequency': [5, 10, 1, 2]}
myDF = pd.DataFrame(data=d)

print(myDF)

         Date  Frequency Product
0  03/08/2014          5    none
1  04/08/2014         10    none
2  05/08/2014          1    none
3  06/08/2014          2    none

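My end goal is to write into this DataFrame (in the Product column) the corresponding keyword (or keyword pair) that I identify in the Series. Each row of the Series corresponds to exactly the same row of the DataFrame, so the order matters a great deal. For example, I would like to see that on 03/08/2014 the product 'battery case' has a frequency of 5.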

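I tried to come up with a solution by splitting the keyword pairs apart, but it seems very slow and inefficient, because the Series I am working with has more than 350,000 rows (I left it running overnight and it still had not finished):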

first_keywords = {'case', 'adapter'}
second_keywords = {'battery', 'charging'}

mySeries_range = len(mySeries)

for i in range(mySeries_range):
    for x, y in [(x, y) for x in first_keywords for y in second_keywords]:
        if x in mySeries[i] and y in mySeries[i]:
            myDF.Product[i] = y + ' ' + x
        elif x in mySeries[i] and y not in mySeries[i]:
            myDF.Product[i] = x

The final result I am hoping for is:

         Date  Frequency        Product
0  03/08/2014          5   battery case
1  04/08/2014         10  charging case
2  05/08/2014          1        adapter
3  06/08/2014          2           case
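One possible way to produce this table in a single pass over the rows, instead of a nested loop with per-cell assignment, is sketched here. The helper `label` is hypothetical, and so are the rules that pairs are checked before single keywords and that the first match in list order wins. The sketch rebuilds the sample data so that it runs on its own.

```python
import pandas as pd

mySeries = pd.Series([
    ['smartphone', 'battery', 'case', 'grey', '10071852'],
    ['phone', 'new', 'charging', 'case', 'white'],
    ['tablet', 'phone', 'pin', 'adapter', 'ex766'],
    ['phone', 'silicon', 'case', 'brown'],
])
myDF = pd.DataFrame({
    'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'],
    'Frequency': [5, 10, 1, 2],
    'Product': ['none'] * 4,
})

simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']

# Pre-split each pair once so the per-row work is only set comparisons.
pairs = [(p, set(p.split())) for p in double_keywords]


def label(tokens):
    """Hypothetical helper: pair if all its words are present, else a single keyword."""
    present = set(tokens)
    for name, words in pairs:
        if words <= present:
            return name
    for word in simple_keywords:
        if word in present:
            return word
    return 'none'


# Assigning a list writes the labels by position, so the row order is preserved.
myDF['Product'] = [label(tokens) for tokens in mySeries]
print(myDF)
```

Each row is converted to a set once and every keyword test is then a constant-time lookup, which should scale to a few hundred thousand rows far better than repeated element-wise writes to the DataFrame.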

It would be great if someone could help me. Apologies if my code is not very pretty... I am trying to get better!

Best answer

You can generate combinations of any length from the words in each list of mySeries like this:

import itertools
df_comb = pd.concat([mySeries.apply(lambda x: [" ".join(l)
                                               for l in list(itertools.combinations(x, max_len))
                                               ]).rename(max_len)
                     for max_len in [1, 2]], axis=1).astype(str)

This is the result:

>>> df_comb
                                             1  \
0  [smartphone, battery, case, grey, 10071852]
1          [phone, new, charging, case, white]
2         [tablet, phone, pin, adapter, ex766]
3                [phone, silicon, case, brown]

                                                   2
0  [smartphone battery, smartphone case, smartpho...
1  [phone new, phone charging, phone case, phone ...
2  [tablet phone, tablet pin, tablet adapter, tab...
3  [phone silicon, phone case, phone brown, silic...
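To make the contents of column 2 concrete, here is what `itertools.combinations` yields for a single row. This is a small illustration only; the variable names `row` and `pairs` are made up for the example.

```python
import itertools

row = ['phone', 'silicon', 'case', 'brown']

# All 2-word combinations, joined with a space as in df_comb.
pairs = [" ".join(c) for c in itertools.combinations(row, 2)]
print(pairs)
# ['phone silicon', 'phone case', 'phone brown',
#  'silicon case', 'silicon brown', 'case brown']
```

Two points follow from this. A row of n words produces n(n-1)/2 pairs, so long rows get expensive. Combinations also keep the input order, so 'phone case' is generated but 'case phone' is not: a keyword pair is only found if its words appear in the row in the same order.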

Now let's turn the keyword sets into lists so they are easier to iterate over:

simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']

Then you can count the elements like this:

>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w), axis=0)[len(w.split(' '))].rename(w)
...            for w in simple_keywords], axis=1)
   case  adapter
0     1        0
1     1        0
2     0        1
3     1        0

>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),axis=0)[len(w.split(' '))].rename(w) for w in double_keywords],axis=1)

   battery case  charging case
0             1              0
1             0              1
2             0              0
3             0              0
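One caveat about the counting step: `Series.str.count` treats its argument as a regular expression and counts matches anywhere in the stringified list, so a keyword also matches inside longer words. The comparison below is a sketch on made-up rows (the word 'suitcase' is not in the original data), and the exact-token variant is an assumed alternative rather than part of the answer.

```python
import pandas as pd

# Made-up rows for illustration; 'suitcase' contains the substring 'case'.
rows = pd.Series([['smartphone', 'suitcase'], ['phone', 'case']])

# Substring counting on the stringified list, as done with df_comb above.
substring_counts = rows.astype(str).str.count('case')

# Assumed stricter alternative: count exact token matches in each list.
exact_counts = rows.apply(lambda tokens: tokens.count('case'))

print(substring_counts.tolist())  # [1, 1]
print(exact_counts.tolist())      # [0, 1]
```

If keywords can contain regex metacharacters such as '+' or '.', passing them through `re.escape` before calling `str.count` avoids accidental pattern matches.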

Alternatively, we can iterate like this:

df_count = pd.DataFrame()
for list_of_keywords in [simple_keywords, double_keywords]:
    df_count_temp = pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),
                                             axis=0)[len(w.split(' '))].rename(w)
                               for w in list_of_keywords], axis=1)
    df_count = pd.concat([df_count, df_count_temp], axis=1)

The counts will be:

>>> df_count

   case  adapter  battery case  charging case
0     1        0             1              0
1     1        0             0              1
2     0        1             0              0
3     1        0             0              0

You can get the final totals like this:

>>> df_count.sum(axis=0).to_frame()

               0
case           3
adapter        1
battery case   1
charging case  1

You can create a function so that this can be applied to the entries for each day.

def my_func(mySeries, keywords=[['case', 'adapter'], ['battery case', 'charging case']]):
    import itertools
    # Number of words per keyword group, taken from the first keyword of each group
    keyword_lengths = [len(k[0].split(' ')) for k in keywords]
    # One column per word count, holding the stringified word combinations
    df_comb = pd.concat([mySeries.apply(lambda x: [" ".join(l)
                                                   for l in list(itertools.combinations(x, max_len))
                                                   ]).rename(max_len)
                         for max_len in keyword_lengths], axis=1).astype(str)

    df_count = pd.DataFrame()
    for list_of_keywords in keywords:
        # Count each keyword in the column that matches its word count
        df_count_temp = pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),
                                                 axis=0)[len(w.split(' '))].rename(w)
                                   for w in list_of_keywords], axis=1)
        df_count = pd.concat([df_count, df_count_temp], axis=1)

    return df_count

Imagine this is your pd.Series:

>>> newSeries 
2014-03-08    [smartphone, battery, case, grey, 10071852]
2014-03-08            [phone, new, charging, case, white]
2014-03-08           [tablet, phone, pin, adapter, ex766]
2014-03-08                  [phone, silicon, case, brown]
2014-04-08            [phone, new, charging, case, white]
2014-04-08                           [tablet, phone, pin]
2014-04-08                               [phone, adapter]
dtype: object



>>> my_func(newSeries)

            case  adapter  battery case  charging case
2014-03-08     1        0             1              0
2014-03-08     1        0             0              1
2014-03-08     0        1             0              0
2014-03-08     1        0             0              0
2014-04-08     1        0             0              1
2014-04-08     0        0             0              0
2014-04-08     0        1             0              0

You can then group the returned DataFrame by date and sum the counts. This gives you the number of appearances per date:

>>> df_appearances= my_func(newSeries).reset_index().groupby('index'
).sum().T.unstack().reset_index()

>>> df_appearances.columns = ['Date', 'Product', 'Frequency']

>>> df_appearances

         Date        Product  Frequency
0  2014-03-08           case          3
1  2014-03-08        adapter          1
2  2014-03-08   battery case          1
3  2014-03-08  charging case          1
4  2014-04-08           case          1
5  2014-04-08        adapter          1
6  2014-04-08   battery case          0
7  2014-04-08  charging case          1
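The reshaping step can also be tried in isolation on a small hand-made count table. The sketch below uses invented counts rather than the output of my_func, and groups on the index with `groupby(level=0)` instead of resetting it first; that variation is an assumption about an equivalent route, not the answer's own code.

```python
import pandas as pd

# Hand-made per-row counts, indexed by date (illustrative values only).
df_count = pd.DataFrame(
    {'case': [1, 1, 0, 1], 'battery case': [1, 0, 0, 0]},
    index=['2014-03-08', '2014-03-08', '2014-04-08', '2014-04-08'],
)

# Sum the counts of all rows that share the same date.
per_date = df_count.groupby(level=0).sum()

# Transpose and unstack to get one (date, product) entry per row.
long_form = per_date.T.unstack().reset_index()
long_form.columns = ['Date', 'Product', 'Frequency']
print(long_form)
```

The result has one row per date and product, so a single frequency can be looked up with an ordinary boolean filter on the Date and Product columns.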

Regarding python - Finding subsets of keywords in a Pandas Series (Python), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48947557/
