Python pandas: counting the number of regex matches in a string

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:15:36

I have a DataFrame of sentences and a dictionary of terms grouped by topic, and I want to count the number of term matches for each topic.

import pandas as pd

terms = {
    'animals': ["fox", "deer", "eagle"],
    'people': ['John', 'Rob', 'Steve'],
    'games': ['basketball', 'football', 'hockey'],
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'The quick brown fox was playing basketball today',
        'John and Rob visited the eagles nest, the foxes ran away',
        'Bill smells like a wet dog',
        'Steve threw the football at a deer. But the football missed',
        'Sheriff John does not like hockey',
    ],
})

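So far I have created a column for each topic and, by iterating over the dictionary, marked it with 1 whenever one of the topic's terms appears.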

df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])


for k, v in terms.items():
    for val in v:
        # Flag the topic column for every sentence that contains this term.
        df.loc[df.Foo.str.contains(val), k] = 1


print (df)

I get:

>>> 
Foo Score animals games \
0 The quick brown fox was playing basketball today 4 1 1
1 John and Rob visited the eagles nest, the foxe... 6 1 NaN
2 Bill smells like a wet dog 2 NaN NaN
3 Steve threw the football at a deer. But the fo... 7 1 1
4 Sheriff John does not like hockey 8 NaN 1

people
0 NaN
1 1
2 NaN
3 1
4 1

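What is the best way to count the number of words from each topic that appear in a sentence? And is there a more efficient way to loop over the dictionary without using cython?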

Best answer

You can use split together with stack, which is about 5 times faster than the Counter solution:

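# Split each sentence into words and reshape to one word per row;
# the 'index' column keeps the label of the original row.
df1 = (df.Foo.str.split(expand=True)
         .stack()
         .dropna()  # no-op if stack() has already dropped the padding values
         .reset_index(level=1, drop=True)
         .reset_index(name='Foo'))

# One boolean column per topic: does the word contain any of the topic's terms?
for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(v))

# Sum only the topic columns so that the text column is not aggregated.
print(df1.groupby('index')[list(terms)].sum().astype(int))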
games animals people
index
0 1 1 0
1 0 2 2
2 0 0 0
3 2 1 1
4 1 0 1
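To make the reshaping step above easier to follow, here is a minimal sketch of what split plus stack produces on its own. The two sentences are made-up sample data, and the names `s`, `words` and `long_form` are illustrative only. The extra `dropna()` is an assumption added for robustness: it has no effect when `stack()` has already discarded the padding values.

```python
import pandas as pd

# Two short sentences standing in for the 'Foo' column (made-up sample data).
s = pd.Series(["the fox ran", "John likes hockey today"], name="Foo")

# split(expand=True) puts each word in its own column and pads shorter rows
# with missing values; stack() moves the columns into rows, and dropna()
# makes sure no padding value is left over.
words = s.str.split(expand=True).stack().dropna()

# Drop the word-position level so only the original row label remains,
# then turn that label into an ordinary column named 'index'.
long_form = words.reset_index(level=1, drop=True).reset_index(name="Foo")

print(long_form)
```

Each row of `long_form` holds one word together with the label of the sentence it came from, which is what lets `groupby('index')` add the matches back up per sentence.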

Timings:

In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop

In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop
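The `%timeit` lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The function `count_by_topic` and the small inputs below are hypothetical stand-ins written for this sketch, not the answer's `a` and `b`, and the numbers it prints will differ from the timings quoted above.

```python
import timeit

import pandas as pd

# Small made-up inputs; the names below are illustrative, not from the answer.
terms = {"animals": ["fox", "deer", "eagle"], "games": ["football", "hockey"]}
df = pd.DataFrame({"Foo": ["The fox saw a deer", "John does not like hockey"]})


def count_by_topic(frame):
    """Count, per row, the words that contain any term of each topic."""
    words = (frame["Foo"].str.split(expand=True)
             .stack()
             .dropna()
             .reset_index(level=1, drop=True))
    counts = {
        topic: words.str.contains("|".join(vals)).groupby(level=0).sum()
        for topic, vals in terms.items()
    }
    return pd.DataFrame(counts).astype(int)


# Run the function 10 times per trial, repeat for 3 trials, keep the best trial.
trials = timeit.repeat(lambda: count_by_topic(df), number=10, repeat=3)
best_ms = min(trials) / 10 * 1000
print(f"best of 3: {best_ms:.2f} ms per loop")
```

Taking the minimum over the trials mirrors the "best of 3" figure that `%timeit` reports.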

Code:

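from collections import Counter


def a(df):
    # split/stack approach: one word per row, then one boolean column per topic
    df1 = (df.Foo.str.split(expand=True).stack().dropna()
             .reset_index(level=1, drop=True).reset_index(name='Foo'))
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    # Sum only the topic columns so that the text column is not aggregated.
    return df1.groupby('index')[list(terms)].sum().astype(int)


def b(df):
    # Counter approach: replace each term with its word count, row by row
    df1 = pd.DataFrame(terms)

    res = []
    for i, r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)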
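As a possible alternative that is not part of the answer above, `Series.str.count` can count regex matches per sentence directly, without splitting into words first. This is only a sketch: the name `counts` is illustrative, and wrapping each term in `re.escape` is an added assumption so that terms containing regex metacharacters are matched literally.

```python
import re

import pandas as pd

terms = {
    'animals': ["fox", "deer", "eagle"],
    'people': ['John', 'Rob', 'Steve'],
    'games': ['basketball', 'football', 'hockey'],
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'The quick brown fox was playing basketball today',
        'John and Rob visited the eagles nest, the foxes ran away',
        'Bill smells like a wet dog',
        'Steve threw the football at a deer. But the football missed',
        'Sheriff John does not like hockey',
    ],
})

# For each topic, build one alternation pattern ("fox|deer|eagle") and count
# how many non-overlapping matches it has in every sentence.
counts = pd.DataFrame({
    topic: df['Foo'].str.count('|'.join(re.escape(w) for w in words))
    for topic, words in terms.items()
})

print(counts)
```

On this sample data the counts agree with the split/stack result. The two approaches can differ in general: `str.count` counts every match, so a single word that contains two terms is counted twice, whereas the split/stack version counts that word once.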

Regarding counting the number of regex matches in a string with Python pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36401422/
