python - pandas DataFrame: match multiple substrings and put the matched substring for each row into a new column

Reposted. Author: 行者123. Updated: 2023-11-28 18:10:40

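I am trying to extract some records from a DataFrame of survey responses. Each extracted record must contain at least one of a set of keywords. For example, suppose I have this DataFrame df: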

svy_rspns_txt
I like it
I hate it
It's a scam
It's shaddy
Scam!
Good service
Very disappointed

Now if I run

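kw = "hate,scam,shaddy,disappoint"
# In Python 3, str is already Unicode, so the Python 2 unicode(...) conversion is not needed
sensitive_words = kw.lower().split(",")
# na=False treats missing or non-string responses as non-matches
df = df[df["svy_rspns_txt"].str.contains('|'.join(sensitive_words), case=False, na=False)]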

I get this result:

svy_rspns_txt
I hate it
It's a scam
It's shaddy
Scam!
Very disappointed

Now, how can I add a "matched_word" column that shows the exact keyword that matched, so that I get a result like this:

svy_rspns_txt        matched_word
I hate it            hate
It's a scam          scam
It's shaddy          shaddy
Scam!                scam
Very disappointed    disappoint

Best answer

Use a generator expression with next:

import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})

kw = "hate,scam,shaddy,disappoint"

words = set(kw.split(','))

# For each row, take the first keyword found in the lowercased text; fall back to NaN if none matches
df['match'] = df['text'].apply(lambda x: next((i for i in words if i in x.lower()), np.nan))

print(df)

                text       match
0          I like it         NaN
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
5       Good service         NaN
6  Very disappointed  disappoint
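
As an alternative to apply, the same column can probably be built with a vectorized regex. The following is only a sketch and is not part of the original answer: it assumes the keywords should be treated as literal text (hence re.escape) and that the matched word should be reported in lowercase, as in the output above. The variable name pattern is introduced here for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({'text': ["I like it", "I hate it", "It's a scam", "It's shaddy",
                            "Scam!", "Good service", "Very disappointed"]})

kw = "hate,scam,shaddy,disappoint"

# Build one capture group of alternatives; re.escape guards against
# keywords that happen to contain regex metacharacters (an assumption).
pattern = '(' + '|'.join(re.escape(w) for w in kw.split(',')) + ')'

# str.extract returns the text of the first match as it appears in the row
# (e.g. "Scam"), so lowercase it to report the keyword itself.
# Rows with no match get NaN.
df['match'] = (df['text']
               .str.extract(pattern, flags=re.IGNORECASE, expand=False)
               .str.lower())

print(df)
```

Unlike iterating over a set, whose order is arbitrary, this reports the leftmost match in the text when a row contains more than one keyword.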

You can filter for valid strings with pd.Series.notnull, or by exploiting the fact that NaN != NaN:

res = df[df['match'].notnull()]
# or, res = df[df['match'].notna()]
# or, res = df[df['match'] == df['match']]

print(res)

                text       match
1          I hate it        hate
2        It's a scam        scam
3        It's shaddy      shaddy
4              Scam!        scam
6  Very disappointed  disappoint
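
To see why the self-comparison filter works, here is a minimal sketch on a small hypothetical Series (the names s, mask_notna and mask_self_eq are made up for illustration). NaN is the only value that compares unequal to itself, so s == s is False exactly where the value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical match column: two keywords and one missing value
s = pd.Series(['hate', np.nan, 'scam'])

mask_notna = s.notna()   # explicit missing-value check
mask_self_eq = (s == s)  # NaN != NaN, so this is False only for missing values

print(mask_notna.tolist())
print(mask_self_eq.tolist())
```

Both masks select the same rows here; notna (or notnull) states the intent more directly.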

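For more on python - pandas DataFrame: match multiple substrings and put the matched substring for each row into a new column, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/50916118/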
