
python - Python Pandas: merging on a substring within a string

Reposted. Author: 行者123. Updated: 2023-12-03 13:48:23

I have two DataFrames in the following format:

df_search

SEARCH
part1
anotherpart
onemorepart


df_all

FILE EXTENSION PATH
part1_1 .prt //server/folder1/part1_1
part1_2 .prt //server/folder2/part1_2
part1_2 .pdf //server/folder3/part1_2
part1_3 .prt //server/folder2/part1_3
anotherpart_1 .prt //server/folder1/anotherpart_1
anotherpart_2 .prt //server/folder3/anotherpart_2
anotherpart_3 .prt //server/folder2/anotherpart_3
anotherpart_3 .cgm //server/folder1/anotherpart_3
anotherpart_4 .prt //server/folder3/anotherpart_4
onemorepart_1 .prt //server/folder2/onemorepart_1
onemorepart_2 .prt //server/folder1/onemorepart_2
onemorepart_2 .dwg //server/folder2/onemorepart_2
onemorepart_3 .prt //server/folder1/onemorepart_3
onemorepart_4 .prt //server/folder1/onemorepart_4

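The full df_search has 15,000 rows and df_all has 550,000 rows. I am trying to merge the two DataFrames by checking whether each search term appears as a substring of the FILE string. My desired output looks like this: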
SEARCH       FILE            EXTENSION  PATH    
part1 part1_1 .prt //server/folder1/part1_1
part1 part1_2 .prt //server/folder2/part1_2
part1 part1_2 .pdf //server/folder3/part1_2
part1 part1_3 .prt //server/folder2/part1_3
anotherpart anotherpart_1 .prt //server/folder1/anotherpart_1
anotherpart anotherpart_2 .prt //server/folder3/anotherpart_2
anotherpart anotherpart_3 .prt //server/folder2/anotherpart_3
anotherpart anotherpart_3 .cgm //server/folder1/anotherpart_3
anotherpart anotherpart_4 .prt //server/folder3/anotherpart_4
onemorepart onemorepart_1 .prt //server/folder2/onemorepart_1
onemorepart onemorepart_2 .prt //server/folder1/onemorepart_2
onemorepart onemorepart_2 .dwg //server/folder2/onemorepart_2
onemorepart onemorepart_3 .prt //server/folder1/onemorepart_3
onemorepart onemorepart_4 .prt //server/folder1/onemorepart_4

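A plain DataFrame merge will not work because the strings never match exactly (the search term is always a substring of the file name). Based on other Stack Overflow questions, I also tried the following: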
df_all[df_all['FILE'].str.contains('|'.join(df_search['SEARCH']))]

This gives me the complete list of matching rows in df_all, but I cannot tell which search string produced which result.
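
To illustrate the limitation, here is a minimal sketch on hypothetical three-row sample data (not the real dataset). `str.contains` only yields a boolean mask, so the filtered frame carries no column recording which alternative of the pattern matched:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart"]})
df_all = pd.DataFrame({"FILE": ["part1_1", "anotherpart_2", "unrelated_9"]})

# One combined alternation pattern: "part1|anotherpart".
mask = df_all["FILE"].str.contains("|".join(df_search["SEARCH"]))
print(mask.tolist())            # [True, True, False]

# The mask filters rows but adds no information about the matching term.
filtered = df_all[mask]
print(list(filtered.columns))   # ['FILE']
```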

I managed to get it working with a for loop, but it is slow on my dataset (67 minutes):
import pandas as pd

super_df = []
for search_item in df_search['SEARCH']:
    # rows whose FILE contains this search term
    temp_df = df_all[df_all['FILE'].str.contains(search_item)].copy()
    temp_df.insert(0, 'SEARCH', search_item)  # record which term matched
    super_df.append(temp_df)
super_df = pd.concat(super_df, axis=0, ignore_index=True)

Can this be vectorized to improve performance?

Thanks

Best answer

Use str.extract + insert:

pat = "|".join(df_search.SEARCH)
df_all.insert(0, 'SEARCH', df_all['FILE'].str.extract("(" + pat + ')', expand=False))
print(df_all)
SEARCH FILE EXTENSION PATH
0 part1 part1_1 .prt //server/folder1/part1_1
1 part1 part1_2 .prt //server/folder2/part1_2
2 part1 part1_2 .pdf //server/folder3/part1_2
3 part1 part1_3 .prt //server/folder2/part1_3
4 anotherpart anotherpart_1 .prt //server/folder1/anotherpart_1
5 anotherpart anotherpart_2 .prt //server/folder3/anotherpart_2
6 anotherpart anotherpart_3 .prt //server/folder2/anotherpart_3
7 anotherpart anotherpart_3 .cgm //server/folder1/anotherpart_3
8 anotherpart anotherpart_4 .prt //server/folder3/anotherpart_4
9 onemorepart onemorepart_1 .prt //server/folder2/onemorepart_1
10 onemorepart onemorepart_2 .prt //server/folder1/onemorepart_2
11 onemorepart onemorepart_2 .dwg //server/folder2/onemorepart_2
12 onemorepart onemorepart_3 .prt //server/folder1/onemorepart_3
13 onemorepart onemorepart_4 .prt //server/folder1/onemorepart_4
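
The pattern above is built from the raw search strings, so it relies on them containing no regex metacharacters. The following is a hedged variant, not part of the original answer, that escapes each term with `re.escape`, lists longer terms first so that the longer one wins when two terms could match at the same position, and drops rows with no match to mimic an inner merge. The sample frames and the `unrelated_9` row are hypothetical stand-ins for the real data.

```python
import re

import pandas as pd

# Hypothetical miniature versions of the question's frames.
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart", "onemorepart"]})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "anotherpart_3", "onemorepart_2", "unrelated_9"],
    "EXTENSION": [".prt", ".cgm", ".dwg", ".prt"],
})

# Escape each term so characters such as "." or "+" are matched literally,
# and order terms longest first so a longer term is tried before a shorter
# one at any given position in the string.
terms = sorted(df_search["SEARCH"].astype(str), key=len, reverse=True)
pat = "(" + "|".join(re.escape(t) for t in terms) + ")"

# Work on a copy so the original frame is left untouched.
result = df_all.copy()
result.insert(0, "SEARCH", result["FILE"].str.extract(pat, expand=False))

# Rows that matched no search term hold NaN; dropping them mimics an inner merge.
result = result.dropna(subset=["SEARCH"]).reset_index(drop=True)
print(result)
```

Note that `str.extract` keeps only the first match per row, so a file name containing several search terms is assigned to one of them, whereas the loop in the question would emit one row per matching term.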

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/48743662/
