
Python: Combining str.contains and merge in pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-03 12:01:33

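I have two dataframes that look something like the ones below (the Content column in df1 actually holds the full text of an article, not just a single sentence as in my example):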

    PDF     Content
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 1111 Johannes writes about apples and oranges and that's great.
3 8000 Content that cannot be matched to the anything in df2.
4 3993 There is an interesting piece on bananas plus kiwis as well.
...

(Total: 5709 entries)

    Author        Title
1 Johannes Apples and oranges
2 Peter Bananas and pears and grapes
3 Hannah Bananas plus kiwis
4 Helena Mangos and peaches
...

(Total: 10228 entries)

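I want to merge the two dataframes by searching for each df2 Title in the df1 Content. A title counts as a match if it appears anywhere in the first 2500 characters of the content. Note: it is important to keep all entries from df1. From df2, by contrast, I only want to keep the entries that match (i.e. a left join). Note: all Titles are unique values.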
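To make the matching rule concrete, here is a minimal sketch of the test for a single title, using the sample rows from the question. The helper name `title_in_head` and its `limit` parameter are illustrative only, and the case-insensitive comparison is an assumption based on the sample data (titles are capitalized, article text is not).

```python
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})

def title_in_head(contents: pd.Series, title: str, limit: int = 2500) -> pd.Series:
    """Boolean mask: does `title` occur in the first `limit` characters (ignoring case)?"""
    head = contents.str[:limit]
    # regex=False treats the title as a literal string, so characters
    # such as "+" or "(" in a title are not interpreted as regex syntax.
    return head.str.contains(title, case=False, regex=False)

mask = title_in_head(df1["Content"], "Apples and oranges")
print(df1.loc[mask, "PDF"].tolist())
```

Note that this title matches two of the sample articles, so a rule for choosing between several candidate titles is needed before merging.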

Desired output (column order does not matter):

    Author     Title                        PDF     Content
1 Peter Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 Johannes Apples and oranges 1111 Johannes writes about apples and oranges and that's great.
3 NaN NaN 8000 Content that cannot be matched to the anything in df2.
4 Hannah Bananas plus kiwis 3993 There is an interesting piece on bananas plus kiwis as well.
...

I think I need a combination of pd.merge and str.contains, but I can't figure out how to do it!

Best answer

Warning: this solution may be slow :).
1. Get the list of titles
2. Create an index for df1 based on the order of the title list
3. Concatenate df1 and df2 on idx

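    import itertools
    import pandas as pd

    # Lower-case the titles once; position i in the list corresponds to row i
    # of df2 (this assumes df2 has the default 0..n-1 index).
    lst = [item.lower() for item in df2.Title.tolist()]
    # Unmatched rows get fresh index values past the end of the title list.
    unmatched = itertools.count(len(lst))

    def func(row):
        # Only the first 2500 characters of the content are searched.
        content = row[:2500].lower()
        for i, item in enumerate(lst):
            if item in content:
                return i  # first title, in list order, that occurs
        return next(unmatched)

    df1 = df1.assign(idx=df1.Content.apply(func))

    # Align df1 (indexed by idx) with df2 side by side; idx values must be unique.
    res = pd.concat([df1.set_index('idx'), df2], axis=1)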
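The question asks for a left join that keeps every df1 row, and its sample article for PDF 1234 mentions two different titles. Below is a minimal sketch of one way to handle both points. It assumes (this is not stated in the original) that the title occurring earliest in the text should win, and it records the matched title per row before calling `merge` with `how="left"`. The helper name `earliest_title` is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

titles = df2["Title"].tolist()
lowered = [t.lower() for t in titles]

def earliest_title(content, limit=2500):
    """Return the title found earliest in the first `limit` characters, or None."""
    head = content[:limit].lower()
    best_pos, best_title = None, None
    for title, low in zip(titles, lowered):
        pos = head.find(low)
        # Assumption: when several titles occur, prefer the earliest position.
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_pos, best_title = pos, title
    return best_title

# Attach the matched title to each df1 row, then left-join df2 on it.
matched = df1.assign(Title=df1["Content"].apply(earliest_title))
res = matched.merge(df2, on="Title", how="left")
print(res[["Author", "Title", "PDF"]])
```

Because the join key is an ordinary column rather than the index, several articles may match the same title without causing an error, and df2 rows that match nothing (Helena here) are left out.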

Output

      PDF                                            Content    Author  \
0 1111.0 Johannes writes about apples and oranges and t... Johannes
1 1234.0 This article is about bananas and pears and gr... Peter
2 3993.0 There is an interesting piece on bananas plus ... Hannah
3 NaN NaN Helena
4 8000.0 Content that cannot be matched to the anything... NaN

Title
0 Apples and oranges
1 Bananas and pears and grapes
2 Bananas plus kiwis
3 Mangos and peaches
4 NaN
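The printed result is an outer join: it still contains the unmatched df2 row (Helena), and PDF has become a float column because of the NaN. A minimal sketch of trimming it down to the left join the question asks for, assuming the frame is rebuilt from the values printed above (Content is left out because it is truncated in the printout):

```python
import numpy as np
import pandas as pd

# Rebuilt by hand from the printed output; Content is omitted.
res = pd.DataFrame({
    "PDF": [1111.0, 1234.0, 3993.0, np.nan, 8000.0],
    "Author": ["Johannes", "Peter", "Hannah", "Helena", np.nan],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches", np.nan],
})

# A missing PDF means the row exists only in df2, so dropping those rows
# keeps every df1 entry plus the df2 entries that matched.
left = (
    res.dropna(subset=["PDF"])
       .astype({"PDF": "int64"})   # restore the integer dtype lost to NaN
       .reset_index(drop=True)
)
print(left)
```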

Regarding "Python: Combining str.contains and merge in pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46814225/
