
python - A faster approach than nested for loops for custom conditions on multiple columns in two DataFrames

Reposted. Author: 行者123. Updated: 2023-12-04 07:22:29

I have two data frames, as shown below:

df1
+------------+-------------------+-------------+
| Name | Topic | Date |
+------------+-------------------+-------------+
| ABC | Data Science | 2020-01-01 |
| DEF | Machine Learning | 2021-03-06 |
| ABC | Cybersecurity | 2021-01-05 |
| BHL | Cloud Computing | 2020-11-09 |
+------------+-------------------+-------------+

It has around 50,000 rows.
The second data frame has several columns, but I am only interested in the following three:
df2
+------------------------------------+------+-------------+
| Description | Name | Created Date|
+------------------------------------+------+-------------+
| This is good Data Science project. | XYZ | 2021-06-04 |
| Cybersecurity is important. | BBB | 2021-02-03 |
| I am Data Science Professional | ABC | 2021-02-08 |
| Machine Learning is strategic. | DEF | 2021-03-01 |
+------------------------------------+------+-------------+

It has around 300,000 rows.
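I want to find all rows in df2 where:
For each unique (Name, Topic, Date) in df1, 'Name' matches in df2, 'Created Date' falls within the six months following 'Date' in df1, and 'Topic' appears in 'Description'.
I used two for loops to iterate over the rows of each data frame, as shown below. However, with this many rows, iterating row by row does not seem like the best approach. Can you suggest a faster, more efficient way to do this? I would also like to attach 'Topic' and 'Date' from df1 to each matching row of df2 (some kind of merge, but I am not sure how to do it).
My code is as follows: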
import pandas as pd
from dateutil.relativedelta import relativedelta

df1 = df1.drop_duplicates()  # Drop duplicate entries

matches = []  # collect matching df2 rows (DataFrame.append was removed in pandas 2.0)

for index1, row1 in df1.iterrows():
    future_date = row1['Date'] + relativedelta(months=6)
    for index2, row2 in df2.iterrows():
        if ((row1['Name'] == row2['Name']) and (row1['Date'] < row2['Created Date'] < future_date)
                and (row1['Topic'] in row2['Description'])):
            matches.append(row2)

df_final = pd.DataFrame(matches)

Best answer

Try these steps:

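import pandas as pd

# drop duplicate rows in df1
df1 = df1.drop_duplicates()
# merge df2 with df1 on Name (this also attaches Topic and Date to each df2 row)
df2 = df2.merge(df1, how='inner', on='Name')
# end of the six-month window (assumes Date and Created Date are datetime64 columns)
future_date = df2['Date'] + pd.DateOffset(months=6)
# keep rows whose Created Date falls strictly inside the window
df2 = df2[(df2['Date'] < df2['Created Date']) & (df2['Created Date'] < future_date)]
# keep rows whose Description contains the Topic
df2 = df2[df2.apply(lambda x: x['Topic'] in x['Description'], axis=1)]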
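The steps above can be tried end to end on the sample tables from the question. The following is only a sketch under some assumptions: the function name `match_rows` and its `months` parameter are made up for illustration, the date columns are assumed to arrive as strings and are parsed with `pd.to_datetime`, and the last row of `df2` is a hypothetical addition, because none of the four sample rows in the question satisfies all three conditions.

```python
import pandas as pd


def match_rows(df1: pd.DataFrame, df2: pd.DataFrame, months: int = 6) -> pd.DataFrame:
    """Return df2 rows that match a df1 (Name, Topic, Date), with Topic and Date attached."""
    # Parse the date columns on copies so the caller's frames are left unchanged.
    left = df1.drop_duplicates().assign(Date=lambda d: pd.to_datetime(d["Date"]))
    right = df2.assign(**{"Created Date": pd.to_datetime(df2["Created Date"])})

    # Pair every df2 row with every df1 row that shares its Name.
    merged = right.merge(left, how="inner", on="Name")

    # Keep pairs where Date < Created Date < Date + `months` months.
    window_end = merged["Date"] + pd.DateOffset(months=months)
    in_window = (merged["Date"] < merged["Created Date"]) & (merged["Created Date"] < window_end)
    merged = merged[in_window]

    # Keep pairs whose Description contains the Topic (case-sensitive substring test).
    has_topic = pd.Series(
        [topic in desc for topic, desc in zip(merged["Topic"], merged["Description"])],
        index=merged.index,
        dtype=bool,
    )
    return merged[has_topic].reset_index(drop=True)


df1 = pd.DataFrame({
    "Name": ["ABC", "DEF", "ABC", "BHL"],
    "Topic": ["Data Science", "Machine Learning", "Cybersecurity", "Cloud Computing"],
    "Date": ["2020-01-01", "2021-03-06", "2021-01-05", "2020-11-09"],
})

df2 = pd.DataFrame({
    "Description": [
        "This is good Data Science project.",
        "Cybersecurity is important.",
        "I am Data Science Professional",
        "Machine Learning is strategic.",
        "Cybersecurity audit notes",  # hypothetical row, not in the question
    ],
    "Name": ["XYZ", "BBB", "ABC", "DEF", "ABC"],
    "Created Date": ["2021-06-04", "2021-02-03", "2021-02-08", "2021-03-01", "2021-03-01"],
})

result = match_rows(df1, df2)
print(result)
```

One thing to keep in mind with this approach: the merge materializes every df1/df2 pair that shares a Name before any filtering happens, so with 50,000 and 300,000 rows the intermediate frame can get large if a few names repeat often. The zip-based substring test is a swapped-in alternative to the row-wise `apply` in the answer; it checks the same condition without building a Series per row.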

Regarding python - A faster approach than nested for loops for custom conditions on multiple columns in two DataFrames, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68414014/
