python - Can I make this code more efficient? It currently takes about 6 hours to run on roughly 1 million entries

Reposted by: 行者123 Updated: 2023-12-01 00:56:44

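I have two DataFrames:

trips_df, total entries = 1,048,568

weather_df, total entries = 2,654

I am trying to compute total_precipitation for each trip and append it as a column. To do this, I take each trip's start_timestamp and end_timestamp from trips_df, look them up in weather_df, sum the precipitation_amount over that time range, and append the result to trips_df as a new column.

The code I use for this: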

import datetime as dt

def sum_precipitation(datetime1, datetime2, weather_data):
    # Round the trip start down, and the trip end up, to the full hour.
    time1_rd = datetime1.replace(minute=0, second=0)
    time2_ru = datetime2.replace(minute=0, second=0) + dt.timedelta(hours=1)

    # Both membership tests build a set from the whole column on every call.
    if time1_rd in set(weather_data['start_precipitation_datetime']):
        start_idx = weather_data.start_precipitation_datetime[
            weather_data.start_precipitation_datetime==time1_rd].index[0]

        if time2_ru in set(weather_data['end_precipitation_datetime']):
            end_idx = weather_data.end_precipitation_datetime[
                weather_data.end_precipitation_datetime==time2_ru].index[0]

            # Index labels are used as positions here, so this relies on the
            # default 0..n-1 index; column 7 holds the precipitation values.
            precipitation_sum = weather_data.iloc[start_idx:end_idx+1, 7].sum()
        else:
            precipitation_sum = 0
    else:
        precipitation_sum = 0

    return round(precipitation_sum, 3)

def join_weather_to_trips(trips_data, weather_data):
    trips_weather_df = trips_data.copy()

    # Row-wise apply: one Python-level call per trip.
    fn = lambda row : sum_precipitation(row.start_timestamp, row.end_timestamp, weather_data)
    col = trips_data.apply(fn, axis=1)
    trips_weather_df = trips_weather_df.assign(total_precipitation=col.values)

    return trips_weather_df


trip_weather_df = join_weather_to_trips(trips_df, weather_df)

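I ran the code on a subset of 65 entries, and it took about 1.3 seconds (CPU times: user 1.27 s, sys: 8.77 ms, total: 1.28 s; Wall time: 1.28 s). Extrapolating that performance linearly to my full data, it would take (1.3 * 1048568)/65 = 20971.36 seconds, or about 5.8 hours.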
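The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration, and it assumes the runtime scales linearly with the number of trips.

```python
# Extrapolate the measured subset timing linearly to the full dataset.
subset_rows = 65          # trips in the timed subset
subset_seconds = 1.3      # measured wall time for the subset
total_rows = 1_048_568    # trips in the full trips_df

estimated_seconds = subset_seconds * total_rows / subset_rows
estimated_hours = estimated_seconds / 3600

print(f"{estimated_seconds:.2f} s, about {estimated_hours:.1f} h")
```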

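Could someone with more experience tell me whether I am doing this correctly, where I could speed this code up, or whether there are alternatives (for example, a faster implementation)?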

Best answer

This may not be the fastest approach, but you could try:

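# Start from a float column so the in-place additions keep a float dtype.
trips_df['precipitation_amount'] = 0.0

# Loop over the (few) weather rows instead of the (many) trips.
for s,e,p in zip(weather_df['start_precipitation_datetime'], 
               weather_df['end_precipitation_datetime'],
               weather_df.precipitation_amount):
    # Select trips whose start or end falls inside this weather interval.
    masks = trips_df.start_timestamp.between(s,e) | trips_df.end_timestamp.between(s,e)
    trips_df.loc[masks, 'precipitation_amount'] += p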

On my computer, this takes 10 seconds for 1 million trips and 260 weather records. So the actual data should take roughly 100 seconds.

Update: I did try it with 1 million trips and 2,600 weather records. Wall time: 1 min 36 s.
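If the timings above are still too slow, a loop-free variant is possible. The following is a minimal sketch and not part of the original answer: it assumes the weather rows form contiguous, non-overlapping intervals with no gaps, that all timestamps are timezone-naive, and that every trip starts no earlier than the first weather interval. The function name add_total_precipitation and the toy data are hypothetical. It uses a prefix sum over precipitation_amount and numpy.searchsorted to find the first and last weather interval each trip touches.

```python
import numpy as np
import pandas as pd

def add_total_precipitation(trips, weather):
    """Sum precipitation over every weather interval a trip touches (sketch)."""
    weather = weather.sort_values("start_precipitation_datetime")
    starts = weather["start_precipitation_datetime"].to_numpy()

    # prefix[i] is the total precipitation of the first i weather rows.
    amounts = weather["precipitation_amount"].to_numpy(dtype=float)
    prefix = np.concatenate(([0.0], amounts.cumsum()))

    trip_start = trips["start_timestamp"].to_numpy()
    trip_end = trips["end_timestamp"].to_numpy()

    # Position of the interval containing the trip start (clipped at 0).
    lo = np.clip(np.searchsorted(starts, trip_start, side="right") - 1, 0, None)
    # One past the position of the interval containing the trip end.
    hi = np.searchsorted(starts, trip_end, side="right")

    total = prefix[hi] - prefix[lo]
    return trips.assign(total_precipitation=np.round(total, 3))

# Toy data: four hourly weather rows and three trips.
hours = pd.date_range("2019-05-01 00:00", periods=4, freq=pd.Timedelta(hours=1))
weather_demo = pd.DataFrame({
    "start_precipitation_datetime": hours,
    "end_precipitation_datetime": hours + pd.Timedelta(hours=1),
    "precipitation_amount": [0.1, 0.2, 0.0, 0.4],
})
trips_demo = pd.DataFrame({
    "start_timestamp": pd.to_datetime(
        ["2019-05-01 00:15", "2019-05-01 00:50", "2019-05-01 03:05"]),
    "end_timestamp": pd.to_datetime(
        ["2019-05-01 00:45", "2019-05-01 02:10", "2019-05-01 03:30"]),
})

result = add_total_precipitation(trips_demo, weather_demo)
print(result)
```

Because both lookups are binary searches over the sorted weather starts, the work grows with the number of trips times the logarithm of the number of weather rows, rather than with their product. Unlike the code in the question, this sketch does not return 0 for trips that fall outside the weather data, so such trips would need separate handling.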

Note: in the loop above, you may need to subtract one minute from weather_df['end_precipitation_datetime'] to avoid double counting when a trip starts exactly on the hour.
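The double counting mentioned in the note can be seen on a tiny example. This is an illustrative sketch with made-up data, and the helper name total_precipitation is hypothetical. It relies on the fact that Series.between includes both endpoints by default, so a trip starting at exactly 01:00 matches both the 00:00 to 01:00 interval and the 01:00 to 02:00 interval.

```python
import pandas as pd

# Two adjacent hourly weather intervals that share the 01:00 boundary.
weather = pd.DataFrame({
    "start_precipitation_datetime": pd.to_datetime(
        ["2019-05-01 00:00", "2019-05-01 01:00"]),
    "end_precipitation_datetime": pd.to_datetime(
        ["2019-05-01 01:00", "2019-05-01 02:00"]),
    "precipitation_amount": [1.0, 2.0],
})

# One trip that starts exactly on the hour.
trips = pd.DataFrame({
    "start_timestamp": pd.to_datetime(["2019-05-01 01:00"]),
    "end_timestamp": pd.to_datetime(["2019-05-01 01:30"]),
})

def total_precipitation(trips, weather, end_offset):
    """Same loop as in the answer, with the interval end shifted by end_offset."""
    totals = pd.Series(0.0, index=trips.index)
    ends = weather["end_precipitation_datetime"] - end_offset
    for s, e, p in zip(weather["start_precipitation_datetime"], ends,
                       weather["precipitation_amount"]):
        mask = (trips["start_timestamp"].between(s, e)
                | trips["end_timestamp"].between(s, e))
        totals.loc[mask] += p
    return totals

raw = total_precipitation(trips, weather, pd.Timedelta(0))
adjusted = total_precipitation(trips, weather, pd.Timedelta(minutes=1))

print(raw.tolist())       # [3.0]: the 00:00 hour is counted as well
print(adjusted.tolist())  # [2.0]: only the 01:00 hour is counted
```

One caveat of the one-minute shift: a timestamp with second resolution inside the last minute of an hour (for example 00:59:30) would then match no interval at all, so a smaller offset may be preferable if the data has sub-minute precision.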

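Regarding "python - Can I make this code more efficient? It currently takes about 6 hours to run on roughly 1 million entries", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56191458/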