
python - Left merge of dask dataframes results in an empty dataframe

Reposted. Author: 行者123. Updated: 2023-11-28 19:03:51

I have the following code:

import pandas as pd
import dask.dataframe as dd

raw_data = pd.DataFrame({'username':list('ab')*10, 'user_agent': list('cdef')*5, 'method':['POST'] * 20, 'dst_port':[80]*20, 'dst':['1.1.1.1']*20})
past = pd.DataFrame({'user_agent':list('cde'), 'percent':[0.3, 0.3, 0.4]})
past = past.set_index('user_agent')
dask_raw = dd.from_pandas(raw_data, npartitions=4)
dask_past = dd.from_pandas(past, npartitions=4)
merged_raw = dask_raw.merge(dask_past, how='left', left_on='user_agent', right_index=True)

Calling compute on merged_raw gives a dataframe of this form:

Out[20]: 
        dst  dst_port  method  user_agent  username  percent
12  1.1.1.1        80    POST           c         a      0.3
16  1.1.1.1        80    POST           c         a      0.3
 8  1.1.1.1        80    POST           c         a      0.3
 0  1.1.1.1        80    POST           c         a      0.3
 4  1.1.1.1        80    POST           c         a      0.3
10  1.1.1.1        80    POST           e         a      0.4
11  1.1.1.1        80    POST           f         b      NaN
13  1.1.1.1        80    POST           d         b      0.3
14  1.1.1.1        80    POST           e         a      0.4
15  1.1.1.1        80    POST           f         b      NaN
17  1.1.1.1        80    POST           d         b      0.3
18  1.1.1.1        80    POST           e         a      0.4
19  1.1.1.1        80    POST           f         b      NaN
 5  1.1.1.1        80    POST           d         b      0.3
 6  1.1.1.1        80    POST           e         a      0.4
 7  1.1.1.1        80    POST           f         b      NaN
 9  1.1.1.1        80    POST           d         b      0.3
 1  1.1.1.1        80    POST           d         b      0.3
 2  1.1.1.1        80    POST           e         a      0.4
 3  1.1.1.1        80    POST           f         b      NaN

Computing the features:

grouped_by_df = merged_raw.groupby(['username', 'dst', 'dst_port'])
feature_one = grouped_by_df.apply(lambda x: 'POST' in x.values).to_frame(name='feature_one')
feature_two = grouped_by_df.percent.min()
feature_two = feature_two.fillna(0)
feature_two = feature_two.to_frame(name='feature_two')
features_three = grouped_by_df.method.apply(lambda x: 'CONNECT' in x.values).to_frame(name='feature_three')
features = feature_one.merge(feature_two, left_index=True, right_index=True, how='left')
features.compute()
                           feature_one  feature_two
username dst     dst_port
a        1.1.1.1 80               True          0.3
b        1.1.1.1 80               True          0.3

features_full = features.merge(features_three, how='left', right_index=True, left_index=True)
features_full.compute()

The result is:

Out[53]: 
Empty DataFrame
Columns: [feature_one, feature_two, feature_three]
Index: []

But features_three has values and has the same index as features:

features_three.compute()
username  dst      dst_port
a         1.1.1.1  80          False
b         1.1.1.1  80          False

Why does dask return an empty dataframe?

Best answer

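This does not fully solve your problem, but this is the result I get if I compute the merged_raw dataframe first. If I comment out the merged_raw.compute() call, I get an error message. I wonder whether you could use pandas all the way through and use dask delayed functions for the parallel computation.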

import dask.dataframe as dd
import pandas as pd

raw_data = pd.DataFrame({'username':list('ab')*10, 'user_agent': list('cdef')*5, 'method':['POST'] * 20, 'dst_port':[80]*20, 'dst':['1.1.1.1']*20})
past = pd.DataFrame({'user_agent':list('cde'), 'percent':[0.3, 0.3, 0.4]})
past = past.set_index('user_agent')

dask_raw = dd.from_pandas(raw_data, npartitions=4)
dask_past = dd.from_pandas(past, npartitions=4)
merged_raw = dask_raw.merge(dask_past, how='left', left_on='user_agent', right_index=True)

merged_raw = merged_raw.compute()

grouped_by_df = merged_raw.groupby(['username', 'dst', 'dst_port'])
feature_one = grouped_by_df.apply(lambda x: 'POST' in x.values).to_frame(name='feature_one')
feature_two = grouped_by_df.percent.min()
feature_two = feature_two.fillna(0)
feature_two = feature_two.to_frame(name='feature_two')
features_three = grouped_by_df.method.apply(lambda x: 'CONNECT' in x.values).to_frame(name='feature_three')

features = feature_one.merge(feature_two, left_index=True, right_index=True, how='left')

features_full = features.merge(features_three, how='left', right_index=True, left_index=True)

features_full
Out[85]:
                           feature_one  feature_two  feature_three
username dst     dst_port
a        1.1.1.1 80               True          0.3          False
b        1.1.1.1 80               True          0.3          False
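
The idea of staying in pandas and parallelising with dask delayed functions could be sketched roughly as follows. This is only an illustration under assumptions, not the answerer's code: `build_features` and `GROUP_KEYS` are hypothetical names, the data is split by `username` so that each group falls into exactly one chunk, and `feature_one` checks only the `method` column, which is narrower than the original check over all values of the group.

```python
import pandas as pd
from dask import compute, delayed

# Hypothetical constant: the grouping columns used in the question.
GROUP_KEYS = ['username', 'dst', 'dst_port']


def build_features(chunk: pd.DataFrame) -> pd.DataFrame:
    """Compute the three features for one chunk using plain pandas."""
    # Boolean helper columns make the per-group checks simple aggregations.
    flags = chunk.assign(
        has_post=chunk['method'].eq('POST'),
        has_connect=chunk['method'].eq('CONNECT'),
    )
    grouped = flags.groupby(GROUP_KEYS)
    return pd.DataFrame({
        'feature_one': grouped['has_post'].any(),
        'feature_two': grouped['percent'].min().fillna(0),
        'feature_three': grouped['has_connect'].any(),
    })


raw_data = pd.DataFrame({
    'username': list('ab') * 10,
    'user_agent': list('cdef') * 5,
    'method': ['POST'] * 20,
    'dst_port': [80] * 20,
    'dst': ['1.1.1.1'] * 20,
})
past = pd.DataFrame({'user_agent': list('cde'), 'percent': [0.3, 0.3, 0.4]})
past = past.set_index('user_agent')

# Left merge in pandas, same arguments as the dask merge in the question.
merged = raw_data.merge(past, how='left', left_on='user_agent', right_index=True)

# Assumption: splitting by username keeps every group inside a single chunk.
chunks = [chunk for _, chunk in merged.groupby('username')]

# Each chunk becomes one lazy task; compute() runs the tasks in parallel.
tasks = [delayed(build_features)(chunk) for chunk in chunks]
features_full = pd.concat(list(compute(*tasks))).sort_index()
print(features_full)
```

Because every task returns an ordinary pandas dataframe, the final step is a plain `pd.concat` and no dask dataframe merge is involved.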


# Error message when the merged_raw.compute() command is commented:
C:/CODE/apostolos.py:23: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
feature_one = grouped_by_df.apply(lambda x: 'POST' in x.values).to_frame(name='feature_one')
C:/CODE/apostolos.py:31: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
features_three = grouped_by_df.method.apply(lambda x: 'CONNECT' in x.values).to_frame(name='feature_three')

Regarding python - left merge of dask dataframes results in an empty dataframe, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49485426/
