gpt4 book ai didi

python - 分层数据 : efficiently build a list of every descendant for each node

转载 作者:太空狗 更新时间:2023-10-29 17:18:27 25 4
gpt4 key购买 nike

我有一个两列数据集,描述了形成一棵大树的多个父子关系。我想用它来为每个节点构建每个后代的更新列表。

原始输入:

   child  parent
1 2010 1000
7 2100 1000
5 2110 1000
3 3000 2110
2 3011 2010
4 3033 2100
0 3102 2010
6 3111 2110

关系的图形描述:

example-data-relationship-tree

预期输出:

    descendant  ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110

最初我决定使用 DataFrame 的递归解决方案。它按预期工作,但 Pandas 非常低效。我的研究让我相信,使用 NumPy 数组(或其他简单数据结构)的实现在大型数据集(成千上万条记录的 10 的数据集)上会快得多。

使用数据框的解决方案:

import pandas as pd

df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)


def get_ancestry_dataframe_flat(df):

def get_child_list(parent_id):

list_of_children = list()
list_of_children.append(df[df['parent'] == parent_id]['child'].values)

for i, r in df[df['parent'] == parent_id].iterrows():
if r['child'] != parent_id:
list_of_children.append(get_child_list(r['child']))

# flatten list
list_of_children = [item for sublist in list_of_children for item in sublist]
return list_of_children

new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
for index, row in df.iterrows():
temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
temp_df['ancestor'] = row['parent']
new_df = new_df.append(temp_df)

new_df = new_df\
.drop_duplicates()\
.sort_values(['ancestor', 'descendant'])\
.reset_index(drop=True)

return new_df

因为以这种方式使用 pandas DataFrames 在大型数据集上效率非常低,我需要提高此操作的性能。我的理解是,这可以通过使用更高效的数据结构来更好地完成适用于循环和递归。我想以最有效的方式执行相同的操作。

具体来说,我要求优化速度。

最佳答案

这是一种使用 numpy 一次迭代树下一代的方法。

代码:

import numpy as np
import pandas as pd # only used to return a dataframe


def list_ancestors(edges):
"""
Take edge list of a rooted tree as a numpy array with shape (E, 2),
child nodes in edges[:, 0], parent nodes in edges[:, 1]
Return pandas dataframe of all descendant/ancestor node pairs

Ex:
df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
'parent': [100, 100, 200, 200, 201, 300]})

df
child parent
0 200 100
1 201 100
2 300 200
3 301 200
4 302 201
5 400 300

list_ancestors(df.values)

returns

descendant ancestor
0 200 100
1 201 100
2 300 200
3 300 100
4 301 200
5 301 100
6 302 201
7 302 100
8 400 300
9 400 200
10 400 100
"""
ancestors = []
for ar in trace_nodes(edges):
ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1),
ar[:, 1:].flatten()])
return pd.DataFrame(np.concatenate(ancestors),
columns=['descendant', 'ancestor'])


def trace_nodes(edges):
"""
Take edge list of a rooted tree as a numpy array with shape (E, 2),
child nodes in edges[:, 0], parent nodes in edges[:, 1]
Yield numpy array with cross-section of tree and associated
ancestor nodes

Ex:
df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
'parent': [100, 100, 200, 200, 201, 300]})

df
child parent
0 200 100
1 201 100
2 300 200
3 301 200
4 302 201
5 400 300

trace_nodes(df.values)

yields

array([[200, 100],
[201, 100]])

array([[300, 200, 100],
[301, 200, 100],
[302, 201, 100]])

array([[400, 300, 200, 100]])
"""
mask = np.in1d(edges[:, 1], edges[:, 0])
gen_branches = edges[~mask]
edges = edges[mask]
yield gen_branches
while edges.size != 0:
mask = np.in1d(edges[:, 1], edges[:, 0])
next_gen = edges[~mask]
gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
edges = edges[mask]
yield gen_branches


def numpy_col_inner_many_to_one_join(ar1, ar2):
"""
Take two 2-d numpy arrays ar1 and ar2,
with no duplicate values in first column of ar2
Return inner join of ar1 and ar2 on
last column of ar1, first column of ar2

Ex:

ar1 = np.array([[1, 2, 3],
[4, 5, 3],
[6, 7, 8],
[9, 10, 11]])

ar2 = np.array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10],
[11, 12]])

numpy_col_inner_many_to_one_join(ar1, ar2)

returns

array([[ 1, 2, 3, 4],
[ 4, 5, 3, 4],
[ 9, 10, 11, 12]])
"""
ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
bins = np.bincount(ar1[:, -1])
counts = bins[bins.nonzero()[0]]
else:
counts = np.unique(ar1[:, -1], False, False, True)[1]
left = ar1[ar1[:, -1].argsort()]
right = ar2[ar2[:, 0].argsort()]
return np.concatenate([left[:, :-1],
right[np.repeat(np.arange(right.shape[0]),
counts)]], 1)

时序比较:

@taky2 提供的测试用例 1 和 2,测试用例 3 和 4 分别比较了高树结构和宽树结构的性能——大多数用例可能介于两者之间。

df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}
)

df2 = pd.DataFrame(
{
'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
}
)

df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
np.c_[np.arange(501, 1001), np.arange(500)]],
columns=['child', 'parent'])

df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
np.c_[np.arange(1001, 11001),
np.repeat(np.arange(1, 101), 100)]],
columns=['child', 'parent'])

%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop

%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop

%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 µs per loop

%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 µs per loop

%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop

%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop

%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop

%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 µs per loop

%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop

%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop

%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop

%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop

%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop

%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop

注意事项:

get_ancestry_dataframe_flat 由于时间和内存问题,案例 3 和 4 未计时。

add_children_of_children 修改为在内部识别根节点,但允许采用唯一的根。添加了第一行 root_node = (set(dataframe.parent) - set(dataframe.child)).pop()

all_descendants_nx 修改为接受数据框作为参数,而不是从外部命名空间中提取。

展示正确行为的示例:

np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True) ==\
list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True))
Out[20]: True

关于python - 分层数据 : efficiently build a list of every descendant for each node,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46722740/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com