python - Calculating the distance between the LDA distributions of two rows in a Pandas DataFrame

Reposted. Author: 行者123. Updated: 2023-12-04 07:42:41

I have a DataFrame that contains LDA topic distribution output along with other demographic information, as shown below:

single_df = pd.DataFrame([{"department": 'marketing', 'LDA_1': 0.252, 'LDA_2':0.002, 'LDA_3':0.50},
                          {"department": 'engineering', 'LDA_1': 0.478, 'LDA_2':0.152, 'LDA_3':0.492},
                          {"department": 'cooperate', 'LDA_1': 0.52, 'LDA_2':0.780, 'LDA_3':0.50},
                          {"department": "marketing", 'LDA_1': 0.352, 'LDA_2':0.052, 'LDA_3':0.20}])
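I want to get to the final DataFrame below. How can I write a function that computes the Jensen-Shannon distance between two rows (over the columns whose names contain "LDA_") and returns the DataFrame below?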
i  j  same_department  distance_LDA
0  1                0          0.23
0  2                0          0.43
0  3                1          0.26
1  2                0          0.24
1  3                0          0.11
2  3                0          0.29
I have already written code that computes the JS distance for an individual pair, as shown below. How can I turn it into a function?
from scipy.spatial import distance
array = single_df.filter(regex='LDA').to_numpy()
distance.jensenshannon(array[0], array[1])
Then, to work out whether two people share a department, I have the code below:
def same_department(i, j):
    if i['department'] == j['department']:
        return 1
    else:
        return 0
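
One detail about the `distance.jensenshannon` call in the question: the example rows do not sum to 1 (row 0 sums to 0.754), and SciPy normalizes each input vector before comparing them. Below is a minimal sketch of the same quantity computed by hand, assuming natural logarithms (SciPy's default when `base` is not given) and strictly positive entries. The helper name `js_distance_by_hand` is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def js_distance_by_hand(p, q):
    """Jensen-Shannon distance computed step by step (illustrative helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so each vector sums to 1, mirroring what SciPy does
    p = p / p.sum()
    q = q / q.sum()
    # Pointwise mean of the two distributions
    m = (p + q) / 2.0
    # Kullback-Leibler divergences; assumes every entry is > 0
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    # The distance is the square root of the averaged divergences
    return np.sqrt((kl_pm + kl_qm) / 2.0)


row0 = [0.252, 0.002, 0.50]
row1 = [0.478, 0.152, 0.492]
by_hand = js_distance_by_hand(row0, row1)
from_scipy = jensenshannon(row0, row1)
print(by_hand, from_scipy)
```

The two printed values should agree up to floating-point error. Entries that are exactly zero would need special handling in the hand-written version, which SciPy takes care of internally.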

Best answer

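Let's try generating all possible combinations of rows and merging them into a DataFrame in which each comparison can be made within a single row. Then apply the jensenshannon function row by row, based on the column suffixes: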

from itertools import combinations
from scipy.spatial.distance import jensenshannon
import pandas as pd

single_df = pd.DataFrame([{"department": 'marketing', 'LDA_1': 0.252,
                           'LDA_2': 0.002, 'LDA_3': 0.50},
                          {"department": 'engineering', 'LDA_1': 0.478,
                           'LDA_2': 0.152, 'LDA_3': 0.492},
                          {"department": 'cooperate', 'LDA_1': 0.52,
                           'LDA_2': 0.780, 'LDA_3': 0.50},
                          {"department": "marketing", 'LDA_1': 0.352,
                           'LDA_2': 0.052, 'LDA_3': 0.20}])

# Merge the 3 LDA Columns Into A Single Column Containing a List
single_df['LDA'] = single_df.filter(regex='^LDA_.*').agg(list, axis=1)
# Get Rid Of The Original LDA_X columns
single_df = single_df.filter(regex='^(?!LDA_.*)')

# Get All Row Combinations
a, b = map(list, zip(*combinations(single_df.index, 2)))

# Merge Together
df = single_df.loc[a].reset_index().merge(
    single_df.loc[b].reset_index(),
    left_index=True,
    right_index=True,
)

# Apply jensenshannon to LDA_x and LDA_y Lists
df['distance_LDA'] = df.apply(
    lambda x: jensenshannon(x['LDA_x'], x['LDA_y']), axis=1)

# Get If In Same Department
df['same_department'] = df['department_x'].eq(df['department_y']).astype(int)

# Rename and Filter Columns
df = df.rename(columns={'index_x': 'i', 'index_y': 'j'})[
    ['i', 'j', 'same_department', 'distance_LDA']]

# For Display
print(df.to_string(index=False))
Output:
i  j  same_department  distance_LDA
0  1                0      0.235849
0  2                0      0.429508
0  3                1      0.264777
1  2                0      0.238155
1  3                0      0.112456
2  3                0      0.299704
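
Since the question asks for a function, the same table can also be built by looping directly over row pairs, without the merge. What follows is only a sketch: the name `pairwise_lda_distances` and its `prefix` and `group_col` parameters are hypothetical, chosen for illustration, and are not part of pandas or SciPy.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import jensenshannon


def pairwise_lda_distances(df, prefix="LDA_", group_col="department"):
    """Return one row per unordered pair of rows in df (illustrative helper)."""
    # Topic columns as a 2-D array, one row per record
    topics = df.filter(regex="^" + prefix).to_numpy()
    groups = df[group_col].to_numpy()
    records = []
    # combinations yields positional pairs (0, 1), (0, 2), ... with i < j
    for i, j in combinations(range(len(df)), 2):
        records.append({
            "i": df.index[i],
            "j": df.index[j],
            "same_department": int(groups[i] == groups[j]),
            "distance_LDA": jensenshannon(topics[i], topics[j]),
        })
    return pd.DataFrame(
        records, columns=["i", "j", "same_department", "distance_LDA"])


single_df = pd.DataFrame([
    {"department": "marketing", "LDA_1": 0.252, "LDA_2": 0.002, "LDA_3": 0.50},
    {"department": "engineering", "LDA_1": 0.478, "LDA_2": 0.152, "LDA_3": 0.492},
    {"department": "cooperate", "LDA_1": 0.52, "LDA_2": 0.780, "LDA_3": 0.50},
    {"department": "marketing", "LDA_1": 0.352, "LDA_2": 0.052, "LDA_3": 0.20},
])

result = pairwise_lda_distances(single_df)
print(result.to_string(index=False))
```

The pairs come out in the same order as in the merge-based version because `combinations` enumerates them with i < j in lexicographic order. The loop makes n(n-1)/2 calls to `jensenshannon`, so this sketch is only practical for a modest number of rows.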

Regarding python - Calculating the distance between the LDA distributions of two rows in a Pandas DataFrame, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/67376488/
