python - Efficient way to dynamically pivot a table

Reposted. Author: 行者123. Updated: 2023-12-01 06:42:47

I have a table named monthly_agg that contains monthly aggregated data.

+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| yyyy_mm_dd | id | app | ex_status | active_status | active_count | active_base | ex_count | ex_base |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 123 | content | impl | impl | 390 | 321 | 344 | 340 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 333 | messages | impl | impl | 541 | 210 | 788 | 610 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+
| 2019-01-31 | 832 | photos | no | no | null | 430 | null | 100 |
+------------+-----+----------+-----------+---------------+--------------+-------------+----------+---------+

I would like each app to become its own column. Each app column should contain a percentage, calculated as follows:

df=spark.sql("""
    SELECT
        yyyy_mm_dd,
        id,
        app,
        SUM(CASE
                WHEN (app = 'content' AND ex_status = 'impl') THEN ex_count/ex_base
                WHEN (active_status = 'impl') THEN active_count/active_base
            END) AS percentage
    FROM
        monthly_agg
    GROUP BY
        yyyy_mm_dd,
        id,
        app
""")

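I need each app value to become a column, with the result of the calculation above as that column's value. How can I pivot the table this way using Pandas rather than HQL? Ideally, my output df would look like this: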

+------------+-----+--------------------+---------------------+
| yyyy_mm_dd | id | content_percentage | messages_percentage |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 123 | 1.2 | null |
+------------+-----+--------------------+---------------------+
| 2019-01-31 | 333 | null | 2.57 |
+------------+-----+--------------------+---------------------+

I have around 20 apps, so keeping this dynamic would be great.

Best answer

Use numpy.select to compute the percentage, then DataFrame.pivot_table to spread the apps into columns:

import numpy as np

# Conditions are tested in order, like the CASE WHEN branches in the question
m1 = (df.app == 'content') & (df.ex_status == 'impl')
m2 = df.active_status == 'impl'
# Value to use for each condition
s1 = df.ex_count / df.ex_base
s2 = df.active_count / df.active_base
# Rows matching neither condition get NaN
df['percentage'] = np.select([m1, m2], [s1, s2], default=np.nan)

# min_count=1 keeps an all-NaN group as NaN instead of summing it to 0
df1 = (df.pivot_table(index=['yyyy_mm_dd', 'id'],
                      columns='app',
                      values='percentage',
                      aggfunc=lambda x: x.sum(min_count=1))
         .add_suffix('_percentage')
         .reset_index())
print(df1)
app  yyyy_mm_dd   id  content_percentage  messages_percentage
0      20190131  123            1.011765                  NaN
1      20190131  333                 NaN             2.576190

Edit:

print(m1)
0     True
1    False
2    False
dtype: bool

print(m2)
0     True
1     True
2    False
Name: active_status, dtype: bool

print(s1)
0    1.011765
1    1.291803
2         NaN
dtype: float64

print(s2)
0    1.214953
1    2.576190
2         NaN
dtype: float64
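
For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. It rebuilds the three sample rows from the question as a pandas DataFrame and follows the CASE expression in the question (ex_count / ex_base for the first branch, active_count / active_base for the second). The helper name pivot_percentages is hypothetical and only used for illustration, and the string dates are an assumption about how the data is loaded.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the monthly_agg table in the question;
# SQL nulls are represented as NaN.
df = pd.DataFrame({
    "yyyy_mm_dd": ["2019-01-31"] * 3,
    "id": [123, 333, 832],
    "app": ["content", "messages", "photos"],
    "ex_status": ["impl", "impl", "no"],
    "active_status": ["impl", "impl", "no"],
    "active_count": [390, 541, np.nan],
    "active_base": [321, 210, 430],
    "ex_count": [344, 788, np.nan],
    "ex_base": [340, 610, 100],
})


def pivot_percentages(frame):
    """Hypothetical helper: one <app>_percentage column per app."""
    # Conditions are evaluated in order; the first match wins,
    # which mirrors CASE WHEN semantics.
    m1 = (frame["app"] == "content") & (frame["ex_status"] == "impl")
    m2 = frame["active_status"] == "impl"
    pct = np.select(
        [m1, m2],
        [frame["ex_count"] / frame["ex_base"],
         frame["active_count"] / frame["active_base"]],
        default=np.nan,  # rows matching neither branch
    )
    out = frame.assign(percentage=pct)
    return (
        out.pivot_table(index=["yyyy_mm_dd", "id"],
                        columns="app",
                        values="percentage",
                        # min_count=1 keeps all-NaN groups as NaN, not 0
                        aggfunc=lambda x: x.sum(min_count=1))
        .add_suffix("_percentage")
        .reset_index()
    )


result = pivot_percentages(df)
print(result)
```

Because pivot_table takes the column labels from the values found in app, the same code should cover all 20 or so apps without listing them. With the default dropna=True, an app whose percentage is NaN for every row (photos in this sample) does not get a column.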

This post on python - efficient way to dynamically pivot a table is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/59371056/
