
python - Applying calculations to filtered values in a Pandas DataFrame

Reposted. Author: 行者123. Updated: 2023-12-01 06:35:39

I'm new to Pandas.

Consider this my DataFrame:

df

Search             Impressions  Clicks  Transactions  ContainsBest  ContainsFree  Country
Best phone         10           5       1             True          False         UK
Best free phone    15           4       2             True          True          UK
free phone         20           3       4             False         True          UK
good phone         13           1       5             False         False         US
just a free phone  12           3       4             False         True          US

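I have the columns ContainsBest and ContainsFree. I want to sum Impressions, Clicks, and Transactions over all rows where ContainsBest is True, then sum Impressions, Clicks, and Transactions over all rows where ContainsFree is True, and do the same for each unique value of the Country column. The new DataFrame would look like this: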

output_df

Country        Impressions  Clicks  Transactions
UK             45           12      7
ContainsBest   25           9       3
ContainsFree   35           7       6

US             25           4       9
ContainsBest   0            0       0
ContainsFree   12           3       4

To do this, I understand I need to use something like the following:

uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')

uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])

uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])

I would then apply the same logic to Clicks and Transactions, and repeat the same code for Country US.
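
As written, the three snippets above would fail: `df['Impressions'].sum()` returns a plain number, which has no `.where` method, and `&` binds more tightly than `==`, so each comparison needs its own parentheses or its own variable. A minimal sketch of the filter-then-aggregate form follows. The sample frame is rebuilt by hand from the table in the question, and the `is_uk` name is only illustrative.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    "Search": ["Best phone", "Best free phone", "free phone",
               "good phone", "just a free phone"],
    "Impressions": [10, 15, 20, 13, 12],
    "Clicks": [5, 4, 3, 1, 3],
    "Transactions": [1, 2, 4, 5, 4],
    "ContainsBest": [True, True, False, False, False],
    "ContainsFree": [False, True, True, False, True],
    "Country": ["UK", "UK", "UK", "US", "US"],
})

# Build the boolean mask first, select the rows with .loc, then aggregate
is_uk = df["Country"] == "UK"
uk_total_impressions = df.loc[is_uk, "Impressions"].sum()
uk_best_impressions = df.loc[is_uk & df["ContainsBest"], "Impressions"].sum()
uk_free_impressions = df.loc[is_uk & df["ContainsFree"], "Impressions"].sum()

print(uk_total_impressions, uk_best_impressions, uk_free_impressions)
```

This still needs one line per country, flag, and metric, which is the repetition the third question below asks how to avoid.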

The second thing I want to achieve is to add TopCategories columns for Impressions, Clicks, and Transactions for each Country, so that my final_output_df would look like this:

final_output_df

Country        Impressions  Clicks  Transactions  TopCategoriesForImpressions  TopCategoriesForClicks  TopCategoriesForTransactions
UK             45           12      7             ContainsFree                 ContainsBest            ContainsFree
ContainsBest   25           9       3             ContainsBest                 ContainsFree            ContainsBest
ContainsFree   35           7       6

US             25           4       9             ContainsFree                 ContainsFree            ContainsFree
ContainsBest   0            0       0
ContainsFree   12           3       4

The logic of the TopCategoriesForxx columns is a simple sort of the ContainsBest and ContainsFree rows under each Country. So TopCategoriesForImpressions for the UK is:

  1. ContainsFree
  2. ContainsBest

TopCategoriesForClicks for the UK is:

  1. ContainsBest
  2. ContainsFree

I know I need to use something like this:

TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')

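I'm finding it hard to put everything together so that the result looks like the final_output_df above. Also, I assume I don't need to create output_df; I only included it to make the steps toward final_output_df easier to follow.

So my questions are:

  1. How can I apply a calculation based on one or more conditions? See the ContainsBest and ContainsFree rows.
  2. How can I sort column values based on a condition? See TopCategoriesForImpressions.
  3. In reality I have 70 countries and 20 Containsxxx columns. Is there a way to achieve this without writing a condition for each of the 70 countries and 20 Containsxxx columns?

Any advice is much appreciated.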

Best answer

The first part of the solution could be:

import pandas as pd

# drop the unneeded Search column and add a ContainAll column filled with True
df1 = df.drop('Search', axis=1).assign(ContainAll=True)

#columns for tests
cols1 = ['Impressions','Clicks','Transactions']
cols2 = ['ContainsBest','ContainsFree','ContainAll']

print(df1[cols2].dtypes)
ContainsBest    bool
ContainsFree    bool
ContainAll      bool
dtype: object

print(df1[cols1].dtypes)
Impressions     int64
Clicks          int64
Transactions    int64
dtype: object

Melting stacks all boolean columns into a single mask column, with a Type column recording which flag each row came from:

print(df1.melt(['Country'] + cols1, var_name='Type', value_name='mask'))
   Country  Impressions  Clicks  Transactions          Type   mask
0       UK           10       5             1  ContainsBest   True
1       UK           15       4             2  ContainsBest   True
2       UK           20       3             4  ContainsBest  False
3       US           13       1             5  ContainsBest  False
4       US           12       3             4  ContainsBest  False
5       UK           10       5             1  ContainsFree  False
6       UK           15       4             2  ContainsFree   True
7       UK           20       3             4  ContainsFree   True
8       US           13       1             5  ContainsFree  False
9       US           12       3             4  ContainsFree   True
10      UK           10       5             1    ContainAll   True
11      UK           15       4             2    ContainAll   True
12      UK           20       3             4    ContainAll   True
13      US           13       1             5    ContainAll   True
14      US           12       3             4    ContainAll   True

Filtering with query('mask') keeps only the rows where the flag is True:

print(df1.melt(['Country'] + cols1, var_name='Type', value_name='mask').query('mask'))
   Country  Impressions  Clicks  Transactions          Type  mask
0       UK           10       5             1  ContainsBest  True
1       UK           15       4             2  ContainsBest  True
6       UK           15       4             2  ContainsFree  True
7       UK           20       3             4  ContainsFree  True
9       US           12       3             4  ContainsFree  True
10      UK           10       5             1    ContainAll  True
11      UK           15       4             2    ContainAll  True
12      UK           20       3             4    ContainAll  True
13      US           13       1             5    ContainAll  True
14      US           12       3             4    ContainAll  True

The complete pipeline, which also adds zero rows for combinations that matched nothing:

# all possible combinations of Country and boolean columns
mux = pd.MultiIndex.from_product([df['Country'].unique(), cols2],
                                 names=['Country','Type'])

# reshape with melt so all boolean columns become one mask column
# keep only the True rows with query, then aggregate with sum
# add the missing zero rows with reindex
df1 = (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask')
          .query('mask')
          .drop('mask', axis=1)
          .groupby(['Country','Type'])
          .sum()
          .reindex(mux, fill_value=0)
          .reset_index())
print(df1)
  Country          Type  Impressions  Clicks  Transactions
0      UK  ContainsBest           25       9             3
1      UK  ContainsFree           35       7             6
2      UK    ContainAll           45      12             7
3      US  ContainsBest            0       0             0
4      US  ContainsFree           12       3             4
5      US    ContainAll           25       4             9
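
For the third question (many countries and many Containsxxx columns), the same pipeline can be driven by column lists that are detected rather than typed out. The sketch below is self-contained and rests on two assumptions: every boolean column is a category flag, and every remaining numeric column is a metric. The sample frame is rebuilt by hand, and `flag_cols`, `metric_cols`, and `totals` are illustrative names that do not appear in the answer.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    "Search": ["Best phone", "Best free phone", "free phone",
               "good phone", "just a free phone"],
    "Impressions": [10, 15, 20, 13, 12],
    "Clicks": [5, 4, 3, 1, 3],
    "Transactions": [1, 2, 4, 5, 4],
    "ContainsBest": [True, True, False, False, False],
    "ContainsFree": [False, True, True, False, True],
    "Country": ["UK", "UK", "UK", "US", "US"],
})

df1 = df.drop(columns="Search").assign(ContainAll=True)

# Assumption: boolean columns are flags, the other numeric columns are metrics
flag_cols = df1.select_dtypes(include="bool").columns.tolist()
metric_cols = df1.select_dtypes(include="number").columns.tolist()

# Every Country/flag pair, so pairs with no matching rows still get a row
mux = pd.MultiIndex.from_product(
    [df1["Country"].unique(), flag_cols], names=["Country", "Type"]
)

totals = (
    df1.melt(id_vars=["Country"] + metric_cols, var_name="Type", value_name="mask")
       .query("mask")                  # keep rows where the flag is True
       .drop(columns="mask")
       .groupby(["Country", "Type"])
       .sum()
       .reindex(mux, fill_value=0)     # add zero rows for empty combinations
)
print(totals)
```

No country or flag name is written out in the pipeline itself, so adding more countries or more Containsxxx columns to the data should not require code changes as long as the two dtype assumptions hold.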

Second, you can filter the rows and then rank the categories within each group, using numpy.argsort in descending order:

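def f(x):
    # category labels (the Type index) of the current Country group
    i = x.index.to_numpy()
    # per column, order the labels from the largest value to the smallest
    a = i[(-x.to_numpy()).argsort(axis=0)]
    return pd.DataFrame(a, columns=x.columns)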
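
The indexing inside f can be hard to read at first. This small sketch isolates that one step in plain NumPy, using the UK category totals from the aggregated table; the variable names are illustrative only.

```python
import numpy as np

labels = np.array(["ContainsBest", "ContainsFree"])
# Columns: Impressions, Clicks, Transactions (UK rows of the aggregated table)
values = np.array([[25, 9, 3],
                   [35, 7, 6]])

# argsort sorts ascending, so negating the values gives, for each column,
# the row positions ordered from the largest value to the smallest
order = (-values).argsort(axis=0)

# Indexing the label array with those positions turns them into category names
ranked = labels[order]
print(ranked)
```

Each column of `ranked` is the ranking for one metric: the first row holds the top category and the second row the runner-up.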


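# keep only the category rows that are not all zero, then rank them per Country
# rank positions 0 and 1 are relabelled so the result can be joined on Type
df2 = (df1[df1['Type'].isin(['ContainsBest','ContainsFree']) &
           ~df1[cols1].eq(0).all(axis=1)]
       .set_index('Type')
       .groupby('Country')[cols1]
       .apply(f)
       .add_prefix('TopCategoriesFor')
       .rename_axis(['Country','Type'])
       .rename({0:'ContainsBest', 1:'ContainsFree'})
       )
print(df2)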
                     TopCategoriesForImpressions TopCategoriesForClicks  \
Country Type
UK      ContainsBest                ContainsFree           ContainsBest
        ContainsFree                ContainsBest           ContainsFree
US      ContainsBest                ContainsFree           ContainsFree

                     TopCategoriesForTransactions
Country Type
UK      ContainsBest                 ContainsFree
        ContainsFree                 ContainsBest
US      ContainsBest                 ContainsFree

Finally, join the rankings back onto the aggregated frame:

df3 = df1.join(df2, on=['Country','Type'])
print(df3)
  Country          Type  Impressions  Clicks  Transactions  \
0      UK  ContainsBest           25       9             3
1      UK  ContainsFree           35       7             6
2      UK    ContainAll           45      12             7
3      US  ContainsBest            0       0             0
4      US  ContainsFree           12       3             4
5      US    ContainAll           25       4             9

  TopCategoriesForImpressions TopCategoriesForClicks  \
0                ContainsFree           ContainsBest
1                ContainsBest           ContainsFree
2                         NaN                    NaN
3                ContainsFree           ContainsFree
4                         NaN                    NaN
5                         NaN                    NaN

  TopCategoriesForTransactions
0                 ContainsFree
1                 ContainsBest
2                          NaN
3                 ContainsFree
4                          NaN
5                          NaN
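
If the positional relabelling in df2 feels indirect, the ranking can also be expressed with sort_values. This is an alternative sketch and not the method used in the answer above. It starts from a hand-built copy of the category totals (without the ContainAll rows) and stores each ranking as a list; `totals`, `metrics`, `nonzero`, and `ranking` are illustrative names.

```python
import pandas as pd

# Hand-built copy of the aggregated category rows shown earlier
totals = pd.DataFrame({
    "Country": ["UK", "UK", "US", "US"],
    "Type": ["ContainsBest", "ContainsFree", "ContainsBest", "ContainsFree"],
    "Impressions": [25, 35, 0, 12],
    "Clicks": [9, 7, 0, 3],
    "Transactions": [3, 6, 0, 4],
})
metrics = ["Impressions", "Clicks", "Transactions"]

# Drop categories that matched no rows at all (every metric is zero)
nonzero = totals[~totals[metrics].eq(0).all(axis=1)]

# For each country and metric, list the categories from largest to smallest
ranking = {
    country: {
        m: group.sort_values(m, ascending=False)["Type"].tolist()
        for m in metrics
    }
    for country, group in nonzero.groupby("Country")
}

print(ranking["UK"]["Impressions"])
print(ranking["US"]["Clicks"])
```

Keeping the rankings as lists avoids tying a rank position to a category label, at the cost of a nested dictionary rather than a frame that can be joined directly.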

Regarding "python - Applying calculations to filtered values in a Pandas DataFrame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59682482/
