
python - How to automatically compute the natural logarithm of DataFrame columns?

Reposted. Author: 行者123. Updated: 2023-12-04 03:33:59

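I have to create columns that hold the natural logarithm of other columns in my dataset. There are too many columns (features) to do this by hand, so I want to automate it, but the for loop I tried did not work. This is the list of columns, which I call "features":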

features=['price_seat',
'days_length_of_stay',
'days_to_departure',
'distance',
'unit_cost_brute',
'unit_cost_clip',
'unit_cost_mean',
'unit_cost',
'org_country_gdp_per_capita',
'dst_country_gdp_per_capita',
'competing_airline',
#'yield',
'price_seat_cluster',
'yield_cluster',
'low_cost',
#'PAX',
#'REVENUE',
'LOCAL_PAX',
'BEHIND_PAX',
'BEYOND_PAX',
'BRIDGE_PAX',
'LOCAL_REVENUE',
'BEHIND_REVENUE',
'BEYOND_REVENUE',
'BRIDGE_REVENUE',
'REVENUE_WITH_TAXES',
'LOCAL_REVENUE_WITH_TAXES',
'BRIDGE_REVENUE_WITH_TAXES',
'BEHIND_REVENUE_WITH_TAXES',
'BEYOND_REVENUE_WITH_TAXES',
'PERIOD',
'n_flights_month',
'avg_flights_month',
'flights_month',
#'pax_flight',
'revenue_flight',
#'revenue_pax',
'WTI',
'Brent',
'Jet_fuel',
'OilPrice_USD_bbl',
'FuelPrice_USD_USgal',
'Density',
'Cf_USD_kg',
'd_fr24',
'distance_fr']

This is the code I have used, and it works:

df = df9.withColumn('ln_price_seat', F.log('price_seat'))\
    .withColumn('ln_days_length_of_stay', F.log('days_length_of_stay'))\
    .withColumn('ln_days_to_departure', F.log('days_to_departure'))\
    .withColumn('ln_distance', F.log('distance'))\
    .withColumn('ln_unit_cost_brute', F.log('unit_cost_brute'))\
    .withColumn('ln_unit_cost_clip', F.log('unit_cost_clip'))\
    .withColumn('ln_unit_cost_mean', F.log('unit_cost_mean'))

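But this is too "manual" for so many features, and I may change the features in the future, so I need something that can cope with that. On top of that, my DataFrame is very large, about 50M or more. Before writing the code above, I was able to run this procedure: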

from pyspark.sql import functions as F

def get_log_features(self, df):
    # Columns whose natural logarithm should be added
    features = ['price_seat',
                'days_length_of_stay',
                'days_to_departure',
                'distance',
                'unit_cost_brute',
                'unit_cost_clip',
                'unit_cost_mean',
                'unit_cost',
                'org_country_gdp_per_capita',
                'dst_country_gdp_per_capita',
                'competing_airline',
                'price_seat_cluster',
                'yield_cluster',
                'low_cost',
                'LOCAL_PAX',
                'BEHIND_PAX',
                'BEYOND_PAX',
                'BRIDGE_PAX',
                'LOCAL_REVENUE',
                'BEHIND_REVENUE',
                'BEYOND_REVENUE',
                'BRIDGE_REVENUE',
                'REVENUE_WITH_TAXES',
                'LOCAL_REVENUE_WITH_TAXES',
                'BRIDGE_REVENUE_WITH_TAXES',
                'BEHIND_REVENUE_WITH_TAXES',
                'BEYOND_REVENUE_WITH_TAXES',
                'PERIOD',
                'n_flights_month',
                'avg_flights_month',
                'flights_month',
                'revenue_flight',
                'WTI',
                'Brent',
                'Jet_fuel',
                'OilPrice_USD_bbl',
                'FuelPrice_USD_USgal',
                'Density',
                'Cf_USD_kg',
                'd_fr24',
                'distance_fr']

    features_for_log = features
    # Keep only the feature columns plus the join keys
    df_log = df.select(*features_for_log, 'org_airport', 'dst_airport', 'd_year', 'd_month')

    # Add one ln_<name> column per feature
    for new_col in features_for_log:
        df_log = df_log.withColumn('ln_' + new_col, F.log(F.col(new_col)))

    # Drop the original features so only the keys and the ln_ columns remain
    df_log = df_log.drop(*features_for_log)

    # Join the ln_ columns back onto the original DataFrame by key
    df = df.join(df_log, ['org_airport', 'dst_airport', 'd_year', 'd_month'], how='outer')
    return df

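But calling this function takes hours; it is too computationally expensive. That is why I would like to "append" the natural logarithm of the columns listed in features to the original DataFrame directly, which should also be cheaper.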

Do you have any suggestions?

Best answer

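The simplest and fastest approach is the one you already described: add the log columns to the DataFrame in a single select, with no join: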

cols = [F.col(col) for col in df.columns]
ln_cols = [F.log(col).alias(f"ln_{col}") for col in features_for_log]
df = df.select(cols + ln_cols)

For "python - How to automatically compute the natural logarithm of DataFrame columns?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67303469/
