gpt4 book ai didi

python - 长度必须匹配才能比较( Pandas 根据两个标准进行选择)

转载 作者:行者123 更新时间:2023-12-01 08:03:18 25 4
gpt4 key购买 nike

我正在为每个用户生成值,如下所示:

loDf = locDfs[user] # locDfs is a copy of locationDf elsewhere in the code... sorry for all the variable names.
loDf.reset_index(inplace=True)
loDf = pd.crosstab([loDf.date, loDf.uid], loDf.location_id)
loDf.reset_index(inplace=True)

loDf.set_index('date', inplace=True)
loDf.drop('uid', axis=1, inplace=True)

# join the location crosstab columns with the app crosstab columns per user
userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
# convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
userLocAppDfs[user]['uid'].fillna(user, inplace=True)

这会获取位置数据并将 location_id 转换为列,并将其与时间序列中的其他数据组合。

这基本上涵盖了数据的 reshape 。然后我需要标准化,为此,我需要查看每个列的值:

for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():

完整的函数如下:

def normalize(inputMetricDf, inputLocationDf):
'''
normalize, resample, and combine data into a single data source
'''
metricDf = inputMetricDf.copy()
locationDf = inputLocationDf.copy()

appDf = metricDf[['date', 'uid', 'app_id', 'metric']].copy()

locDf = locationDf[['date', 'uid', 'location_id']]
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('uid'):
locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)

userLocAppDfs = {}
user = ''
for uid, a2_df in aDf.groupby('uid'):
user = uid
# per user, convert app data to 15m interval
userDf = a2_df.resample('15T').agg('max')

# assign metric for each app to an app column for each app, per user
userDf.reset_index(inplace=True)
userDf = pd.crosstab(index=userDf['date'], columns=userDf['app_id'],
values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')

userDf['uid'] = user

userDf.reset_index(inplace=True)
userDf.set_index('date', inplace=True)

# reapply 15m intervals now that we have new data per app
userLocAppDfs[user] = userDf.resample('15T').agg('max')

# assign location data to location columns per location, creates a "1" at the 15m interval of the location change event in the location column created
loDf = locDfs[user]
loDf.reset_index(inplace=True)
loDf = pd.crosstab([loDf.date, loDf.uid], loDf.location_id)
loDf.reset_index(inplace=True)

loDf.set_index('date', inplace=True)
loDf.drop('uid', axis=1, inplace=True)

# join the location crosstab columns with the app crosstab columns per user
userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
# convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
userLocAppDfs[user]['uid'].fillna(user, inplace=True)

for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():
# fill location NaNs
userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(
np.nan, 0)

# fill app NaNs
for app in a2_df['app_id'].unique():
userLocAppDfs[user][app].interpolate(
method='linear', limit_area='inside', inplace=True)
userLocAppDfs[user][app].fillna(value=0, inplace=True)

df = userLocAppDfs[user].copy()

# ensure actual normality
alpha = 0.05
for app in aDf['app_id'].unique():
_, p = normaltest(userLocAppDfs[user][app])
if(p > alpha):
raise DataNotNormal(args=(user, app))

# for loc in userLocAppDfs[user]:
# could also test location data

return df

但这会产生错误:

  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 346, in run_http_function
result = _function_handler.invoke_user_function(flask.request)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 223, in invoke_user_function
loop.run_until_complete(future)
File "/opt/python3.7/lib/python3.7/asyncio/base_events.py", line 573, in run_until_complete
return future.result()
File "/user_code/main.py", line 31, in default_model
train, endog, exog, _, _, rawDf = preprocess(ledger, apps)
File "/user_code/Wrangling.py", line 67, in preprocess
rawDf = normalize(appDf, locDf)
File "/user_code/Wrangling.py", line 185, in normalize
for loc in locationDf[(locationDf['location_id'] in loDf.columns.values) & (locationDf['uid'] == user)].location_id.unique():
File "/env/local/lib/python3.7/site-packages/pandas/core/ops.py", line 1745, in wrapper
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare

在我注意到我可能会因为 reshape 而丢失locationsDf中的位置之前,我只是在做:

    for loc in locationDf[locationDf['uid'] == user].location_id.unique():

并且这对所有其他实例都有效。但是,如果在同一个 15t 时间段中有两个位置,并且其中一个仅出现在那里,但由于 15t 窗口而被删除,那么它会给我一个错误。所以我需要另一个条件。

locationDf['location_id'] 只是一个字符串,就像交叉表列名一样。

为什么这会引发错误?

尝试回答的错误:

    for loc in locationDf[(locationDf['location_id'].isin(loDf.columns.values)) & (locationDf['uid'].isin([user])), 'location_id'].unique():
File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "/env/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 110, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(0 True
1 True
2 True
3 False
4 True
5 True
6 False
7 True
8 True
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 True
20 True
21 True
22 True
23 True
24 True
25 True
26 False
27 True
28 True
29 False
...
210 False
211 False
212 False
213 False
214 False
215 False
216 False
217 False
218 True
219 True
220 False
221 False
222 False
223 False
224 False
225 False
226 True
227 False
228 True
229 False
230 False
231 True
232 False
233 True
234 False
235 False
236 False
237 True
238 False
239 False
Length: 240, dtype: bool, 'location_id')' is an invalid key

最佳答案

将条件更改为(使用isin)

locationDf.loc[(locationDf['location_id'].isin(loDf.columns.values)) 
& (locationDf['uid'].isin(user)),'location_id'].unique()

更新

con1 = (locationDf['location_id'].isin(loDf.columns.values)
con2 = (locationDf['uid'].isin(pd.Series(user))

locationDf.loc[con1&con2,'location_id'].unique()

关于python - 长度必须匹配才能比较( Pandas 根据两个标准进行选择),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55643042/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com