
Python: How can I make machine learning predictions run faster in production?

Reposted. Author: 行者123. Updated: 2023-11-28 18:25:10

I have built a machine learning model in scikit-learn and need to deploy it to production with live data. For example, the features look like this:

date        event_id  user_id  feature1  feature2  featureX...
2017-01-27  100       5555     1.23      2         2.99
2017-01-27  100       4444     2.55      5         3.16
2017-01-27  100       3333     0.45      3         1.69
2017-01-27  105       1212     3.96      4         0.0
2017-01-27  105       2424     1.55      2         5.56
2017-01-27  105       3636     0.87      4         10.28

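So there are different events each day. Before an event starts, I pull its rows from the database into a DataFrame and compute predictions with the pickled scikit-learn model: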

import joblib  # older scikit-learn versions: from sklearn.externals import joblib

df_X = df.drop(['date', 'event_id', 'user_id'], axis=1)
loaded_model = joblib.load("model.joblib.dat")
prediction = loaded_model.predict_proba(df_X)

I then match the predictions back to df and send them to an API or write them to a file, as needed.

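Once an event starts, featureX is updated continuously, and I get the new values from an API. To apply the updates, I loop over each event_id and user_id, update df with the new featureX values, recompute the predictions, and send them to the output again.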

To do that, I am doing something like this:

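import time

import requests

# get list of unique event ids
events = set(df['event_id'].tolist())

try:
    while True:
        start = time.time()
        for event in events:
            # event ids are integers, so convert before building the URL
            featureX = requests.get(API_URL + str(event))
            featureX_json = featureX.json()

            # update featureX one user at a time
            for user in featureX_json['users']:
                df.loc[df.user_id == user['user_id'],
                       'featureX'] = user['featureX']

        # also exclude 'prediction', which exists from the second pass onward
        df_X = df.drop(['date', 'event_id', 'user_id', 'prediction'],
                       axis=1, errors='ignore')
        # predict_proba returns one column per class; column 1 is kept here,
        # assuming a binary classifier
        df['prediction'] = loaded_model.predict_proba(df_X)[:, 1]

        # send to API or write to file

        end = time.time()
        print('recalculation time {} secs'.format(end - start))

except KeyboardInterrupt:
    print('exiting !')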

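This works fine for me, but the whole prediction update takes about 4 seconds on the server, and I need it to be under 1 second. I am trying to figure out what I can change in the while loop to get the speedup I need.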

As requested, here is a sample of the JSON returned for event_id = 100 from the URL http://myapi/api/event_users/<event_id>:

{
    "count": 3,
    "users": [
        {
            "user_id": 4444,
            "featureY": 34,
            "featureX": 4.49,
            "created": "2017-01-17T13:00:09.065498Z"
        },
        {
            "user_id": 3333,
            "featureY": 22,
            "featureX": 1.09,
            "created": "2017-01-17T13:00:09.065498Z"
        },
        {
            "user_id": 5555,
            "featureY": 58,
            "featureX": 9.54,
            "created": "2017-01-17T13:00:09.065498Z"
        }
    ]
}

Best answer

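import time

import pandas as pd
import requests

# get list of unique event ids
events = df['event_id'].unique().tolist()

try:
    while True:     # I don't understand why you need this loop...
        start = time.time()
        for event in events:
            # event ids are integers, so convert before building the URL
            featureX = requests.get(API_URL + str(event))
            tmp = pd.DataFrame(featureX.json()['users'])

            # update featureX for all users of this event in one vectorized step
            df.loc[(df.event_id == event), 'featureX'] = \
                df.loc[df.event_id == event, 'user_id'] \
                  .map(tmp.set_index('user_id').featureX)

        # also exclude 'prediction', which exists from the second pass onward
        df_X = df.drop(['date', 'event_id', 'user_id', 'prediction'],
                       axis=1, errors='ignore')
        # predict_proba returns one column per class; column 1 is kept here,
        # assuming a binary classifier
        df['prediction'] = loaded_model.predict_proba(df_X)[:, 1]

        # send to API or write to file

        end = time.time()
        print('recalculation time {} secs'.format(end - start))

except KeyboardInterrupt:
    print('exiting !')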
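
A separate point from the DataFrame update: the loop above requests each event's data one after another, so network latency adds up across events. Whether that is where the 4 seconds go is not stated in the question, so treat the following as a sketch under that assumption. It uses the standard library's ThreadPoolExecutor to issue the requests concurrently. The names fetch_all and fake_fetch are hypothetical, and fake_fetch stands in for the real requests.get(...).json() call so that the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(events, fetch_one, max_workers=8):
    """Call fetch_one(event) for every event concurrently; return {event: result}."""
    events = list(events)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with events
        results = list(pool.map(fetch_one, events))
    return dict(zip(events, results))


def fake_fetch(event):
    # Hypothetical stand-in for requests.get(API_URL + str(event)).json();
    # the sleep mimics network latency.
    time.sleep(0.1)
    return {"count": 0, "users": []}


start = time.time()
payloads = fetch_all([100, 105, 110, 115], fake_fetch)
elapsed = time.time() - start
print('fetched {} events in {:.2f} secs'.format(len(payloads), elapsed))
```

In the real loop, fetch_one would wrap the HTTP call, and the DataFrame update would then iterate over the returned dictionary instead of calling the API inside the for loop.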

Demo: event_id == 100

First, let's create a DataFrame from your JSON object:

tmp = pd.DataFrame(featureX_json['users'])

In [33]: tmp
Out[33]:
                       created  featureX  featureY  user_id
0  2017-01-17T13:00:09.065498Z      4.49        34     4444
1  2017-01-17T13:00:09.065498Z      1.09        22     3333
2  2017-01-17T13:00:09.065498Z      9.54        58     5555

Now we can get rid of the for user in featureX_json['users']: loop:

In [29]: df.loc[df.event_id == 100, 'featureX'] = \
             df.loc[df.event_id == 100, 'user_id'].map(tmp.set_index('user_id').featureX)

In [30]: df
Out[30]:
         date  event_id  user_id  feature1  feature2  featureX
0  2017-01-27       100     5555      1.23         2      9.54   # 2.99 -> 9.54
1  2017-01-27       100     4444      2.55         5      4.49   # 3.16 -> 4.49
2  2017-01-27       100     3333      0.45         3      1.09   # 1.69 -> 1.09
3  2017-01-27       105     1212      3.96         4      0.00
4  2017-01-27       105     2424      1.55         2      5.56
5  2017-01-27       105     3636      0.87         4     10.28
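
In the demo, every user of event 100 appears in the JSON payload. If a user were missing from a response, map would return NaN for that row and the assignment would overwrite the existing featureX with it. Below is a minimal sketch of one way to guard against that, assuming the old value should be kept. The helper update_feature_x is hypothetical, the payload with user 3333 left out is made up for illustration, and user_id values are assumed to be unique within a payload.

```python
import pandas as pd

# Sample rows taken from the question's table.
df = pd.DataFrame({
    "event_id": [100, 100, 100, 105, 105, 105],
    "user_id": [5555, 4444, 3333, 1212, 2424, 3636],
    "featureX": [2.99, 3.16, 1.69, 0.0, 5.56, 10.28],
})

# Hypothetical payload for event 100 in which user 3333 is missing.
users = [
    {"user_id": 4444, "featureX": 4.49},
    {"user_id": 5555, "featureX": 9.54},
]


def update_feature_x(df, event_id, users):
    """Update featureX for one event in place, keeping old values
    for users that are absent from the payload."""
    new_values = pd.DataFrame(users).set_index("user_id")["featureX"]
    mask = df["event_id"] == event_id
    mapped = df.loc[mask, "user_id"].map(new_values)
    # map() yields NaN for user_ids not in the payload;
    # fall back to the value already stored in df.
    df.loc[mask, "featureX"] = mapped.fillna(df.loc[mask, "featureX"])
    return df


update_feature_x(df, 100, users)
print(df)
```

Whether keeping the stale value is the right behavior depends on the application; dropping or flagging such rows before prediction would be an equally reasonable choice.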

Regarding "Python: How can I make machine learning predictions run faster in production?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41891978/
