
python - MemoryError and separating training from deployment


I am trying to separate the training and deployment parts of my program, because compiling/running it takes a very long time.

Someone suggested that I use pickle dump and load to separate the training and deployment parts. I tried it, but it did not work.

import pickle

import numpy as np
import pandas as pd
from scipy.stats import spearmanr as Spearman
from sklearn import linear_model
from sklearn.linear_model import LinearRegression as LinReg

# NOTE: util, Perplexity and the fill_*_column helpers are project-local
# modules from the question; their import paths are assumed here.
import util
from perplexity import Perplexity
from features import (fill_sentence_column, fill_total_words_column,
                      fill_unique_words_column, fill_spelling_column,
                      fill_pos_columns, fill_tfidf_column)


def main():
    print "Fetching data..."
    train_df = util.get_training_data('../data/training_set_rel3.tsv')
    valid_df = util.get_validation_data('../data/valid_set.tsv')

    print "Standardizing scores..."
    train_df, valid_df = util.append_standardized_column(train_df, valid_df, 'score')

    print "Calculating perplexity feature..."
    train_df, valid_df = Perplexity().fill_perplexity_columns(train_df, valid_df)

    print "Calculating number of sentences feature..."
    train_df, valid_df = fill_sentence_column(train_df, valid_df)

    print "Cleaning for spelling and word count..."
    # cleaned up data for spelling feature
    vectorizer_train_spelling = util.vectorizer_clean_spelling(train_df)
    train_essays_spelling = vectorizer_train_spelling['essay'].values
    vectorizer_valid_spelling = util.vectorizer_clean_spelling(valid_df)
    valid_essays_spelling = vectorizer_valid_spelling['essay'].values

    print "Calculating total words feature..."
    train_df, valid_df = fill_total_words_column(train_df, valid_df, train_essays_spelling, valid_essays_spelling)

    print "Calculating unique words feature..."
    train_df, valid_df = fill_unique_words_column(train_df, valid_df, train_essays_spelling, valid_essays_spelling)

    print "Calculating spelling feature..."
    # spelling feature
    train_df, valid_df = fill_spelling_column(train_df, valid_df, train_essays_spelling, valid_essays_spelling)

    print "Calculating pos tags features..."
    train_df, valid_df = fill_pos_columns(train_df, valid_df)

    print "Cleaning for TFIDF..."
    # cleaned up data for tfidf vector feature
    vectorizer_train = util.vectorizer_clean(train_df)
    train_essays = vectorizer_train['essay'].values
    vectorizer_valid = util.vectorizer_clean(valid_df)
    valid_essays = vectorizer_valid['essay'].values

    print "Calculating TFIDF features with unigram..."
    train_df, valid_df = fill_tfidf_column(train_df, valid_df, train_essays, valid_essays, 1)

    # print "Calculating TFIDF features with trigram..."
    # train_df, valid_df = fill_tfidf_column(train_df, valid_df, train_essays, valid_essays, 3)

    print train_df.head()
    print valid_df.head()

    COLS = ['essay_set', 'spelling_correct', 'std_sentence_count', 'std_unique_words', 'std_total_words',
            'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRT', 'PRON', 'VERB', '.', 'X', 'std_perplexity',
            'std_score']

    train_df = train_df[COLS].join(train_df.filter(regex=("tfidf_*")))
    valid_df = valid_df[COLS].join(valid_df.filter(regex=("tfidf_*")))

    print train_df.shape
    print valid_df.shape

    max_essay_set = max(train_df['essay_set'])

    linreg_scores_df = pd.DataFrame(columns=['essay_set', 'p', 'spearman'])
    lasso_scores_df = pd.DataFrame(columns=['essay_set', 'alpha', 'p', 'spearman'])
    ridge_scores_df = pd.DataFrame(columns=['essay_set', 'alpha', 'p', 'spearman'])

    alphas = [x * 1.0 / 20 for x in range(20, 0, -1)]

    for i in range(1, max_essay_set + 1):
        print ""

        train_x = np.asarray((train_df[train_df['essay_set'] == i]).drop(['essay_set', 'std_score'], axis=1))
        train_std_scores = np.asarray((train_df[train_df['essay_set'] == i])['std_score'], dtype="|S6").astype(float)

        regr = LinReg(fit_intercept=False, copy_X=False)
        regr.fit(train_x, train_std_scores)

        valid_x = np.asarray((valid_df[valid_df['essay_set'] == i]).drop(['essay_set', 'std_score'], axis=1))
        valid_pred_std_scores = regr.predict(valid_x)

        linreg_spear, p = Spearman(a=(valid_df[valid_df['essay_set'] == i])["std_score"], b=valid_pred_std_scores)
        linreg_scores_df = linreg_scores_df.append({'essay_set': i, 'p': p, 'spearman': linreg_spear},
                                                   ignore_index=True)

        print "Linear for Essay Set " + str(i) + ":", linreg_spear

        for a in alphas:
            ridge = linear_model.Ridge(alpha=a)
            ridge.fit(train_x, train_std_scores)
            valid_pred_std_scores_ridge = ridge.predict(valid_x)

            ridge_spear, p = Spearman(a=(valid_df[valid_df['essay_set'] == i])["std_score"],
                                      b=valid_pred_std_scores_ridge)
            ridge_scores_df = ridge_scores_df.append({'essay_set': i, 'alpha': a, 'p': p, 'spearman': ridge_spear},
                                                     ignore_index=True)

            print "Alpha = " + str(a) + " Ridge for Essay Set " + str(i) + ":", ridge_spear

            lasso = linear_model.Lasso(alpha=a)
            lasso.fit(train_x, train_std_scores)
            valid_pred_std_scores_lasso = lasso.predict(valid_x)

            lasso_spear, p = Spearman(a=(valid_df[valid_df['essay_set'] == i])["std_score"],
                                      b=valid_pred_std_scores_lasso)
            lasso_scores_df = lasso_scores_df.append({'essay_set': i, 'alpha': a, 'p': p, 'spearman': lasso_spear},
                                                     ignore_index=True)

            print "Alpha = " + str(a) + " Lasso for Essay Set " + str(i) + ":", lasso_spear

    print linreg_scores_df
    print ridge_scores_df
    print lasso_scores_df

    linreg_scores_df.to_pickle('linreg_scores-01.pickle')
    ridge_scores_df.to_pickle('ridge_scores-01.pickle')
    lasso_scores_df.to_pickle('lasso_scores-01.pickle')

    s1 = pickle.dumps(linreg_scores_df)
    clf1 = pickle.loads(s1)

    s2 = pickle.dumps(ridge_scores_df)
    clf2 = pickle.loads(s2)

    s3 = pickle.dumps(lasso_scores_df)
    clf3 = pickle.loads(s3)


if __name__ == '__main__':
    main()

Isn't this the correct way to use load and dump? When I run the code I get a MemoryError, and training starts from scratch every time. How can I separate training from deployment?

Best Answer

Problems:

  • Your code trains on the data every time it runs
  • You are using the wrong pickling "technique"

Solution:

  • Put the data-training part into a function.

    • At startup, check whether a pickled training file already exists:
      • If it does: load it and use it
      • If it does not: call the function that preprocesses the training data and pickles the result (see the sketch after this list)
  • You are pickling pandas DataFrames, so you need to use the proper (pandas) methods for pickling/loading - not the "raw" pickling methods from the pickle module
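A minimal sketch of that load-or-train pattern, assuming a hypothetical build_features() wrapper around the expensive preprocessing in main() above and an assumed file name:

import os

import pandas as pd

TRAIN_PICKLE = 'train_features.pkl'  # assumed file name

def get_training_features():
    # Cheap path: reuse the pickled result of an earlier run.
    if os.path.exists(TRAIN_PICKLE):
        return pd.read_pickle(TRAIN_PICKLE)
    # Expensive path: run the preprocessing once and persist the result.
    train_df = build_features()  # hypothetical wrapper around the feature code above
    train_df.to_pickle(TRAIN_PICKLE)
    return train_df

With this in place, only the first run pays the preprocessing cost; every later run (including a separate deployment script) just loads the file.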

You can read more about the pandas pickling functions (currently 0.24.x) here:

import pandas as pd

original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
original_df.to_pickle("./dummy.pkl")          # write pickle
unpickled_df = pd.read_pickle("./dummy.pkl")  # read pickle
print(unpickled_df)

Output:

   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

You are using the loads method from pickle itself. It deserializes content that is supplied directly as a string (not a string containing a file name).

After switching to the pandas methods it should work: you put the processed content into a jar, i.e. a file (df.to_pickle), store it on a shelf (the hard disk), and whenever you are hungry (need the data), you take it down, open it (pd.read_pickle), and use it.
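To make the difference concrete, here is a small sketch (with a hypothetical scores_df) contrasting the in-memory dumps/loads round trip from the question, which never touches the disk, with the file-based pandas calls that do persist the result between runs:

import pickle

import pandas as pd

scores_df = pd.DataFrame({'essay_set': [1], 'spearman': [0.7]})  # hypothetical stand-in

# What the question does: an in-memory round trip. `s` is a bytes string;
# nothing is written to disk, so the work is lost when the process exits.
s = pickle.dumps(scores_df)
same_df = pickle.loads(s)

# What actually persists the result between runs: write to a file once,
# then a separate deployment script can simply read it back.
scores_df.to_pickle('scores.pkl')
restored_df = pd.read_pickle('scores.pkl')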

For python - MemoryError and separating training from deployment, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56070379/
