python - python 中的 sklearn `MemoryError`

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 15:52:29

I am trying to implement a simple machine learning application with Python 2.7 and scipy 0.18.1. I share the sample code below, together with a download link for the training data, so you can copy, paste, and run it. My problem is that I get a `MemoryError` at this line:

predicted = model.predict_proba(test_data[features])

I searched the internet but could not fix it. Thanks for your help.

You can find the sample data at this link: https://www.kaggle.com/c/sf-crime/data

import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load Data with pandas, and parse the first column into datetime
train = pd.read_csv('train.csv', parse_dates=['Dates'])
test = pd.read_csv('test.csv', parse_dates=['Dates'])

# Convert crime labels to numbers
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.Category)

# Get binarized weekdays, districts, and hours.
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour)

# Build new array
train_data = pd.concat([hour, days, district], axis=1)
train_data['crime'] = crime

# Repeat for test data
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)

hour = test.Dates.dt.hour
hour = pd.get_dummies(hour)

test_data = pd.concat([hour, days, district], axis=1)


features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

training, validation = train_test_split(train_data, train_size=.60)
model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

# Logistic Regression for comparison
model = LogisticRegression(C=.01)
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted)

model = BernoulliNB()
model.fit(train_data[features], train_data['crime'])
predicted = model.predict_proba(test_data[features]) #MemoryError!!!!

# Write results
result = pd.DataFrame(predicted, columns=le_crime.classes_)
result.to_csv('testResult.csv', index=True, index_label='Id')

Edit: the error stack trace was attached as a screenshot (not reproduced here).

Best answer

What if you try predicting in chunks? For example, you could try:

# Split the test set into N_split pieces and predict each piece separately,
# so only one chunk's probability matrix is computed at a time
N_split = 10
split_data = np.array_split(test_data[features], N_split)
split_predicted = []
for data in split_data:
    split_predicted.append(model.predict_proba(data))

predicted = np.concatenate(split_predicted)
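The chunking idea above can be checked on a small synthetic dataset: chunked `predict_proba` calls concatenate to exactly the same result as one full-batch call, so splitting only trades peak memory for a loop. This is a minimal, self-contained sketch; the array sizes and the 5 pretend classes are illustrative, not taken from the SF Crime data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X_train = rng.randint(0, 2, size=(1000, 17))  # binary features, like the dummies above
y_train = rng.randint(0, 5, size=1000)        # 5 illustrative class labels
X_test = rng.randint(0, 2, size=(500, 17))

model = BernoulliNB()
model.fit(X_train, y_train)

# Predict in chunks instead of all at once
N_split = 10
chunks = np.array_split(X_test, N_split)
predicted = np.concatenate([model.predict_proba(c) for c in chunks])

# The chunked result matches a single full-batch call
assert np.allclose(predicted, model.predict_proba(X_test))
```

If even the concatenated result is too large to hold, each chunk's probabilities could instead be appended to the CSV as they are computed, so the full array never lives in memory at once.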

Regarding "python - sklearn `MemoryError` in python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41149490/
