python - R2 score for test and train data is coming out as 0

Reposted. Author: 行者123. Updated: 2023-11-30 10:00:16
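I am trying to implement a Lasso model for house pricing, but it gives an r2_score of 0.00 on both the test and the training data. Can anyone help me figure out where I am going wrong?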

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

train = pd.read_csv(r"C:\train - Copy.csv")
test = pd.read_csv(r"C:\test.csv")

# CHECKING SHAPE OF THE DATA
print("Train Data shape", train.shape,"\n Test Data shape",test.shape)

# Save the 'Id' column
train_ID = train['id']
test_ID = test['id']

# Dropping columns which have no impact on the data
train = train.drop(['id', 'thumbnail_url'], axis=1)
test = test.drop(['id', 'thumbnail_url'], axis=1)

# Check data size after dropping no impact variables
print("\nThe train data size after dropping features is : {} ".format(train.shape))
print("The test data size after dropping features is : {} ".format(test.shape))

# Checking Categorical Data
C_data = train.select_dtypes(include=['object']).columns
print("Categorical Data", C_data)

# Checking Numerical Data
N_data = train.select_dtypes(include=['int64', 'float64']).columns
print("Numerical Data", N_data)

# Combining Datasets
ntrain = train.shape[0]
ntest = test.shape[0]
#y_train = train.log_price.values
y = train.log_price.values

print(ntrain)
print(ntest)
print(y)

all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
print(all_data.shape)
all_data = all_data.drop(['log_price'], axis=1)
print(all_data.shape)


# Find Missing Ratio of Dataset
null_values = all_data.isnull().sum()
# print(null_values)

# IMPUTING NULL VALUES
all_data = all_data.dropna(subset=['host_since'])
all_data['bathrooms'] = all_data['bathrooms'].fillna(all_data['bathrooms'].mean())
all_data['bedrooms'] = all_data['bedrooms'].fillna(all_data['bedrooms'].mean())
all_data['beds'] = all_data['beds'].fillna(all_data['beds'].mean())
all_data['review_scores_rating'] = all_data['review_scores_rating'].fillna(all_data['review_scores_rating'].mean())
all_data['host_response_rate'] = all_data['host_response_rate'].fillna('None')
all_data['neighbourhood'] = all_data['neighbourhood'].fillna('None')
all_data['host_has_profile_pic'] = all_data['host_has_profile_pic'].fillna('f')
all_data['host_identity_verified'] = all_data['host_identity_verified'].fillna('f')
all_data['description'] = all_data['description'].fillna('None')
all_data['first_review'] = all_data['first_review'].fillna('None')
all_data['last_review'] = all_data['last_review'].fillna('None')
all_data['name'] = all_data['name'].fillna('None')
all_data['zipcode'] = all_data['zipcode'].fillna('None')

# Check if Missing values left
post_null_values = all_data.isnull().sum().sum()
print("post_null_values\n", post_null_values)

print("-----------------------------------------------------------------------------------------------")

# apply LabelEncoder to categorical features
from sklearn.preprocessing import LabelEncoder
cols = ('property_type', 'room_type', 'amenities', 'bed_type',
        'cancellation_policy', 'city', 'description', 'first_review',
        'host_has_profile_pic', 'host_identity_verified', 'host_response_rate',
        'host_since', 'instant_bookable', 'last_review', 'name',
        'neighbourhood', 'zipcode')
for c in cols:
    # Fit one encoder per column and replace the column with integer codes
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))

# creating matrices for sklearn:
X = all_data[:ntrain]
test_values = all_data[ntrain:]

print("X col", X.columns, "X shape", X.shape)

# import train test split
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#clf = LinearRegression()
clf = Lasso()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)

from sklearn.metrics import r2_score

print("Train acc: " , r2_score(y_train, y_train_pred))
print("Test acc: ", r2_score(y_test, y_pred))

from sklearn.metrics import mean_squared_error

print("Train acc: " , clf.score(X_train, y_train))
print("Test acc: ", clf.score(X_test, y_test))

Output:

Train acc: 0.0001732000413904311
Test acc: 0.00011093390171657003
Train acc: 0.0001732000413904311
Test acc: 0.00011093390171657004

Best answer

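This is not unusual. Your regression is simply performing very poorly, and I suggest you rework it. For regression, also note that you can get a negative r2_score: How is the R2 value in Scikit learn calculated?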
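To make the point about negative scores concrete, here is a minimal sketch on made-up numbers (the arrays are illustrative and not taken from the question's data). R2 is 1 - SS_res / SS_tot, so always predicting the mean gives 0, and anything worse than that goes negative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target values, invented for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean of y_true for every sample: SS_res == SS_tot, so R2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are worse than the mean: SS_res > SS_tot, so R2 < 0
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = r2_score(y_true, bad_pred)

print("R2 when predicting the mean:", r2_mean)
print("R2 for reversed predictions:", r2_bad)
```

A score near 0, as in the question, therefore means the model is doing about as well as a constant prediction of the mean.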

You could also use a dummy model to see what your baseline is. What if you predict every y_test value as the mean of all y_train values? Which R2 do you get then?
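A rough sketch of that baseline check, assuming scikit-learn's DummyRegressor and synthetic data standing in for the housing CSVs (X, y, and the coefficient 2.0 are invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and log_price target (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The baseline ignores X and always predicts the mean of y_train
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)

r2_train = r2_score(y_train, baseline.predict(X_train))
r2_test = r2_score(y_test, baseline.predict(X_test))

print("Baseline train R2:", r2_train)
print("Baseline test R2:", r2_test)
```

If the Lasso scores are no better than these baseline numbers, the model has learned essentially nothing beyond the mean of the target.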

Some ideas for tuning it: experiment with the features (leave some out), choose another regression model, and so on.
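One example of such a tweak, offered as an assumption rather than something verified against the original data: Lasso's default alpha=1.0 can be strong enough to shrink every coefficient to zero when the target has small variance, which leaves the model predicting the mean and R2 at about 0. The sketch below uses synthetic data and compares the default with scaled features plus a smaller alpha; the value 0.01 and the feature scales are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a small-variance target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 0.01])
y = 0.5 * X[:, 0] - 0.003 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default penalty (alpha=1.0): strong enough here to zero out every coefficient
strong = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
strong.fit(X_train, y_train)

# Much weaker penalty (alpha=0.01, an arbitrary illustrative value)
weak = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
weak.fit(X_train, y_train)

strong_coef = strong.named_steps["lasso"].coef_
strong_train_r2 = strong.score(X_train, y_train)
weak_test_r2 = weak.score(X_test, y_test)

print("alpha=1.0 coefficients:", strong_coef)
print("alpha=1.0 train R2:", strong_train_r2)
print("alpha=0.01 test R2:", weak_test_r2)
```

In practice alpha would be chosen by cross-validation (for instance with LassoCV) rather than by hand, and whether this helps on the real dataset depends on how informative the label-encoded features are.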

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59299388/
