
python - Machine learning: Predict second dataset on behalf of first dataset trained classifier

Reposted. Author: 行者123. Updated: 2023-11-30 09:14:08

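I am new to machine learning and have tried to apply the approach from another question, but it is not clear to me. I have been trying for two months, so please help me resolve my error.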

What I am actually trying to do:

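  1. Train an SVM classifier on TRAIN_features and TRAIN_labels extracted from TRAIN_dataset, which has shape (98962,) and size 98962.
  2. Test the SVM classifier on TEST_features extracted from another dataset, TEST_dataset, which has the same shape (98962,) and size 98962 as TRAIN_dataset.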

After preprocessing both TRAIN_features and TEST_features, I vectorized each of them with TfidfVectorizer. I then computed the shape and size of both again:

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

processed_TRAIN_features now has size 1032665 and shape (98962, 9434).

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

processed_TEST_features now has size 1457961 and shape (98962, 10782).

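I know that when I train the SVM classifier on processed_TRAIN_features and then predict on processed_TEST_features with the same classifier, it raises an error, because the two feature matrices now differ in shape and size.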
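The mismatch described above can be reproduced on a tiny scale. The sketch below uses made-up toy sentences (not the asker's data) and default TfidfVectorizer settings; it only illustrates that two vectorizers fitted separately learn separate vocabularies, while calling transform on the vectorizer that was fitted on the training text keeps the column count the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical); the real inputs would be the cleaned text lists.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["good plot twist", "terrible acting and bad pacing"]

# Two independently fitted vectorizers learn two different vocabularies,
# so their output matrices have different numbers of columns.
vec_a = TfidfVectorizer()
vec_b = TfidfVectorizer()
X_train = vec_a.fit_transform(train_docs)
X_test_mismatched = vec_b.fit_transform(test_docs)

# Reusing the vectorizer fitted on the training text maps the test text
# onto the training vocabulary, so the columns line up.
X_test_aligned = vec_a.transform(test_docs)

print(X_train.shape, X_test_mismatched.shape, X_test_aligned.shape)
```

Words in the second dataset that never occurred in the training text are simply ignored by transform, which is usually what a classifier trained on the first dataset expects.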

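I think the only solution is to reshape the sparse matrix (numpy.float64) processed_TEST_features to match processed_TRAIN_features. I believe that is only possible because processed_TRAIN_features is smaller than processed_TEST_features. Or is there another way to achieve points 1 and 2 above? I could not apply the other question to my problem, and I am still looking for a way to make the two matrices equal in shape and size.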
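One thing worth checking before reshaping is what each column actually means. The sketch below, on made-up toy sentences, suggests that even when two separately fitted vectorizers happen to produce the same number of columns, the same column index can refer to different words, so matching the shapes alone would not make the matrices comparable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical), chosen so both vocabularies have three terms.
vec_train = TfidfVectorizer().fit(["apple banana", "banana cherry"])
vec_test = TfidfVectorizer().fit(["banana cherry", "cherry date"])

# vocabulary_ maps each term to its column index; invert it to see
# which word each column stands for.
col_train = {idx: term for term, idx in vec_train.vocabulary_.items()}
col_test = {idx: term for term, idx in vec_test.vocabulary_.items()}

for idx in range(len(col_train)):
    print(idx, col_train[idx], col_test[idx])
```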

If anyone can help me with this, please do. Thanks in advance.

The complete code is below:

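import re

import pandas as pd
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

DataPath2 = ".../train.csv"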
TRAIN_dataset = pd.read_csv(DataPath2)

DataPath1 = "..../completeDATAset.csv"
TEST_dataset = pd.read_csv(DataPath1)

TRAIN_features = TRAIN_dataset.iloc[:, 1 ].values
TRAIN_labels = TRAIN_dataset.iloc[:,0].values

TEST_features = TEST_dataset.iloc[:, 1 ].values
TEST_labeels = TEST_dataset.iloc[:,0].values
lab_enc = preprocessing.LabelEncoder()
TEST_labels = lab_enc.fit_transform(TEST_labeels)

processed_TRAIN_features = []

for sentence in range(0, len(TRAIN_features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))

    # Remove all single characters
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature)

    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature)

    # Remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature)

    # Remove single characters from the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)

    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    # Remove links
    processed_feature = re.sub(r"http\S+", "", processed_feature)

    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Removing rt
    processed_feature = re.sub(r'^rt\s+', '', processed_feature)

    # Converting to lowercase
    processed_feature = processed_feature.lower()

    processed_TRAIN_features.append(processed_feature)

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)


processed_TEST_features = []

for sentence in range(0, len(TEST_features)):
    # Remove all the special characters
    processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))

    # Remove all single characters
    processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)

    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature1)

    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature1)

    # Remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature1)

    # Remove single characters from the start
    processed_feature1 = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature1)

    # Substituting multiple spaces with single space
    processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)

    # Remove links
    processed_feature1 = re.sub(r"http\S+", "", processed_feature1)

    # Removing prefixed 'b'
    processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)

    # Removing rt
    processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)

    # Converting to lowercase
    processed_feature1 = processed_feature1.lower()

    processed_TEST_features.append(processed_feature1)

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)

text_classifier = svm.SVC(kernel='linear', class_weight="balanced" ,probability=True ,C=1 , random_state=0)

text_classifier.fit(X_train_data, y_train_data)

text_classifier.predict(processed_TEST_features)
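Since the two cleaning loops in the code above are identical apart from variable names, they could be folded into one function so both datasets are guaranteed to get the same treatment. The sketch below is only an illustration: clean_text is a hypothetical helper name, it covers just some of the original steps (in the same order), and the final strip() is an addition that the original code does not have.

```python
import re

def clean_text(raw):
    """Apply a subset of the question's cleaning steps to one document."""
    text = re.sub(r'\W', ' ', str(raw))          # non-word characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # isolated single letters
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    text = re.sub(r'http\S+', '', text)          # links
    text = re.sub(r'^b\s+', '', text)            # leading 'b' prefix
    text = re.sub(r'^rt\s+', '', text)           # leading 'rt'
    return text.lower().strip()                  # strip() is an added step

# Toy inputs (hypothetical) standing in for TRAIN_features / TEST_features.
sample = ["Hello,   World!", "b 'a nice   day'"]
cleaned = [clean_text(doc) for doc in sample]
print(cleaned)
```

The same function could then be applied to both lists, for example `[clean_text(t) for t in TRAIN_features]` and `[clean_text(t) for t in TEST_features]`.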

Title edit: predict dataset classification => predict dataset

Best answer

from scipy.sparse import csr_matrix
processed_TRAIN_features = csr_matrix(processed_TRAIN_features, shape=(new_row_length, new_column_length))  # placeholders for the target dimensions
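As an alternative to reshaping, a commonly used pattern is to fit the vectorizer on the training text only and reuse it for the second dataset, so the feature columns match by construction. The following is a minimal sketch under stated assumptions, not the answer author's method: the texts and labels are made-up toy data, and min_df, max_df and probability=True from the question are left out because they do not suit such a tiny example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins (hypothetical) for the cleaned training text, its labels,
# and the cleaned text of the second dataset.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film", "awful movie", "something unseen entirely"]

# The pipeline fits the vectorizer on the training text only and reuses
# that fitted vocabulary whenever predict() is called.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, use_idf=True),
    SVC(kernel="linear", class_weight="balanced", C=1, random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions)
```

With this arrangement there is no second fit_transform call, so the second dataset can have any number of rows and any wording without changing the number of feature columns.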

Regarding python - Machine learning: Predict second dataset on behalf of first dataset trained classifier, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59508927/
