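I'm very new to machine learning, and I'm hoping someone can walk me through this code and explain why it doesn't work. It's my own variation on the scikit-learn tutorial at: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html Here is basically what I'm trying to do: I need to train a model on a labeled training set so that it can predict the labels of a test set. It would also be very helpful if someone could show me how to save and load the model. Thank you very much. Here is what I have so far: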
import codecs
import os
import numpy as np
import pandas as pd
from Text_Pre_Processing import Pre_Processing
filenames = os.listdir(
    "...scikit-machine-learning/training_set")
files = []
array_data = []
array_label = []
for file in filenames:
    with codecs.open("...scikit-machine-learning/training_set/" + file, "r",
                     encoding='utf-8', errors='ignore') as file_data:
        open_file = file_data.read()
        open_file = Pre_Processing.lower_case(open_file)
        open_file = Pre_Processing.remove_punctuation(open_file)
        open_file = Pre_Processing.clean_text(open_file)
        files.append(open_file)
# ----------------------------------------------------
# PUTTING LABELS INTO LIST
for file in files:
    if 'inheritance' in file:
        array_data.append(file)
        array_label.append('Inheritance (object-oriented programming)')
    elif 'pagerank' in file:
        array_data.append(file)
        array_label.append('PageRank')
    elif 'vector space model' in file:
        array_data.append(file)
        array_label.append('Vector Space Model')
    elif 'bayes' in file:
        array_data.append(file)
        array_label.append('Bayes' + "'" + ' Theorem')
    else:
        array_data.append(file)
        array_label.append('Dynamic programming')
#----------------------------------------------------------
csv_array = []
for i in range(0, len(array_data)):
    csv_array.append([array_data[i], array_label[i]])
# format of array [[string, label], [string, label], [string, label]]
import csv
with open('data.csv', 'w') as target:
    writer = csv.writer(target)
    writer.writerows(zip(test_array))
data = pd.read_csv('data.csv')
numpy_array = data.as_matrix()
X = numpy_array[:, 0]
Y = numpy_array[:, 1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline(['vect', CountVectorizer(stop_words='english'), 'tfidf', TfidfTransformer(),
                     'clf', MultinomialNB()])
text_clf = text_clf.fit(X_train, Y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
I saw people online using a csv file to feed in their data, so I tried that too. I may not need it, so I apologize if that's wrong.
The error shown:
C:.../scikit-machine-learning/train.py:63: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  numpy_array = data.as_matrix()
Traceback (most recent call last):
  File "C:/...scikit-machine-learning/train.py", line 66, in <module>
    Y = numpy_array[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1
Thanks a lot for your help, and let me know if you need any further explanation.
Example of two entries in the csv:
"['dynamic programming is an algorithmic technique used to solve certain optimization problems where the object is to find the best solution from a number of possibilities it uses a so called bottomup approach meaning that the problem is solved as a set of subproblems which in turn are made up of subsubproblemssubproblems are then selected and used to solve the overall problem these subproblems are only solved once and the solutions are saved so that they will not need to be recalculated again whilst calculated individually they may also overlap when any subproblem is met again it can be found and reused to solve another problem since it searches all possibilities it is also very accurate this method is far more efficient than recalculating and therefore considerably reduces computation it is widely used in computer science and can be applied for example to compress data in high density bar codes dynamic programming is most effective and therefore most often used on objects that are ordered from left to right and whose order cannot be rearranged this means it works well on character chains for example ', 'Dynamic programming']"
"['inheritance is one of the basic concepts of object oriented programming its objective is to add more detail to preexisting classes whilst still allowing the methods and variables of these classes to be reused the easiest way to look at inheritance is as an is a kind of relationship for example a guitar is a kind of string instrument electric acoustic and steel stringed are all types of guitar the further down an inheritance tree you get the more specific the classes become an example here would be books books generally fall into two categories fiction and nonfiction each of these can then be subdivided into more groups fiction for example can be split into fantasy horror romance and many more nonfiction splits the same way into other topics such as history geography cooking etc history of course can be subdivided into time periods like the romans the elizabethans the world wars and so on', 'Inheritance (object-oriented programming)']"
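Best answer
You need to remove the characters [' and '] from the csv, because read_csv reads each row as a single string (one column) rather than as a two-column DataFrame. There was also a typo on the text_clf = Pipeline line (each step has to be a (name, estimator) tuple), so I fixed that too. Good luck!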
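To see where the single column comes from, here is a minimal sketch. It assumes the rows were written with `writerows(zip(...))` as in the question; the two short rows and the `write_rows` helper are made up for illustration and are not part of the original script.

```python
import csv
import io

# Hypothetical stand-in for the question's csv_array: [[text, label], ...]
rows = [
    ["dynamic programming is an algorithmic technique", "Dynamic programming"],
    ["inheritance is one of the basic concepts",
     "Inheritance (object-oriented programming)"],
]


def write_rows(rows_to_write):
    """Write rows to an in-memory CSV, then parse them back."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows_to_write)
    buffer.seek(0)
    return list(csv.reader(buffer))


# zip(rows) yields 1-tuples such as (["text", "label"],), so every CSV row
# holds a single field: the string form of the whole inner list.
one_column = write_rows(zip(rows))

# Passing the rows directly writes text and label as two separate fields.
two_columns = write_rows(rows)

print(len(one_column[0]), len(two_columns[0]))
```

If that assumption holds, dropping the `zip(...)` when writing would give read_csv two real columns. The fix below instead cleans up the existing file after reading it.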
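import numpy as np
import pandas as pd

# header=None: the file has no header row, so the first line is kept as data.
data = pd.read_csv('data.csv', header=None)
# .as_matrix() is deprecated (see the FutureWarning above); .values is the replacement.
numpy_array = data.values
strarr = numpy_array[:, 0]
# Each cell looks like "['text', 'label']": split on the comma, then strip the list syntax.
X = [s.split(",")[0].replace("[", '').replace("'", '') for s in strarr]
Y = [s.split(",")[1].replace("]", '').replace("'", '').strip() for s in strarr]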
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, Y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
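The question also asks how to save and load the model. One common option is joblib, which is installed as a scikit-learn dependency. The sketch below is not from the original answer: the four training texts, the file name `text_clf.joblib`, and the temp-directory location are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, for illustration only.
texts = [
    "pagerank ranks web pages by incoming links",
    "pagerank is a link analysis algorithm",
    "inheritance lets a class reuse methods of a parent class",
    "inheritance is a concept of object oriented programming",
]
labels = ["PageRank", "PageRank", "Inheritance", "Inheritance"]

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(texts, labels)

# Hypothetical location; any writable path works.
model_path = os.path.join(tempfile.gettempdir(), "text_clf.joblib")

# Persist the whole fitted pipeline (vectorizer, tf-idf weights, classifier).
joblib.dump(text_clf, model_path)

# Later, or in another script: load it back and predict on raw strings.
loaded_clf = joblib.load(model_path)
new_docs = ["pagerank link analysis", "a class with a parent class"]
print(loaded_clf.predict(new_docs))
```

Saving the pipeline rather than only the classifier means the loaded object can take raw text directly, because the fitted vocabulary travels with it. A file like this should only be loaded from a trusted source and with a compatible scikit-learn version.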
About machine-learning - Machine Learning/NLP text classification: training a model from corpus of text files - scikit learn: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51686078/