
c# - Text classification NaiveBayes

Reposted. Author: 行者123. Updated: 2023-11-30 09:18:52

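I am trying to classify a series of sample news texts by category. I have a large dataset of news texts in a database, each with a category. The machine should be trained on it and then decide the category of a news item.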

    public static string[] Tokenize(string text)
    {
        StringBuilder sb = new StringBuilder(text);

        char[] invalid = "!-;':'\",.?\n\r\t".ToCharArray();

        for (int i = 0; i < invalid.Length; i++)
            sb.Replace(invalid[i], ' ');

        return sb.ToString().Split(new[] { ' ' }, System.StringSplitOptions.RemoveEmptyEntries);
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        string strDSN = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source = c:\\users\\158820\\Documents\\Database4.accdb";
        string strSQL = "SELECT * FROM NewsRepository";
        // create Objects of ADOConnection and ADOCommand
        OleDbConnection myConn = new OleDbConnection(strDSN);
        OleDbDataAdapter myCmd = new OleDbDataAdapter(strSQL, myConn);
        myConn.Open();
        DataSet dtSet = new DataSet();
        myCmd.Fill(dtSet, "NewsRepository");
        DataTable dTable = dtSet.Tables[0];
        myConn.Close();

        StringBuilder sWords = new StringBuilder();
        string[][] swords = new string[dTable.Rows.Count][];
        int i = 0;

        foreach (DataRowView dr in dTable.DefaultView)
        {
            swords[i] = Tokenize(dr[1].ToString());
            i++;
        }

        Codification codebook = new Codification(dTable, new string[] { "NewsTitle", "Category" });
        DataTable symbols = codebook.Apply(dTable);
        int[][] inputs = symbols.ToJagged<int>(new string[] { "NewsTitle" });
        int[] outputs = symbols.ToArray<int>("Category");

        bagOfWords(inputs, outputs);
    }


    private static void bagOfWords(int[][] inputs, int[] outputs)
    {
        // Learn the bag-of-words codebook and save it for later reuse
        var bow = new BagOfWords<int>();
        var quantizer = bow.Learn(inputs);
        string filenamebow = Path.Combine(Application.StartupPath, "News_BOW.accord");
        Serializer.Save(obj: bow, path: filenamebow);
        double[][] histograms = quantizer.Transform(inputs);

        // Create the multi-class learning algorithm as one-vs-one,
        // using a ChiSquare kernel over the histograms:
        var teacher = new MulticlassSupportVectorLearning<ChiSquare, double[]>()
        {
            Learner = (p) => new SequentialMinimalOptimization<ChiSquare, double[]>()
            {
                // Complexity = 100 // Create a hard SVM
            }
        };

        // Learn a multi-class SVM using the teacher
        var svm = teacher.Learn(histograms, outputs);

        // Get the predictions for the inputs
        int[] predicted = svm.Decide(histograms);

        // Create a confusion matrix to check the quality of the predictions:
        var cm = new GeneralConfusionMatrix(predicted: predicted, expected: outputs);

        // Check the accuracy measure:
        double accuracy = cm.Accuracy;

        // Save the trained SVM next to the bag-of-words codebook
        string filename = Path.Combine(Application.StartupPath, "News_SVM.accord");
        Serializer.Save(obj: svm, path: filename);
    }

I am a bit confused about how to train the Accord.NET objects. I was able to serialize the trained model (about 106 MB, for 3,600 unique news items in 9 categories).

How can I use the model to predict the category of a new set of news texts?

Best answer

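Using the model on data that was not in the training set is as simple as asking the support vector machine to make another decision:

    svm.Decide(outofSampleData)

Since you have already serialized the trained model, you can instantiate the svm object with Serializer.Load<T>. The original answer links to the documentation for this.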
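As a rough sketch of how those two steps might fit together, the following helper reloads both files written by the question's bagOfWords method and classifies new rows. The class name NewsPredictor, the method name Predict, and the modelDirectory and encodedNews parameters are hypothetical. The generic type arguments simply mirror what the question's code constructed and saved, so they are assumptions about the asker's Accord.NET version rather than a verified signature. The sketch also assumes the new news has already been encoded into int[][] in the same way as the training inputs.

```csharp
using System.IO;
using Accord.IO;
using Accord.MachineLearning;
using Accord.MachineLearning.VectorMachines;
using Accord.Statistics.Kernels;

public static class NewsPredictor
{
    // Hypothetical helper: reloads the saved bag-of-words codebook and SVM,
    // then returns one predicted class index per row of encodedNews.
    public static int[] Predict(string modelDirectory, int[][] encodedNews)
    {
        // File names are the ones used in the question's bagOfWords method.
        string bowPath = Path.Combine(modelDirectory, "News_BOW.accord");
        string svmPath = Path.Combine(modelDirectory, "News_SVM.accord");

        // Assumption: these types match the objects that were serialized.
        var bow = Serializer.Load<BagOfWords<int>>(bowPath);
        var svm = Serializer.Load<MulticlassSupportVectorMachine<ChiSquare, double[]>>(svmPath);

        // Apply the same transformation that was applied before training,
        // so the SVM sees histograms in the feature space it was trained on.
        double[][] histograms = bow.Transform(encodedNews);

        // Ask the machine for a decision on the out-of-sample data.
        return svm.Decide(histograms);
    }
}
```

The returned integers are the codes that the Codification codebook assigned to the Category column. Turning them back into category names needs that same codebook, which the question's code does not save.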

Regarding c# - Text classification NaiveBayes, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47505910/
