c# - Backing store for training data in Numl.net, and adding to it to improve accuracy

Reposted. Author: 行者123. Updated: 2023-11-30 21:54:30

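I am very interested in using the Numl.net library to scan incoming emails and extract bits of data from them. As an example, suppose I want to extract a customer reference number from an email, where the number could be in either the subject line or the body content.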

void Main()
{
    // get the descriptor that describes the features and label from the training objects
    var descriptor = Descriptor.Create<Email>();

    // create a decision tree generator and teach it about the Email descriptor
    var decisionTreeGenerator = new DecisionTreeGenerator(descriptor);

    // load the training data
    var repo = new EmailTrainingRepository(); // inject this
    var trainingData = repo.LoadTrainingData(); // returns List<Email>

    // create a model based on our training data using the decision tree generator
    var decisionTreeModel = decisionTreeGenerator.Generate(trainingData);

    // create an email that should find C4567890
    var example1 = new Email
    {
        Subject = "Regarding my order C4567890",
        Body = "I am very unhappy with your level of service. My order has still not arrived."
    };

    // create an email that should find C89779237
    var example2 = new Email
    {
        Subject = "I want to return my goods",
        Body = "My customer number is C89779237 and I want to return my order."
    };

    // create an email that should find C3239544-1
    var example3 = new Email
    {
        Subject = "Customer needs an electronic invoice",
        Body = "Please reissue the invoice as a PDF for customer C3239544-1."
    };

    var email1 = decisionTreeModel.Predict<Email>(example1);
    var email2 = decisionTreeModel.Predict<Email>(example2);
    var email3 = decisionTreeModel.Predict<Email>(example3);

    // only add a prediction to the training repository once a human has confirmed it
    Console.WriteLine("The example1 was predicted as {0}", email1.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email1);
    }

    Console.WriteLine("The example2 was predicted as {0}", email2.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email2);
    }

    Console.WriteLine("The example3 was predicted as {0}", email3.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email3);
    }
}

// Define other methods and classes here
public class Email
{
    // Subject
    [Feature]
    public string Subject { get; set; }

    // Body
    [Feature]
    public string Body { get; set; }

    // This is the label or value that we wish to predict based on the supplied features
    [Label]
    public string CustomerNumber { get; set; }
}

// Keeps asking until the user answers y or n (case-insensitive)
static bool ReadBool(string question)
{
    while (true)
    {
        Console.WriteLine(question);
        string r = (Console.ReadLine() ?? "").ToLower();
        if (r == "y")
            return true;
        if (r == "n")
            return false;
        Console.WriteLine("!!Please Select a Valid Option!!");
    }
}

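There are a few things I haven't quite grasped, though.

  1. In a supervised network, do I need to rebuild the decision tree every time I run the application, or can I store it somehow and then reload it as and when required? I'm trying to save the processing time it takes to rebuild that decision tree on every run.

  2. Also, can the network continually add to its own training data as the data gets validated by a human? I.e. we have an initial training set, the network decides on an outcome, and if a human says "well done" the new example gets added to the training set in order to improve it. And vice versa when the network gets it wrong. I assume I can just add to the training set once a human has validated that a prediction is correct? Does my repo.Add(email) seem like a logical way to do this?

  3. If I do add to the training data, at what point does the training data become "more than required"?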

Best answer

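I don't think this is a good problem to solve with machine learning (although I'm interested in your findings). My concern is that customer numbers change over time, which would require you to regenerate the model each time. Classification algorithms such as Naive Bayes, decision trees, logistic regression and SVMs require you to know every class (i.e. every customer reference number) ahead of time.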

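You could instead use feature engineering and predict whether a given word is a customer reference number or not (i.e. 1 or 0). To do this, you simply engineer features like the following:

  1. IsWordStartsWithC (bool)
  2. Word length
  3. Digit count / word length
  4. Letter count / word length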

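Then use a decision tree or logistic regression classifier to predict whether the word is a CRN (customer reference number) or not. To extract the CRN from an email, simply iterate over each word in the email; if Model.Predict(word) outputs 1, you have hopefully captured the CRN for that email.

This approach should not need retraining when new customer numbers appear, because the model learns what a CRN looks like rather than memorizing individual numbers.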
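As a rough sketch of the feature engineering described above, the following computes the four features for each word and filters an email's text with a classifier passed in as a delegate. The names `WordFeatures`, `CrnExtractor`, `Describe` and `Extract` are hypothetical and not part of numl; the `isCrn` delegate stands in for whatever trained model you end up using.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-word feature vector mirroring the four features listed above.
public class WordFeatures
{
    public bool StartsWithC { get; set; }
    public int Length { get; set; }
    public double DigitRatio { get; set; }   // digit count / word length
    public double LetterRatio { get; set; }  // letter count / word length
}

public static class CrnExtractor
{
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };
    private static readonly char[] TrimChars = { '.', ',', ';', ':', '!', '?', '(', ')', '"', '\'' };

    // Computes the features for a single, non-empty word.
    public static WordFeatures Describe(string word)
    {
        return new WordFeatures
        {
            StartsWithC = word.StartsWith("C", StringComparison.Ordinal),
            Length = word.Length,
            DigitRatio = (double)word.Count(char.IsDigit) / word.Length,
            LetterRatio = (double)word.Count(char.IsLetter) / word.Length
        };
    }

    // Splits the text on whitespace, strips surrounding punctuation, and
    // keeps every word that the supplied classifier flags as a CRN.
    public static IEnumerable<string> Extract(string text, Func<WordFeatures, bool> isCrn)
    {
        return text
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim(TrimChars))
            .Where(w => w.Length > 0)
            .Where(w => isCrn(Describe(w)));
    }
}
```

For experimentation, a hand-written rule such as `f => f.StartsWithC && f.DigitRatio > 0.7` can stand in for the trained classifier; with a real model, the delegate would wrap its prediction call instead.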

  1. In a supervised network, do I need to rebuild the decision tree every time I run the application, or can I store it somehow and then reload it as and when required? I'm trying to save the processing time it takes to rebuild that decision tree on every run.

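You can store the generated model with the Model.Save() method, using any stream object. All supervised models in numl currently implement this base class. Apart from the neural network model, they should all save fine.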
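One way to avoid retraining on every run is a small "load if cached, otherwise train and save" helper. The sketch below is library-agnostic: `ModelCache` and `LoadOrTrain` are hypothetical names, and the training, saving and loading steps are passed in as delegates, so nothing here depends on a particular numl signature.

```csharp
using System;
using System.IO;

public static class ModelCache
{
    // Returns the model stored at 'path' when that file exists;
    // otherwise trains a new model, writes it to 'path', and returns it.
    public static TModel LoadOrTrain<TModel>(
        string path,
        Func<TModel> train,
        Action<TModel, Stream> save,
        Func<Stream, TModel> load)
    {
        if (File.Exists(path))
        {
            using (var input = File.OpenRead(path))
            {
                return load(input);
            }
        }

        var model = train();
        using (var output = File.Create(path))
        {
            // With numl, this is where the stream-based Model.Save() call would go.
            save(model, output);
        }
        return model;
    }
}
```

With numl, the `save` delegate would presumably wrap the stream-based Save() mentioned above. The matching `load` delegate is an assumption, so check which load method your numl version exposes. Delete the cached file whenever the training data changes so that the model is rebuilt.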

  2. Also, can the network continually add to its own training data as the data gets validated by a human? I.e. we have an initial training set, the network decides on an outcome, and if a human says "well done" the new example gets added to the training set in order to improve it. And vice versa when the network gets it wrong. I assume I can just add to the training set once a human has validated that a prediction is correct? Does my repo.Add(email) seem like a logical way to do this?

This is a good example of reinforcement learning. numl does not implement it at the moment, but hopefully it will in the near future :)

  3. If I do add to the training data, at what point does the training data become "more than required"?

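The best way to check this is to measure accuracy on both the training set and a held-out test set. You can keep adding training data for as long as accuracy on the test set keeps improving. If you find that accuracy drops on the test set while it continues to rise on the training set, the model is now overfitting and it is safe to stop adding more data.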
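The check described above can be sketched in a few lines. `OverfitCheck`, `Accuracy` and `LooksOverfit` are hypothetical helper names, and the prediction function is passed in as a delegate rather than tied to a specific numl call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverfitCheck
{
    // Fraction of examples whose predicted label equals the expected label.
    public static double Accuracy<TExample, TLabel>(
        IReadOnlyCollection<TExample> examples,
        Func<TExample, TLabel> predict,
        Func<TExample, TLabel> expected)
    {
        if (examples.Count == 0)
            throw new ArgumentException("No examples to score.", nameof(examples));

        int correct = examples.Count(
            e => EqualityComparer<TLabel>.Default.Equals(predict(e), expected(e)));
        return (double)correct / examples.Count;
    }

    // True when test accuracy fell between two training rounds while
    // training accuracy did not fall (a judgment call: "did not fall"
    // also covers a training accuracy that is already at its maximum).
    public static bool LooksOverfit(
        double trainBefore, double trainAfter,
        double testBefore, double testAfter)
    {
        return testAfter < testBefore && trainAfter >= trainBefore;
    }
}
```

After each batch of newly validated emails, retrain, score both sets with `Accuracy`, and compare against the previous round with `LooksOverfit`. Single-round dips can be noise, so it is reasonable to wait for the pattern to repeat before you stop adding data.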

Regarding "c# - Backing store for training data in Numl.net, and adding to it to improve accuracy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33010711/
