
python - Errors encountered with partial_fit in scikit-learn


When training with the partial_fit function in scikit-learn, I get the error below, yet the program does not terminate. How is that possible? The trained model even behaves correctly and produces correct output. Is this anything to worry about?

/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py:207: RuntimeWarning: divide by zero encountered in log
self.class_log_prior_ = (np.log(self.class_count_)

I am using the modified training function below because I have to maintain a constant list of labels/classes: partial_fit does not allow adding new classes/labels on subsequent runs, and the class prior is the same in each batch of training data:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)

        if partial:
            classes = self._encoder.fit_transform(list(set(classes)))
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            self._clf.fit(X, y)

        return self

Also, on the second call to partial_fit it throws the following error with class count = 2000 and 3592 training samples, when calling model = self.train(featureset, classes=labels, partial=partial):

self._clf.partial_fit(X, y, classes=list(set(classes)))
File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 277, in partial_fit
self._count(X, Y)
File "/usr/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 443, in _count
self.feature_count_ += safe_sparse_dot(Y.T, X)
ValueError: operands could not be broadcast together with shapes (2000,11430) (2000,10728) (2000,11430)

Based on the error thrown, where am I going wrong? Does it mean I am pushing data of inconsistent dimensions? I tried the following, so that I now call

X = self._vectorizer.transform(X)
y = self._encoder.transform(y)

on every call to partial_fit. Earlier I used fit_transform for every partial_fit call. Is this correct?

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self._clf

After many attempts I was able to get the following code working by handling the first call specially. However, I assumed that the size of the pickled classifier file would grow after each iteration, yet I get a pkl file of the same size for every batch, which should not be possible:

class MySklearnClassifier(SklearnClassifier):

    def train(self, labeled_featuresets, classes=None, partial=False, firstcall=True):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            if firstcall:
                classes = self._encoder.fit_transform(np.unique(classes))
                X = self._vectorizer.fit_transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y, classes=classes)
            else:
                X = self._vectorizer.transform(X)
                y = self._encoder.fit_transform(y)
                self._clf.partial_fit(X, y)
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self

The full code is below:

class postagger(ClassifierBasedTagger):
    """
    A classifier based postagger.
    """
    # MySklearnClassifier()
    def __init__(self, feature_detector=None, train=None, estimator=None,
                 classifierinstance=None, backoff=None,
                 cutoff_prob=None, verbose=True):

        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers

        if estimator:
            classifier = MySklearnClassifier(estimator=estimator)
            # MySklearnClassifier.__init__(self, classifier)
        elif classifierinstance:
            classifier = classifierinstance

        if feature_detector is not None:
            self._feature_detector = feature_detector
            # The feature detector function, used to generate a featureset
            # for each token: feature_detector(tokens, index, history) -> featureset

        self._cutoff_prob = cutoff_prob
        """Cutoff probability for tagging -- if the probability of the
        most likely tag is less than this, then use backoff."""

        self._classifier = classifier
        """The classifier used to choose a tag for each token."""

        # if train and picklename:
        #     self._train(classifier_builder, picklename, tagged_corpus=train,
        #                 ONLYERRORS=False, verbose=True, onlyfeatures=True,
        #                 LOADCONSTRUCTED=None)

    def legacy_getfeatures(self, tagged_corpus=None, ONLYERRORS=False,
                           existingfeaturesetfile=None, verbose=True,
                           labels=artlabels):

        featureset = []
        labels = artlabels
        if not existingfeaturesetfile and tagged_corpus:
            if ONLYERRORS:
                classifier_corpus = open(tagged_corpus + '-ONLYERRORS.richfeature', 'w')
            else:
                classifier_corpus = open(tagged_corpus + '.richfeature', 'w')

            if verbose:
                print('Constructing featureset for training corpus for classifier.')
            nlp = English()
            # df = pandas.DataFrame()
            store = HDFStore('featurestore.h5')

            for sentence in sPickle.s_load(open(tagged_corpus, 'r')):
                untagged_words, tags, senindex = zip(*sentence)
                doc = nlp(u' '.join(untagged_words))
                # untagged_sentence, tags, rest = unpack_three(*zip(*sentence))
                for index in range(len(sentence)):
                    if ONLYERRORS:
                        if tags[index] == '<!SAME!>' and random.random() < 0.05:
                            featureset = self.new_feature_detector(doc, index)
                            sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                            featureset['label'] = tags[index]
                            featureset['senindex'] = str(senindex[0])
                            featureset['wordindex'] = index
                            df = pandas.DataFrame([featureset])
                            store.append('df', df, index=False, min_itemsize=150)
                            # classifier_corpus.append((featureset, tags[index]))
                    elif tags[index] in labels:
                        featureset = self.new_feature_detector(doc, index)
                        sPickle.s_dump_elt((featureset, tags[index]), classifier_corpus)
                        featureset['label'] = tags[index]
                        featureset['senindex'] = str(senindex[0])
                        featureset['wordindex'] = index
                        df = pandas.DataFrame([featureset])
                        store.append('df', df, index=False, min_itemsize=150)
                        # classifier_corpus.append((featureset, tags[index]))
        # else:
        #     for element in sPickle.s_load(open(existingfeaturesetfile, 'w')):
        #         featureset.append(element)

        return tagged_corpus + '.richfeature'

    def _train(self, featuresetdata, classifier_builder=MultinomialNB(),
               partial=False, batchsize=500):
        """
        Build a new classifier, based on the given training data
        *tagged_corpus*.
        """

        # labels = set(cPickle.load(open(arguments['-k'], 'r')))
        if partial == False:
            print('Training classifier FULLMODE')
            featureset = []
            for element in sPickle.s_load(open(featuresetdata, 'r')):
                featureset.append(element)

            model = self._classifier.train(featureset, classes=artlabels,
                                           partial=False, firstcall=True)
            print('Training complete, dumping')
            try:
                joblib.dump(model, str(featuresetdata) + '-FULLTRAIN ' +
                            slugify(str(classifier_builder))[:10] + '.mpkl')
                print "joblib dumped"
            except:
                print "joblib error"
                cPickle.dump(model, open(str(featuresetdata) + '-FULLTRAIN ' +
                                         slugify(str(classifier_builder))[:10] + '.cmpkl', 'w'))
            print('dumped')
            return
            # joblib.dump(self._classifier, str(datetime.datetime.now().hour) + '-naivebayes.pickle', compress=0)

        print('Training classifier each batch of {} training points'.format(batchsize))

        for i, batchelement in enumerate(batch(sPickle.s_load(open(featuresetdata, 'r')), batchsize)):
            featureset = []
            for element in batchelement:
                featureset.append(element)

            # model = super(postagger, self).train(featureset, partial)
            # featureset = [item for sublist in featureset for item in sublist]
            trainsize = len(featureset)
            print("submitting {} training points for training\neg last one:".format(trainsize))
            for d, l in featureset:
                if len(d) != 113:
                    print d
                    assert False

            print featureset[-1]
            # pdb.set_trace()
            try:
                if i == 0:
                    model = self._classifier.train(featureset, classes=artlabels,
                                                   partial=True, firstcall=True)
                else:
                    model = self._classifier.train(featureset, classes=artlabels,
                                                   partial=True, firstcall=False)
            except:
                type, value, tb = sys.exc_info()
                traceback.print_exc()
                pdb.post_mortem(tb)

            print('Training for batch {} complete, dumping'.format(i))
            cPickle.dump(model, open(
                str(featuresetdata) + '-' + slugify(str(classifier_builder))[:10] +
                'UPDATED batch-{} of {} points.mpkl'.format(i, trainsize), 'w'))
            print('dumped')
            # joblib.dump(self._classifier, str(datetime.datetime.now().hour) + '-naivebayes.pickle', compress=0)

    def untag(self, tagged_sentence):
        """
        Given a tagged sentence, return an untagged version of that
        sentence. I.e., return a list containing the first element
        of each tuple in *tagged_sentence*.

            >>> from nltk.tag.util import untag
            >>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
            ['John', 'saw', 'Mary']
        """
        return [w[0] for w in tagged_sentence]

    def evaluate(self, gold):
        """
        Score the accuracy of the tagger against the gold standard.
        Strip the tags from the gold standard text, retag it using
        the tagger, then compute the accuracy score.

        :type gold: list(list(tuple(str, str)))
        :param gold: The list of tagged sentences to score the tagger on.
        :rtype: float
        """
        gold_tokens = []
        full_gold_tokens = []

        tagged_sents = self.tag_sents(self.untag(sent) for sent in gold)
        for sentence in gold:  # flatten the list
            (untagged_sentences, goldtags, type_feature, startpos_feature,
             sentence_feature, senindex_feature) = zip(*sentence)

            gold_tokens.extend(zip(untagged_sentences, goldtags))
            full_gold_tokens.extend(zip(untagged_sentences, goldtags, type_feature,
                                        startpos_feature, sentence_feature, senindex_feature))

        test_tokens = sum(tagged_sents, [])  # flatten the list
        getmismatch(gold_tokens, test_tokens, full_gold_tokens)
        return accuracy(gold_tokens, test_tokens)

    def new_feature_detector(self, tokens, index):
        return getfeatures(tokens, index)

    def tag_sents(self, sentences):
        """
        Apply ``self.tag()`` to each element of *sentences*. I.e.:

            return [self.tag(sent) for sent in sentences]
        """
        return [self.tag(sent) for sent in sentences]

    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i))
        return list(zip(tokens, tags))

    def tag_one(self, tokens, index):
        """
        Determine an appropriate tag for the specified token, and
        return that tag. If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.

        :rtype: str
        :type tokens: list
        :param tokens: The list of words that are being tagged.
        :type index: int
        :param index: The index of the word whose tag should be
            returned.
        :type history: list(str)
        :param history: A list of the tags for all words before *index*.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index)
            if tag is not None:
                break
        return tag

    def choose_tag(self, tokens, index):
        # Use our feature detector to get the featureset.
        featureset = self.new_feature_detector(tokens, index)

        # Use the classifier to pick a tag. If a cutoff probability
        # was specified, then check that the tag's probability is
        # higher than that cutoff first; otherwise, return None.
        if self._cutoff_prob is None:
            return self._classifier.prob_classify_many([featureset])
            # return self._classifier.classify_many([featureset])

        pdist = self._classifier.prob_classify_many([featureset])
        tag = pdist.max()
        return tag if pdist.prob(tag) >= self._cutoff_prob else None

Best Answer

1. The RuntimeWarning

You get this warning because np.log is being called on 0:

In [6]: np.log(0)
/home/anaconda/envs/python34/lib/python3.4/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
if __name__ == '__main__':
Out[6]: -inf

That happens because in one of your calls some classes are not represented at all (their count is 0), so np.log is called on 0. You don't need to worry about it.
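
For illustration only (not part of the original exchange), here is a minimal sketch of how this arises with MultinomialNB: a class that is declared up front but absent from a batch keeps a zero count, so its log prior becomes -inf. Older scikit-learn releases, such as the 0.x version on Python 2.7 used in the question, emit the RuntimeWarning at this point; newer releases may silence it.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Three classes are declared, but this batch only contains classes 0 and 1.
X_batch = np.array([[1, 0, 2],
                    [0, 3, 1],
                    [2, 1, 0]])
y_batch = np.array([0, 1, 0])

clf = MultinomialNB()
clf.partial_fit(X_batch, y_batch, classes=[0, 1, 2])

print(clf.class_count_)      # [2. 1. 0.] -- class 2 was never seen
print(clf.class_log_prior_)  # the entry for class 2 is -inf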

2. Class priors

I am using the following modified training function as I have to maintain a constant list of labels\classes as the partial_fit does not allow adding new classes\labels on subsequent runs , the class prior is same in each batch of training data

  • You are right that when you use partial_fit you need to pass the full list of labels/classes from the very beginning.
  • I am not sure what you mean by "the class prior is the same in each batch of training data". It could mean several different things, and it would be great if you could clarify.
    In the meantime, the default behaviour of classifiers such as MultinomialNB is to fit the priors to the data (basically they compute class frequencies). When you use partial_fit, they perform this computation incrementally, so you end up with the same result as a single fit call (see the sketch below).
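
A small sketch of that equivalence (my own illustration, not code from the question): the priors estimated over two partial_fit calls match the ones from a single fit over the same data.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 10))
y = rng.randint(0, 3, size=100)

full = MultinomialNB().fit(X, y)                      # one fit over everything

inc = MultinomialNB()                                 # same data, two increments
inc.partial_fit(X[:50], y[:50], classes=[0, 1, 2])
inc.partial_fit(X[50:], y[50:])

# The incrementally estimated priors match the single-shot fit.
print(np.allclose(full.class_log_prior_, inc.class_log_prior_))  # True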

3. Your error

Also on the second call to partial_fit it throws following error for class count=2000 , and training samples are 3592 on calling model = self.train(featureset, classes=labels,partial=partial)

Here we would need more details. I am confused because the shape of X should be (n_samples, n_features), yet in the traceback it looks like (2000, 11430), which would mean X has 2000 samples.

The error does mean that you are feeding in data of inconsistent dimensions. I would suggest printing X.shape and y.shape after vectorization for every partial_fit call.

Also, you should not call fit or fit_transform on the vectorizer that transforms X for every partial_fit call: you should fit it once and then only transform X. This ensures that the transformed X always has consistent dimensions.
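
Here is a minimal sketch of that fit-once / transform-later pattern. It assumes the _vectorizer is a DictVectorizer and the _encoder a LabelEncoder, as in NLTK's SklearnClassifier; the feature dicts and label names are made up for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

labels = ['NOUN', 'VERB', 'OTHER']            # full label set, known up front
vectorizer = DictVectorizer()
encoder = LabelEncoder().fit(labels)
clf = MultinomialNB()

batch1 = ([{'w': 'dog', 'len': 3}, {'w': 'runs', 'len': 4}], ['NOUN', 'VERB'])
batch2 = ([{'w': 'cat', 'len': 3}], ['NOUN'])

# Fit the vectorizer once, on the first batch only ...
X1 = vectorizer.fit_transform(batch1[0])
clf.partial_fit(X1, encoder.transform(batch1[1]),
                classes=encoder.transform(labels))

# ... then only transform later batches, so the feature space stays fixed.
X2 = vectorizer.transform(batch2[0])          # same number of columns as X1
clf.partial_fit(X2, encoder.transform(batch2[1]))

Note that with this pattern, features that never occur in the first batch are simply dropped from later batches; in practice you would fit the vectorizer on the full feature vocabulary beforehand, or use a fixed-width hasher, so later batches are not truncated to whatever the first batch happened to contain.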

4. Your "earlier" solution

This is the code you have told us you are using:

class MySklearnClassifier(SklearnClassifier):
    def train(self, labeled_featuresets, classes=None, partial=False):
        """
        Train (fit) the scikit-learn estimator.

        :param labeled_featuresets: A list of ``(featureset, label)``
            where each ``featureset`` is a dict mapping strings to either
            numbers, booleans or strings.
        """

        X, y = list(compat.izip(*labeled_featuresets))

        if partial:
            classes = self._encoder.fit_transform(np.unique(classes))
            X = self._vectorizer.transform(X)
            y = self._encoder.transform(y)
            self._clf.partial_fit(X, y, classes=list(set(classes)))
        else:
            X = self._vectorizer.fit_transform(X)
            y = self._encoder.fit_transform(y)
            self._clf.fit(X, y)

        return self._clf

As far as I can tell there is nothing terribly wrong with it, but we really need more context on how you are using it.
A nitpick: I feel it would be clearer to store the classes variable as a class attribute, since it needs to be the same for every partial_fit call.
If you are passing different values for the classes argument across calls, that could be where things go wrong.
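
As a rough sketch of that suggestion (my own hypothetical variant, not code from the answer), the label set could be fixed once in the constructor and reused for every call; it relies on the same np, compat and SklearnClassifier imports as the code in the question.

class MySklearnClassifier(SklearnClassifier):
    def __init__(self, estimator, classes=None, **kwargs):
        SklearnClassifier.__init__(self, estimator, **kwargs)
        # Fix the label set once; every partial_fit call reuses it.
        self._classes = self._encoder.fit_transform(np.unique(classes))

    def train(self, labeled_featuresets, partial=False, firstcall=True):
        X, y = list(compat.izip(*labeled_featuresets))
        if not partial:
            self._clf.fit(self._vectorizer.fit_transform(X),
                          self._encoder.transform(y))
        elif firstcall:
            # First batch: fit the vectorizer, declare all classes.
            self._clf.partial_fit(self._vectorizer.fit_transform(X),
                                  self._encoder.transform(y),
                                  classes=self._classes)
        else:
            # Later batches: transform only, no refitting.
            self._clf.partial_fit(self._vectorizer.transform(X),
                                  self._encoder.transform(y))
        return self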

More information that would help us help you:

  • Printouts of X.shape and y.shape.
  • Context: how exactly are you using the code you showed us?
  • What are you using for _vectorizer and _encoder? And which classifier are you using in the end?

Regarding "python - Errors encountered with partial_fit in scikit-learn", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32697093/
