python - 'utf-8' codec error when using doc2vec

Reposted by: 太空宇宙 | Updated: 2023-11-03 15:08:55

My program will not run and fails with a decoding error. I am using gensim and trying out its Doc2Vec module, and I get the error shown below. Code:

# Method of the asker's LabeledLineSentence class (the rest of the class is not shown).
def to_array(self):
    self.sentences = []
    for source, prefix in self.sources.items():
        with utils.smart_open(source) as fin:
            for item_no, line in enumerate(fin):
                # One labelled sentence per line, tagged "<prefix>_<line number>".
                self.sentences.append(LabeledSentence(
                    utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
    return self.sentences

sentences = LabeledLineSentence(sources)
model = Doc2Vec(min_count=1, window=10, size=100, dm_mean=0, sample=1e-5,
                negative=5, workers=12)
model.build_vocab(sentences.to_array())

Error:

File "<ipython-input-88-eab20df20acc>", line 75, in <module>
    model.build_vocab(sentences.to_array())

File "<ipython-input-88-eab20df20acc>", line 58, in to_array
    utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))

File "C:\Users\summert\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 21: invalid continuation byte
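
The same error can be reproduced without gensim. The sketch below assumes the input file was saved in a single-byte encoding such as Latin-1, where the byte 0xef stands for "ï". That is only an assumption for illustration; the question does not say how the file was actually encoded.

```python
# Assumption: the text was written as Latin-1, where "ï" is the single byte 0xef.
raw = "naïve".encode("latin-1")   # b'na\xefve'

try:
    # Strict UTF-8 decoding, as in the traceback's unicode(text, encoding, errors=errors).
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason           # why the decoder gave up
    position = exc.start          # index of the offending byte
    print(f"byte {raw[position]:#x} at position {position}: {reason}")

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
```

In UTF-8, 0xef announces a three-byte sequence, so the decoder fails as soon as the next byte is an ordinary character rather than a continuation byte.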

Best answer

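It looks like this Anaconda gensim code is being handed raw bytes where it expects UTF-8 text, so model.build_vocab(sentences.to_array()) is not getting the type of input it wants.

Where did you find to_unicode, and where is "utils" imported from? I don't think that is standard Python 3.

Given that you are on Python 3, you may not need any conversion at all.

Simply replace

(LabeledSentence(utils.to_unicode(line).split()...

with

(LabeledSentence(line.split()...

If that does not work, try:

(LabeledSentence(line.encode('utf-8').split()...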
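
If the file really does contain bytes that are not valid UTF-8, another option is to decode each line explicitly and decide how bad bytes should be handled. The sketch below uses only the standard library. The helper name read_tokenized_lines and the sample file are hypothetical, and errors="replace" is just one possible policy, not something the answer above prescribes.

```python
import os
import tempfile


def read_tokenized_lines(path, encoding="utf-8", errors="replace"):
    """Yield (line_number, tokens) per line, tolerating undecodable bytes."""
    # Open in binary mode so that decoding stays under our control.
    with open(path, "rb") as fin:
        for item_no, raw_line in enumerate(fin):
            # errors="replace" substitutes U+FFFD for each undecodable byte
            # instead of raising UnicodeDecodeError.
            yield item_no, raw_line.decode(encoding, errors=errors).split()


# Hypothetical sample file: the second line contains a Latin-1 byte (0xef).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good movie\nna\xefve plot\n")
    sample_path = tmp.name

rows = list(read_tokenized_lines(sample_path))
os.remove(sample_path)
print(rows)
```

If the real encoding of the file is known, passing it as encoding (for example "latin-1") keeps the original characters instead of replacing them. gensim's utils.to_unicode also takes encoding and errors arguments, as the traceback's unicode(text, encoding, errors=errors) call suggests, so the same policy could probably be applied there.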

Regarding "python - 'utf-8' codec error when using doc2vec", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44405685/
