python - Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in Google Colab


Reposted by: 行者123  Last updated: 2023-12-04 17:14:15


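import PyPDF4
import tensorflow as tf
from google.colab import files

# Upload ITC-1.pdf into the Colab runtime.
files.upload()
fileReader = PyPDF4.PdfFileReader('ITC-1.pdf')

# Concatenate the extracted text of every page from the third page onward.
s=""
for i in range(2, fileReader.numPages):
    s+=fileReader.getPage(i).extractText()

# Split the extracted text into sentences at each '.'.
sentences = []
while s.find('.') != -1:
    index = s.find('.')
    sentences.append(s[:index])
    s = s[index+1:]

# Note: this reads the raw PDF file itself, not the text extracted above.
# vectorize_layer is the TextVectorization layer from the tutorial linked below.
text_ds = tf.data.TextLineDataset('ITC-1.pdf').filter(lambda x: tf.cast(tf.strings.length(x), bool))
vectorize_layer.adapt(text_ds.batch(1024))
inverse_vocab = vectorize_layer.get_vocabulary()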

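The last line of the code above raises the error. I have read several posts to understand what it means, but none of the suggested solutions worked for me. I can't use my local machine because I need access to a GPU. Please suggest a workaround. Thanks!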

PS: I am following the code at https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb#scrollTo=haJUNjSB60Kh ; the only difference is how I read the file. If there is a better way to do that, please let me know!

Best answer

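import pdfplumber
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

# Extract the text of every page and write it to a plain-text file.
with pdfplumber.open(r'test.pdf') as pdf, open('test.txt', 'w', encoding='utf-8') as f:
    for page in pdf.pages:
        # extract_text() may return None for a page without extractable text,
        # depending on the pdfplumber version.
        f.write((page.extract_text() or '') + '\n')

layer = preprocessing.TextVectorization()
# Read the text file line by line and drop empty lines.
text_ds = tf.data.TextLineDataset('test.txt').filter(lambda x: tf.cast(tf.strings.length(x), bool))

layer.adapt(text_ds.batch(1024))
inverse_vocab = layer.get_vocabulary()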

You can do it as follows:

Read the PDF with pdfplumber.
Write the pages to a text file.
Then create the dataset from that text file.
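
As a rough illustration of why the text file helps: the likely cause of the error is that tf.data.TextLineDataset was pointed at the PDF itself, so raw binary bytes ended up being decoded as UTF-8. The sketch below reproduces the same error message in plain Python. The byte string raw_line and the helper try_decode are made up for this example and are not taken from the question.

```python
# A PDF is a binary container, so its "lines" can begin with arbitrary bytes.
# raw_line is a made-up stand-in for such a line (illustrative assumption).
raw_line = b"\xb0\x9f\x12 binary stream data"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or the decoder's error message on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"decode failed: {exc}"


# 0xb0 is a UTF-8 continuation byte, so it cannot start a character.
binary_result = try_decode(raw_line)

# Text that was written out as UTF-8 decodes without problems.
text_result = try_decode("Temperature: 25°C".encode("utf-8"))

print(binary_result)
print(text_result)
```

Text written to a file by pdfplumber, as in the answer, is ordinary encoded text rather than raw PDF bytes, so it should decode cleanly when the vocabulary is read back.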

Regarding python - Error: 'utf-8' codec can't decode byte 0xb0 in position 0: invalid start byte in Google Colab, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68982894/



