UnicodeDecodeError: unexpected end of data

Reposted by: 行者123  Updated: 2023-12-04 03:52:15

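I have a huge text file that I want to open.
I'm reading the file in chunks to avoid the memory problems that come with reading too much of it at once.

Code snippet: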

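import re

def open_delimited(fileName, args):
    # args is the mode string passed straight to open(), e.g. "r"
    with open(fileName, args, encoding="UTF16") as infile:
        chunksize = 10000
        remainder = ''
        # read() returns '' at end of file, which stops the iterator
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.findall(r"(\d+)\s+(\d+_\d+)", remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            # carry the last match over to the next chunk;
            # pieces[-1] raises IndexError when nothing matched
            remainder = '{} {} '.format(*pieces[-1])
        if remainder:
            yield remainder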

The code raises the error UnicodeDecodeError: 'utf16' codec can't decode bytes in position 8190-8191: unexpected end of data.

I tried UTF8 and got the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte.
latin-1 and iso-8859-1 raise IndexError: list index out of range.
Sample of the input file:

b'\xff\xfe1\x000\x000\x005\x009\x00\t\x001\x000\x000\x005\x009\x00_\x009\x007\x004\x007\x001\x007\x005\x003\x001\x000\x009\x001\x00\t\x00\t\x00P\x00o\x00s\x00t\x00\t\x001\x00\t\x00H\x00a\x00p\x00p\x00y\x00 \x00B\x00i\x00r\x00t\x00h\x00d\x00a\x00y\x00\t\x002\x000\x001\x001\x00-\x000\x008\x00-\x002\x004\x00 \x00'

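I should also mention that I have several of these huge text files. UTF16 works fine for many of them, but fails on one specific file.

Is there any way to resolve this issue?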

Best answer

To ignore corrupted data (which can lead to data loss), set errors='ignore' on the open() call:

with open(fileName, args, encoding="UTF16", errors='ignore') as infile:

The open() function documentation states:

'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
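As a rough illustration of what the error handlers do, the sketch below decodes a deliberately truncated UTF-16 byte string in memory. The sample text and the one-byte cut are assumptions made up for this example, not data from the asker's file; the same errors values apply to open().

```python
# Hypothetical sample: a short string encoded as UTF-16 (with BOM),
# then cut one byte short to simulate a damaged file.
data = "10059\tPost".encode("utf-16")[:-1]

# The default handler ('strict') raises on the dangling byte.
try:
    data.decode("utf-16")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# 'ignore' silently drops the bytes it cannot decode.
ignored = data.decode("utf-16", errors="ignore")

# 'replace' marks the damaged spot with U+FFFD instead of hiding it.
replaced = data.decode("utf-16", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Using errors='replace' rather than 'ignore' at least leaves a visible marker where data was lost, which can make later debugging easier.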

Ignoring errors this way does not mean you can recover from the apparent data corruption you are experiencing.

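To illustrate, imagine a byte was dropped from or added to your file somewhere. UTF-16 is a codec that uses 2 bytes per code unit (most characters take one unit, characters outside the Basic Multilingual Plane take two). If a single byte is missing or surplus, every byte pair after that point is out of alignment.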
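The misalignment described above can be reproduced with a small in-memory sketch. The text and the position of the dropped byte are arbitrary choices for illustration only.

```python
# Hypothetical sample text, encoded as little-endian UTF-16 without a BOM.
text = "10059\tHappy Birthday"
good = text.encode("utf-16-le")

# Simulate corruption by dropping a single byte (index 3, chosen arbitrarily).
bad = good[:3] + good[4:]

# Every byte pair after the gap now straddles two original characters,
# so the decoder produces unrelated code points instead of the text.
decoded = bad.decode("utf-16-le", errors="replace")

print(repr(text))
print(repr(decoded))
```

Only the first character survives; everything after the dropped byte decodes to unrelated characters, even though most of those byte pairs are still valid UTF-16.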

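This can cause decoding problems further down the line, not necessarily right away. Some code points are illegal in UTF-16, usually because they are only valid in combination with another byte pair; your exception was thrown for such an invalid code point. But there may have been hundreds or thousands of byte pairs before that point that were valid UTF-16, even if they were not legible text.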
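One way to see where decoding actually breaks is to inspect the start and end attributes of the UnicodeDecodeError. The sketch below is a minimal example under made-up assumptions: the byte string is constructed by hand with a lone low surrogate (0xDC00) in the middle, which is not necessarily the kind of damage present in the asker's file.

```python
# Hypothetical data: valid text, then a lone low surrogate, then more text.
data = (
    "valid text ".encode("utf-16-le")
    + b"\x00\xdc"                      # 0xDC00 on its own is illegal in UTF-16
    + "more".encode("utf-16-le")
)

prefix = None
try:
    data.decode("utf-16-le")
except UnicodeDecodeError as exc:
    # exc.start and exc.end are byte offsets of the undecodable span.
    prefix = data[:exc.start].decode("utf-16-le")
    print(f"decoding failed at bytes {exc.start}-{exc.end - 1}: {exc.reason}")
    print(f"{len(prefix)} characters decoded cleanly before that")
```

The reported position is only where the decoder gave up. As the answer notes, the real damage may sit much earlier in the file.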

Regarding UnicodeDecodeError: unexpected end of data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18357675/
