python - Why does file_object.tell() give the same byte position for different locations in a file?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 10:00:08

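I'm just getting started with Python, and I can't get my head around the basic file navigation methods.

When I read the tell() tutorial, it states that tell() returns my current position in the file, in bytes.

My reasoning is that every character in the file adds to the byte coordinate, right? That means that after a new line, which is just a string split on the \n character, my byte coordinate should change... but this does not seem to be the case.

I generate a quick toy text file in bash: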

$ for i in {1..10}; do echo "@ this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt

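Now I iterate over this file, printing each line together with a line counter and, on every pass through the loop, the result of the tell() call. The @ marks lines that delimit blocks of the file, and those are the lines I want to return (see below).

My guess is that the for loop first runs through the whole file object to the end, which is why the position always stays the same.

This is a toy example. In my real problem the files are gigabytes long, and with the same method the tell() results change in blocks, which I imagine reflects how the for loop traverses the file object. Is this correct? Could you shed some light on the concept I am missing?

My final goal is to be able to look up specific coordinates in a file and then process these huge files in parallel from distributed starting points, which I cannot do while the positions I record during screening are unreliable.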

import os.path

os.path.getsize("toy.txt")
451

fa = open("toy.txt")
fa.seek(0) # let's double check
fa.tell()
count = 0
for line in fa:
    if line.startswith("@"):
        print line ,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1

Output:

@ this is the 1th line
tell 451 count 0
@ this is the 2th line
tell 451 count 1
@ this is the 3th line
tell 451 count 2
@ this is the 4th line
tell 451 count 3
@ this is the 5th line
tell 451 count 4
@ this is the 6th line
tell 451 count 5
@ this is the 7th line
tell 451 count 6
@ this is the 8th line
tell 451 count 7
@ this is the 9th line
tell 451 count 8
@ this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19

Best answer

You are reading the file line by line with a for loop:

for line in fa:

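Files do not work this way natively; data is read in blocks, usually in sizeable chunks. For Python to give you lines, it has to read up to the next newline. Reading byte by byte to find each newline, however, is not efficient.

So a buffer is used: a large chunk is read, the newlines in that chunk are located, and a line is yielded for each newline found. Once the buffer is exhausted, a new chunk is read.

Your file is not large enough to need more than one chunk; it is only 451 bytes, while the buffer is typically measured in kilobytes. If you created a larger file, you would see the file position jump in large steps as you iterate.

See the file.next documentation (next is the method responsible for producing the next line during iteration, which is what the for loop uses):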
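The read-ahead behaviour described above can be imitated in a few lines. This is a minimal Python 3 sketch, not the interpreter's actual implementation; the helper name buffered_lines and the chunk sizes are made up for illustration. For each line it yields, it also reports how far the underlying stream has already been read.

```python
import io

def buffered_lines(stream, chunk_size):
    """Yield (line, underlying_position) pairs from a binary stream.

    Hypothetical helper: reads chunk_size bytes at a time, splits lines
    out of the buffer, and reports how far the stream was already read.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Hand out every complete line currently held in the buffer.
        while True:
            newline = buffer.find(b"\n")
            if newline == -1:
                break
            line, buffer = buffer[:newline + 1], buffer[newline + 1:]
            yield line, stream.tell()
    if buffer:
        # Final line without a trailing newline.
        yield buffer, stream.tell()

# 20 lines of 10 bytes each, 200 bytes in total.
data = b"".join(b"line %04d\n" % i for i in range(20))

# Chunk larger than the data: every line reports the end of the stream.
big = [pos for _, pos in buffered_lines(io.BytesIO(data), 8192)]
# Chunk smaller than the data: the reported position jumps chunk by chunk.
small = [pos for _, pos in buffered_lines(io.BytesIO(data), 64)]
print(sorted(set(big)), sorted(set(small)))
```

With the large chunk size, all 20 lines report the same position (the end of the data), which mirrors the repeated 451 in the question. With the small chunk size, the position advances only when a new chunk is fetched.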

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.

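If you need to track the absolute file position while iterating over lines, you have to open the file in binary mode on Windows (to prevent newline translation from taking place) and keep track of the line lengths yourself:

position = 0
for line in fa:
    position += len(line)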
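As a slightly fuller illustration of the manual bookkeeping above, the following Python 3 sketch records the byte offset at which each marked line starts. The function name index_marked_lines and the sample data are assumptions made for this example; the stream must be opened in binary mode so that len(line) counts bytes.

```python
import io

def index_marked_lines(stream, marker=b"@"):
    """Return (offsets, total_bytes) for a binary stream.

    offsets holds the byte position at which each line starting with
    marker begins; total_bytes is the number of bytes iterated over.
    """
    offsets = []
    position = 0
    for line in stream:
        if line.startswith(marker):
            offsets.append(position)
        # Advance by the raw length of the line, newline included.
        position += len(line)
    return offsets, position

sample = b"@ first block\npayload\n@ second block\npayload\n"
offsets, total = index_marked_lines(io.BytesIO(sample))
print(offsets, total)  # [0, 22] 45
```

Because the counter is advanced only after the check, each recorded offset points at the first byte of the marked line, which is the value a later seek() would need.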

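An alternative is to use the io library; this is the framework Python 3 uses to handle files. Its file.tell() method takes the buffer into account and produces accurate file positions even while iterating.

Bear in mind that when you open a file in text mode with io.open(), you get unicode strings. In Python 2, if you must have str bytestrings, you can simply use binary mode (open with 'rb'). In fact, only in binary mode do you get access to IOBase.tell() during iteration; in text mode an exception is raised: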

>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'@ this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
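The same restriction can be reproduced on Python 3, where the built-in open() is io.open() and IOError is an alias of OSError. The sketch below avoids touching the disk by wrapping an in-memory byte stream; the function name tell_after_next is hypothetical and exists only for this example.

```python
import io

def tell_after_next(raw_bytes):
    """Call tell() on a text-mode wrapper immediately after next().

    Returns the position if tell() succeeds, otherwise the error message.
    """
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="ascii")
    next(text)  # iteration switches off position tracking
    try:
        return text.tell()
    except OSError as exc:
        return str(exc)

result = tell_after_next(b"@ this is the 1th line\n@ this is the 2th line\n")
print(result)
```

On CPython 3 this is expected to print the same "telling position disabled by next() call" message as the Python 2 traceback above.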

In binary mode, you get accurate output from file.tell():

>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> count = 0
>>> for line in fa:
...     if line.startswith("@"):
...         print line ,
...         print "tell {} count {}".format(fa.tell(), count)
...     else:
...         if count < 32775:
...             print line,
...             print "tell {} count {}".format(fa.tell(), count)
...     count += 1
...
@ this is the 1th line
tell 23 count 0
@ this is the 2th line
tell 46 count 1
@ this is the 3th line
tell 69 count 2
@ this is the 4th line
tell 92 count 3
@ this is the 5th line
tell 115 count 4
@ this is the 6th line
tell 138 count 5
@ this is the 7th line
tell 161 count 6
@ this is the 8th line
tell 184 count 7
@ this is the 9th line
tell 207 count 8
@ this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19
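To connect this to the stated goal of starting work from known positions in a file, here is a minimal Python 3 sketch that uses tell() in binary mode to index the @ lines and then jumps straight to one of them with seek(). The helper names block_offsets and read_line_at are assumptions for illustration, and the temporary file only recreates the toy data from the question, including the leading space on lines 11 to 20.

```python
import os
import tempfile

def block_offsets(path, marker=b"@"):
    """Collect the start offset of every line that begins with marker."""
    offsets = []
    with open(path, "rb") as handle:
        start = handle.tell()
        for line in handle:
            if line.startswith(marker):
                offsets.append(start)
            # In binary mode tell() accounts for the read-ahead buffer.
            start = handle.tell()
    return offsets

def read_line_at(path, offset):
    """Seek to a previously recorded offset and read one line."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.readline()

# Recreate the toy file from the question in a temporary location.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(1, 11):
        tmp.write(b"@ this is the %dth line\n" % i)
    for i in range(11, 21):
        tmp.write(b" this is the %dth line\n" % i)
    toy_path = tmp.name

offsets = block_offsets(toy_path)
third = read_line_at(toy_path, offsets[2])
size = os.path.getsize(toy_path)
os.remove(toy_path)
print(offsets[:3], third, size)
```

Each worker in a parallel setup could be handed one of these offsets and open the file independently. How the work is distributed is outside the scope of this sketch.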

Regarding "python - Why does file_object.tell() give the same byte position for different locations in a file?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43987187/
