
python - Is the redirected output of a subprocess call getting lost?


I have some Python code that looks roughly like this, using some libraries you may or may not have:

import subprocess

# Open it for writing
vcf_file = open(local_filename, "w")

# Download the region to the file.
subprocess.check_call(["bcftools", "view",
                       options.truth_url.format(sample_name), "-r",
                       "{}:{}-{}".format(ref_name, ref_start, ref_end)],
                      stdout=vcf_file)

# Close the parent process's copy of the file object
vcf_file.close()

# Upload it
file_id = job.fileStore.writeGlobalFile(local_filename)

Basically, I'm launching a child process that is supposed to download some data for me and print it to standard output. I redirect that data to a file, and then, once the subprocess call has returned, I close my handle to the file and copy the file somewhere else.

What I observe is that, sometimes, the tail end of the data I'm expecting doesn't make it into the copy. Now, bcftools may just occasionally fail to write that data, but I'm worried that I might be doing something unsafe and somehow accessing the file after subprocess.check_call() has returned but before the data the child process wrote to standard output has made it onto the disk where I can see it.

Looking at the C++ standard (since bcftools is implemented in C/C++), it appears that when a program exits normally, all open streams (including standard output) are flushed and closed. See the section [lib.support.start.term] here, which describes the behaviour of exit(), called implicitly when main() returns:

--Next, all open C streams (as mediated by the function signatures declared in <cstdio>) with unwritten buffered data are flushed, all open C streams are closed, and all files created by calling tmpfile() are removed.30)

--Finally, control is returned to the host environment. If status is zero or EXIT_SUCCESS, an implementation-defined form of the status successful termination is returned. If status is EXIT_FAILURE, an implementation-defined form of the status unsuccessful termination is returned. Otherwise the status returned is implementation-defined.31)

So before the child process exits, it closes (and therefore flushes) its standard output.
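This flush-on-normal-exit behaviour is easy to see with a small self-contained sketch (not from the original question; it uses a throwaway Python child in place of bcftools, whose output buffering behaves analogously to C stdio here): the child fills a block-buffered stdout with more data than one stdio buffer holds, exits normally, and all of it shows up in the redirected file.

import os
import subprocess
import tempfile

# The child writes 100,000 bytes to its block-buffered stdout and exits
# normally; the implicit flush on exit means none of it is left behind in
# the child's output buffer.
with tempfile.NamedTemporaryFile("w+", delete=False) as out:
    subprocess.check_call(
        ["python3", "-c", "import sys; sys.stdout.write('x' * 100000)"],
        stdout=out)

with open(out.name) as f:
    print(len(f.read()))   # 100000
os.remove(out.name)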

However, the manual page for close(2) on Linux notes that closing a file descriptor does not necessarily guarantee that any data written to it has actually made it to disk:

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.)

So it looks like, when a process exits, its standard output stream gets flushed, but if that stream is actually backed by a file descriptor pointing at a file on disk, there is no guarantee that the write to disk has completed. I suspect that may be what is happening here.

So, my actual questions:

  1. Is my reading of the specs correct? Can a child process appear to its parent to have terminated before its redirected standard output is available on disk?

  2. Is it possible to somehow wait until all data written by the child process to files has actually been synced to disk by the OS?

  3. Should I be calling flush() or some Python version of fsync() on the parent process's copy of the file object? Can that force writes to the same file descriptor by child processes to be committed to disk?

Best answer

Yes, it may take minutes before the data is (physically) written to disk. But you can read it back long before that.

Unless you are worried about a power failure or a kernel crash, it does not matter whether the data is on disk. The important part is whether the kernel considers the data written.

It is safe to read from the file as soon as check_call() has returned. If you don't see all the data, it may indicate a bug in bcftools, or that writeGlobalFile() does not upload all the data in the file. You could try to work around the former by disabling block buffering for bcftools' stdout (provide a pseudo-tty, use the unbuffer command-line utility, etc.), as sketched below.
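A sketch of that workaround (assuming the unbuffer utility from the expect package is installed; the file name and bcftools arguments here are placeholders, not the ones from the question):

import subprocess

# unbuffer runs bcftools with a pseudo-tty on stdout, so its stdio layer
# drops block buffering and relays output as it is produced; unbuffer's own
# stdout is what actually gets redirected into the file.
with open("out.vcf", "w") as vcf_file:
    subprocess.check_call(
        ["unbuffer", "bcftools", "view", "input.vcf.gz",
         "-r", "chr1:1000-2000"],
        stdout=vcf_file)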

Q: Is my reading of the specs correct? Can a child process appear to its parent to have terminated before its redirected standard output is available on disk?

Yes. And yes.

Q: Is it possible to somehow wait until all data written by the child process to files has actually been synced to disk by the OS?

No. fsync() is not enough in the general case, and you probably don't need it anyway (reading the data back is a different issue from making sure it has been written to disk).

Q: Should I be calling flush() or some Python version of fsync() on the parent process's copy of the file object? Can that force writes to the same file descriptor by child processes to be committed to disk?

It would be pointless. .flush() flushes a buffer that lives inside the parent process (you can use open(filename, 'wb', 0) to avoid creating an unnecessary buffer in the parent in the first place).

fsync() works on a file descriptor (the child has its own file descriptor). I don't know whether the kernel uses different buffers for different file descriptors referring to the same file on disk. Again, it doesn't matter: if you observe data loss (without a crash), fsync() won't help here.
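For completeness, a sketch of what the flush()/fsync() combination from question 3 would look like (file names are placeholders). As argued above, it should not change what the parent can already read; it only asks the kernel to push its cached pages for the file out to the storage device.

import os
import subprocess

# buffering=0: the parent-side file object has no buffer of its own to flush.
with open("out.vcf", "wb", 0) as vcf_file:
    subprocess.check_call(["bcftools", "view", "input.vcf.gz"],
                          stdout=vcf_file)
    vcf_file.flush()             # effectively a no-op: the parent never wrote through this object
    os.fsync(vcf_file.fileno())  # ask the kernel to commit the file's cached pages to disk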

Q: Just to be clear, I see that you're asserting that the data should indeed be readable by other processes, because the relevant OS buffers are shared between processes. But what's your source for that assertion? Is there a place in a spec or the Linux documentation you can point to that guarantees that those buffers are shared?

Look for "After a write() to a regular file has successfully returned":

Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

Regarding python - Is the redirected output of a subprocess call getting lost?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34623639/
