
python - Extracting headers from a WARC.gz file

Reposted. Author: 行者123. Updated: 2023-11-28 18:44:41

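I have searched the site many times but could not find what I actually need. I have a web.warc.gz file containing data, and I need to extract the WARC headers. I have installed Tomcat and Wayback (1.6) and tried to export the headers with the ./warc-header script that ships with Wayback, but I keep getting error messages about the format I am using: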

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
~/Desktop/output.csv type \r\n
USAGE: tgtWarc fieldsSrc id
tgtWarc is the path to the target WARC.gz
fieldsSrc is the path to the text of the record
make sure each line is terminated by \r\n
and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
of the header record... header...

Or a different kind of error:

   Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
~/Desktop/output.csv Content-Type
java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:

at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)

I am fairly sure the problem is the format of what I typed on the command line, but I still cannot get it right. Can anyone help?

Best answer
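
One possible approach, not taken from the original thread: judging by the usage text and the stack trace above (WARCHeader.writeHeaderRecord), the warc-header script appears to write a header record into a target WARC from a fields file rather than read headers out of one, which would explain why neither invocation works. Since the question is tagged python, the headers can instead be read with the standard library alone. The sketch below assumes a well-formed WARC file in which every record carries a Content-Length field and no header line is folded across several lines. The function names iter_warc_headers and headers_to_csv and the "WARC-Version" key are invented for illustration.

```python
import csv
import gzip


def iter_warc_headers(stream):
    """Yield one dict of WARC named fields per record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        # "WARC-Version" is a made-up key for the version line (e.g. WARC/1.0).
        headers = {"WARC-Version": line.strip().decode("utf-8", "replace")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so payload lines are never parsed as headers.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def headers_to_csv(warc_path, csv_path, fields):
    """Write the chosen header fields of every record to a CSV file."""
    # gzip.open reads concatenated gzip members, which is how .warc.gz
    # files are normally laid out (one member per record).
    with gzip.open(warc_path, "rb") as warc, \
            open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for headers in iter_warc_headers(warc):
            writer.writerow(headers)  # missing fields are left empty
```

With the paths from the question, a call might look like headers_to_csv(os.path.expanduser("~/Desktop/WEB.WARC.gz"), os.path.expanduser("~/Desktop/output.csv"), ["WARC-Type", "WARC-Target-URI", "Content-Type"]), after importing os. For archives that do not meet the assumptions above, a dedicated WARC parsing library is likely the safer choice.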

Regarding "python - Extracting headers from a WARC.gz file", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/21922726/
