
java - Apache Nutch skipping URLs and truncating

Reposted. Author: 行者123. Updated: 2023-12-02 01:56:16

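In my nutch-site.xml I added the following property to stop content from being truncated; however, the fetch step now fails with the error below. I want truncation to stop so that I get the results I need, and I assumed a value of -1 would achieve that. I am using version 2.2.1. Any ideas?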

<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>

Exception in thread "main" java.lang.RuntimeException: job failed:name=fetch, jobid=job_local1185573074_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:194)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307)

Best answer

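I solved this by removing the http.content.limit section from nutch-site.xml and adding parser.skip.truncated, set to false, so that truncated documents are still parsed instead of being skipped.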

<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated documents. By default this
property is activated due to extremely high levels of CPU which parsing can sometimes take.
</description>
</property>
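
To confirm that the edited file really contains the change described above (no http.content.limit override, parser.skip.truncated set to false), a small stand-alone check can help. The following is only a sketch: the class name NutchSiteCheck and the default path conf/nutch-site.xml are assumptions made for illustration, and it reads the site file on its own rather than merging it with nutch-default.xml the way Nutch does at runtime.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: lists the overrides in a site config file.
public class NutchSiteCheck {

    /** Reads every <property> name/value pair from a Hadoop-style configuration file. */
    public static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList nodes = doc.getElementsByTagName("property");
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed location; pass the real path as the first argument if it differs.
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            Map<String, String> props = readProperties(in);
            System.out.println("http.content.limit    = "
                    + props.getOrDefault("http.content.limit", "(not overridden)"));
            System.out.println("parser.skip.truncated = "
                    + props.getOrDefault("parser.skip.truncated", "(not overridden)"));
        }
    }
}
```

After applying the fix from the answer, the first line should report that http.content.limit is not overridden and the second should report false. Only the JDK's built-in XML parser is used, so no Nutch or Hadoop jars are needed on the classpath.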

Regarding "java - Apache Nutch skipping URLs and truncating", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57397243/
