
java - Solr indexing fails after Nutch crawl, reporting "Indexer: java.io.IOException: Job failed!"


I have integrated Nutch 1.13 with Solr 6.5.1 on an EC2 instance. I copied schema.xml into Solr using the cp command below. I also specified localhost as elastic.host in nutch-site.xml under the nutch_home/conf folder.

cp /usr/local/apache-nutch-1.13/conf/schema.xml /usr/local/apache-nutch-1.13/solr-6.5.1/server/solr/nutch/conf/
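
After copying the schema, Solr usually has to reload the core (or be restarted) before it picks the file up. A minimal sketch using Solr's Core Admin API, assuming the core is named nutch, Solr is listening on the default port, and the core is configured with solr.ClassicIndexSchemaFactory so that schema.xml (rather than the managed schema) is actually read:

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nutch"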

Also, since Solr 6 creates a managed-schema each time, everything before indexing works fine. The command I tried is:

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/#/nutch/ urls/ crawl 1

Everything looks fine until the indexing step while running the above command. I am completely stuck at this last step.

Error running: /usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=://35.160.82.191:8983/solr/#/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519074733 Failed with exit value 255.

Thanks in advance.

UPDATE: I changed the following property in conf/nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Now there are no errors, but I am getting the following:

Deduplication finished at 2017-05-19 10:08:05, elapsed: 00:00:03
Indexing 20170519100420 to index
/usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=//35.160.82.191:8983/solr/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519100420
Segment dir is complete: crawl/segments/20170519100420.
Indexer: starting at 2017-05-19 10:08:06
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 44 indexed (add/update)
Indexer: finished at 2017-05-19 10:08:10, elapsed: 00:00:03
Cleaning up index if possible
/usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=//35.160.82.191:8983/solr/nutch/ crawl/crawldb
Fri May 19 10:08:13 UTC 2017 : Finished loop with 1 iterations

UPDATE 2: I found that adding the Solr indexer plugin to plugin.includes in nutch-site.xml helps, as described in this post. But now the error appears in the cleaning step:

Error running: /usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=://35.160.82.191:8983/solr/nutch/ crawl/crawldb Failed with exit value 255.
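
For reference, this is roughly what the updated plugin.includes looks like with the Solr indexer plugin (indexer-solr) added. The surrounding plugin list is an assumption based on the Nutch 1.13 defaults, so compare it against your own conf/nutch-default.xml:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>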

Any suggestions? I want to implement a search engine using Solr.

UPDATE 3

Now there are no errors at all. But for some reason fetching is not working properly: only the URLs specified in urls/seed.txt are fetched and crawled. No outlinks are followed by Nutch.

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 5
Injecting seed URLs
/usr/local/apache-nutch-1.13/bin/nutch inject crawl/crawldb urls/
Injector: starting at 2017-05-19 12:27:19
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-05-19 12:27:21, elapsed: 00:00:02
Fri May 19 12:27:21 UTC 2017 : Iteration 1 of 5
Generating a new segment
/usr/local/apache-nutch-1.13/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2017-05-19 12:27:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
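
The Generator reporting "0 records selected for fetching" means nothing in the existing crawl/crawldb is currently due for fetching. One way to inspect the crawl DB (a suggestion of mine, not part of the accepted answer) is Nutch's readdb tool, which prints URL status counts and score statistics:

bin/nutch readdb crawl/crawldb -stats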

I want to use the Nutch data for web search results in Solr.

FINAL UPDATE

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=://35.160.82.191:8983/solr/nutch/ urls/ crawl  1 

Best Answer

nutch-site.xml does not need to be copied to Solr; you only need to copy the schema.xml file to specify the schema that the data coming from Nutch requires. If you are using Solr rather than ES, the elastic.host parameter is not needed.

Check the logs/hadoop.log file to see if it has more detail about the exception, and of course check the logs on the Solr side. This error usually means something is wrong with the Solr configuration: missing fields, etc. In this case, since you did not copy the schema.xml and Nutch does not take advantage of the managed schema on Solr 6, Solr is bound to complain about missing fields.

Also, your Solr URL containing the # character does not look right. That is how the Solr admin UI displays data in the browser, but to use it from Nutch/the terminal it should be /solr/nutch.
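
Concretely, dropping the # fragment means invoking the crawl with a URL like the following (a sketch only; the host and core name nutch are taken from the question):

bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 1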

By the way, check the tutorial; even though some paths have changed in recent Solr versions, it is still a good guide to how the integration works.

Regarding java - Solr indexing fails after Nutch crawl, reporting "Indexer: java.io.IOException: Job failed!", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44064781/
