
java - Can't get Apache Nutch to crawl - permissions and JAVA_HOME suspected

Reposted. Author: 行者123. Updated: 2023-12-01 14:23:09

I am trying to run a basic crawl, following the NutchTutorial:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

I have installed Nutch and set it up with Solr. In my .bashrc I set $JAVA_HOME to /usr/lib/jvm/java-1.6.0-openjdk-amd64.

When I run bin/nutch from the Nutch home directory I see no problems, but when I try to run the crawl as shown above I get the following error:

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/share/nutch/logs/hadoop.log (Permission denied)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:207)
at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:270)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:281)
at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2013-06-28 16:24:53
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:296)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I suspect this is related to file permissions, since I have to use sudo for almost everything on this server. But if I run the same crawl command with sudo, I get:

Error: JAVA_HOME is not set.

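So I feel like I am in a Catch-22. Should I be able to run this command with sudo? Is there something else I need to do so that it works without sudo? Or is something else entirely going on here?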

Best answer

As a regular user, you do not appear to have permission to write to /usr/share/nutch/logs/hadoop.log, which makes sense as a security measure.
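One way to confirm this diagnosis before changing anything is a small preflight check. The following is only a sketch: `LOG_DIR` and `check_env` are hypothetical names chosen for illustration (they are not part of Nutch), the default path is the log directory from the error message above, and testing the directory rather than hadoop.log itself is an approximation.

```shell
#!/bin/sh
# Sketch only: LOG_DIR and check_env are hypothetical names, not part of Nutch.
# The default path is the log directory from the error message in the question.
LOG_DIR="${LOG_DIR:-/usr/share/nutch/logs}"

check_env() {
    status=0
    # The sudo run in the question failed because JAVA_HOME was not set.
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        status=1
    fi
    # log4j needs write access here to open hadoop.log (directory check is an approximation).
    if [ ! -w "$LOG_DIR" ]; then
        echo "no write access to $LOG_DIR"
        status=1
    fi
    return "$status"
}

if check_env; then
    echo "environment looks OK"
fi
```

Running it once as your normal user and once under sudo should show which of the two conditions fails in each case.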

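To work around this, create a simple shell script that sets JAVA_HOME itself, since sudo typically does not pass the variables from your .bashrc through to the command it runs: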

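#!/bin/sh
# Set JAVA_HOME inside the script so it is defined when the script runs under sudo.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
# Run from the Nutch home directory so the relative bin/nutch path resolves.
bin/nutch crawl urls -dir crawl -depth 3 -topN 5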

Save it as nutch.sh, then run it with sudo:

sudo sh nutch.sh
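To see why the wrapper script helps, the sketch below imitates the effect without needing sudo or Nutch. It assumes that `env -i` is a fair stand-in for the environment reset that sudo usually performs. The JDK path is the one from the question, and the variable names `JDK`, `without`, and `with` are illustrative only.

```shell
#!/bin/sh
# Sketch only: `env -i` stands in for the environment reset that sudo usually performs.
JDK=/usr/lib/jvm/java-1.6.0-openjdk-amd64   # path taken from the question

# With a cleared environment, the child process does not see JAVA_HOME.
without=$(env -i /bin/sh -c 'echo "${JAVA_HOME:-unset}"')

# Passing the variable explicitly on the command line makes it visible again.
with=$(env -i JAVA_HOME="$JDK" /bin/sh -c 'echo "${JAVA_HOME:-unset}"')

echo "cleared environment: $without"
echo "explicit JAVA_HOME:  $with"
```

By the same logic, running `sudo env JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64 bin/nutch crawl urls -dir crawl -depth 3 -topN 5` is a possible one-line alternative to the wrapper script, although whether it is appropriate depends on how sudo is configured on the server.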

This question, "java - Can't get Apache Nutch to crawl - permissions and JAVA_HOME suspected", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/17374028/
