
Nutch crawl finishes without errors, but where are the results?

Reposted · Author: 行者123 · Updated: 2023-12-04 05:02:18

I am trying to crawl some URLs with Nutch 2.1, as shown below.

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

http://wiki.apache.org/nutch/NutchTutorial

There are no errors, but the folders listed below are never created:
crawl/crawldb
crawl/linkdb
crawl/segments

Can anyone help me?
I have been stuck on this for two days.
Thanks very much!

The output is as follows:
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=9
-finishing thread FetcherThread0, activeThreads=8
-finishing thread FetcherThread1, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all

runtime/local/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.robots.agents</name>
    <value>My Nutch Spider</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should
    put the value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>262144</value>
  </property>
</configuration>

runtime/local/conf/regex-urlfilter.txt
# accept anything else
+.

runtime/local/urls/seed.txt
http://nutch.apache.org/

Best answer

Since you are using Nutch 2.x, you need to follow the corresponding tutorial; the one you linked is for Nutch 1.x. Nutch 2.x stores crawl data in an external backend such as HBase or Cassandra, so directories like crawl/crawldb, crawl/linkdb, and crawl/segments are never created.
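Because the data ends up in HBase rather than under crawl/, the results have to be inspected through the storage backend or through Nutch's readdb tool. A sketch, assuming the HBase backend configured above is running and the web table has its default name (table names vary with the crawl ID, so check `list` first; the dump directory `./webpage_dump` is an arbitrary choice):

```shell
# Run from the runtime/local directory of the Nutch 2.x build.
cd runtime/local

# List the tables Gora created in HBase; with the HBaseStore backend
# you would expect to see a web table such as 'webpage'.
echo "list" | hbase shell

# Dump the web table to a local directory so its rows can be inspected
# (readdb in Nutch 2.x reads the storage backend, not a crawldb on disk).
bin/nutch readdb -dump ./webpage_dump
head ./webpage_dump/part-r-00000
```

If `list` shows no tables at all, the crawl never wrote anything to HBase, which matches the "total 0 records" lines in the log above and points at the injection step rather than fetching.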

Also, use the bin/crawl script rather than the bin/nutch crawl command.
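The bin/crawl script wraps the per-round inject/generate/fetch/parse/updatedb steps that bin/nutch exposes individually. A minimal invocation might look like the sketch below; the crawl ID `TestCrawl` and the round count are placeholders, and the exact argument list differs between 2.x releases, so run `bin/crawl` with no arguments to see the usage for your version:

```shell
# Run from the runtime/local directory of the Nutch 2.x build.
cd runtime/local

# Typical 2.x shape: bin/crawl <seedDir> <crawlID> <numberOfRounds>
# 'urls' is the seed directory containing seed.txt, 'TestCrawl' is a
# hypothetical crawl ID, and 2 is the number of crawl rounds.
bin/crawl urls TestCrawl 2
```

With a crawl ID, the results land in an HBase table named after it (e.g. `TestCrawl_webpage`), not in a crawl/ directory.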

Regarding "Nutch crawl finishes without errors, but where are the results?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15995457/
