
java.lang.NullPointerException (Nutch 2.2.1 with MySQL as the data store)

Reposted. Author: 行者123. Updated: 2023-11-30 00:48:12

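I am new to this field. I started with this tutorial: http://nlp.solutions.asia/?p=362#more-362 . My first crawl, of nutch.apache.org, succeeded, but when I tried a different URL, the following exception appeared in my hadoop.log.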

java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

Here is my nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>

<property>
<name>http.robots.agents</name>
<value>Maria</value>
<description> ....
</description>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

</configuration>

Here is regex-urlfilter.txt:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

+^http://([a-z0-9]*\.)* nutch.apache.org/

#
-.
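
The first-match rule described in the comments of this file can be sketched in Java roughly as follows. This is an illustration only, not Nutch's own filter implementation: the class name UrlFilterSketch is hypothetical, only a subset of the rules above is copied, substring matching via find() is an assumption, and the space inside the nutch.apache.org rule is assumed to be a paste artifact and is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of first-match URL filtering; not Nutch code.
public class UrlFilterSketch {

    // One filter line: a leading '+' accepts, a leading '-' rejects.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Subset of the rules from the file above, in the same order.
    private static final List<Rule> RULES = new ArrayList<>();
    static {
        RULES.add(new Rule("-^(file|ftp|mailto):"));
        RULES.add(new Rule("-[?*!@=]"));
        // Assumption: the space after ")*" in the original line is dropped.
        RULES.add(new Rule("+^http://([a-z0-9]*\\.)*nutch.apache.org/"));
        RULES.add(new Rule("-."));
    }

    // The first rule whose pattern is found in the URL decides the outcome;
    // a URL that matches no rule is ignored.
    public static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://nutch.apache.org/"));
        System.out.println(accepts("http://example.com/"));
    }
}
```

Under these rules, any URL outside nutch.apache.org falls through to the final "-." line and is ignored, so a seed list for a different site would need its own "+" rule before that line.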

I would appreciate any suggestions for solving this problem.

Best answer

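I have never used Nutch, but this seems to be a common error. An NPE raised in <init> means that the Utf8 instance failed while it was being constructed, which usually indicates that the constructor was handed a null value; according to the trace, the call came from GeneratorReducer.setup.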
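
To illustrate the kind of failure the trace points at, here is a minimal sketch. It does not use Avro or Nutch: Utf8Like, the class name NullBatchIdSketch, and the key example.batch.id are hypothetical stand-ins, and the idea that a missing configuration value is what reaches the constructor is an assumption drawn from the trace.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Avro or Nutch code.
public class NullBatchIdSketch {

    // Stand-in for a string wrapper whose constructor dereferences its argument.
    static final class Utf8Like {
        private final byte[] bytes;

        Utf8Like(String value) {
            // Throws NullPointerException when value is null.
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
        }

        int length() {
            return bytes.length;
        }
    }

    // Wraps the configured value and reports whether construction succeeded.
    public static String wrap(Map<String, String> conf, String key) {
        try {
            Utf8Like wrapped = new Utf8Like(conf.get(key));
            return "ok:" + wrapped.length();
        } catch (NullPointerException e) {
            return "npe";
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Key was never set, so conf.get returns null and construction fails.
        System.out.println(wrap(conf, "example.batch.id"));
        // Once a value is present, construction succeeds.
        conf.put("example.batch.id", "1389900000-12345");
        System.out.println(wrap(conf, "example.batch.id"));
    }
}
```

If the real job behaves the same way, the fix is to make sure the value is set before the reducer starts rather than to catch the exception.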

The reason is that the "crawl" command has been deprecated in Nutch 2 and replaced by the script located at "bin/crawl".

Just copy the file $NUTCH_HOME/src/bin/crawl into the deploy directory, $NUTCH_HOME/runtime/deploy/bin, and then run the crawl command as described here:

http://wiki.apache.org/nutch/NutchTutorial#A3.1_Using_the_Crawl_Command

Hope this helps.

Regarding java.lang.NullPointerException (Nutch 2.2.1 with MySQL as the data store), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21198202/
