
java - Apache Nutch 2.3.1 Fetcher throws an invalid URI exception

Reposted. Author: 行者123. Last updated: 2023-12-02 02:15:16

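I have configured Apache Nutch 2.3.1 with the Hadoop ecosystem, and I need to fetch some websites written in Arabic script. While fetching, Nutch throws an exception for a few URLs. Here is one example: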

java.lang.IllegalArgumentException: Invalid uri 'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html': escaped absolute path not valid
at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:77)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:173)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:564)

Best answer

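I was able to reproduce this issue on the 1.x branch as well. The problem is that the Java URI class used internally by the Apache HTTP client library does not support unescaped UTF-8 characters: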

From the JavaDoc of java.net.URI:

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:

  • alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
  • digit The US-ASCII decimal digit characters, '0' through '9'
  • alphanum All alpha and digit characters
  • unreserved All alphanum characters together with those in the string "_-!.~'()*"
  • punct The characters in the string ",;:$&+="
  • reserved All punct characters together with those in the string "?/[]@"
  • escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
  • other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.

A properly escaped URL would look more like this:

http://agahi.safirak.com/ads/850/%D9%BE%DB%8C%DA%86-%D8%A8%D9%86%D8%AF-%D8%A8%D8%A7%D8%AF%DB%8C-%D9%87%D9%81%D8%AA%DB%8C%D8%B1%DB%8C-1800-%D8%AF%D9%88%D8%B1-%D8%A8%D8%A7%D8%AF%DB%8C-%D8%AC%DB%8C%D8%B3%D9%88%D9%86.html
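As a rough illustration of what that escaping involves, here is a minimal sketch that percent-encodes every non-ASCII character of a URL as its UTF-8 bytes. The class NonAsciiUrlEscaper and the method escapeNonAscii are hypothetical names made up for this example; they are not part of Nutch or of the Apache HTTP client. The sketch assumes the ASCII portion of the URL is already valid, so it leaves ASCII characters (including any existing % escapes) untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch or commons-httpclient.
class NonAsciiUrlEscaper {

    // Percent-encodes each non-ASCII code point as its UTF-8 byte sequence.
    // ASCII characters are copied unchanged (assumption: they are already valid
    // for their position in the URL).
    static String escapeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length() * 2);
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    // Mask to an unsigned value so the byte prints as two hex digits.
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html";
        System.out.println(escapeNonAscii(raw));
    }
}
```

Such a helper would have to be applied to the URL before it reaches the HTTP client. Where that would best happen inside Nutch is not covered by this answer, so treat the sketch as a standalone demonstration of the encoding rather than a fix.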

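In fact, if you open the example URL in Chrome and then copy the URL from the address bar, you get the escaped representation. Feel free to open an issue for this (otherwise I will). In the meantime, you can try the protocol-http plugin, which does not use the Apache HTTP client. I have tested it locally, and the parse checker works fine: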

➜  local (master) ✗ bin/nutch parsechecker "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
parsing: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
contentType: text/html
signature: 048b390ab07464f5d61ae09646253529
---------
Url
---------------

http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: پیچ بند بادی هفتیری 1800 دور بادی جیسون-نیازمندی سفیرک
Outlinks: 76
outlink: toUrl: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html anchor:
outlink: toUrl: http://agahi.safirak.com/assets/fonts/font-awesome/css/font-awesome.min.css anchor:
outlink: toUrl: http://agahi.safirak.com/assets/css/bootstrap.css anchor:
...

For more on "java - Apache Nutch 2.3.1 Fetcher throws an invalid URI exception", see the similar question on Stack Overflow: https://stackoverflow.com/questions/49379007/
