python - How to prevent the lxml error 'failed to load external entity'

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 00:59:47

I am having some trouble with lxml.html.parse().

Here is my code (abbreviated):

import lxml.html

class Scraper:

    def fetch(self, url):
        tree = None

        try:
            parser = lxml.html.HTMLParser(encoding='utf8')
            tree = lxml.html.parse(url, parser)
        except IOError as e:
            print('ERROR LOADING PAGE: ' + str(e))

        return tree

It works most of the time, but sometimes I get a lot of errors like these:

ERROR LOADING PAGE: Error reading file 'b'http://twitter.com/wordpressdotcom'': b'failed to load external entity "http://twitter.com/wordpressdotcom"'

ERROR LOADING PAGE: Error reading file 'b'http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible'': b'failed to load HTTP resource'

ERROR LOADING PAGE: Error reading file 'b'http://plugins.trac.wordpress.org/changeset/559098'': b'failed to load external entity "http://plugins.trac.wordpress.org/changeset/559098"'

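I have looked at other questions and answers here, but they only suggest using urllib, and that did not really help when I tried it.

What I want is to disable the loading of "external entities", whatever that means exactly. I just need the HTML of the given URL.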

Best answer

When I sniffed the traffic with Wireshark, this is what I saw:

http://twitter.com/wordpressdotcom:

GET /wordpressdotcom HTTP/1.0
Host: twitter.com
Accept-Encoding: gzip

HTTP/1.0 301 Moved Permanently
content-length: 0
date: Sat, 01 Feb 2014 12:08:01 UTC
location: https://twitter.com/wordpressdotcom
server: tfe
set-cookie: guest_id=v1%3A139125648190241848; Domain=.twitter.com; Path=/; Expires=Mon, 01-Feb-2016 12:08:01 UTC

http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible

GET /gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible HTTP/1.0
Host: www.amazon.com
Accept-Encoding: gzip

HTTP/1.1 503 Service Unavailable
Date: Sat, 01 Feb 2014 12:10:49 GMT
Server: Server
Last-Modified: Fri, 30 Nov 2012 01:26:22 GMT
ETag: "3dd-4cfac498acb80-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 599
Connection: close
Content-Type: text/html

[gzip-compressed response body, 599 bytes, not shown]

For http://plugins.trac.wordpress.org/changeset/559098:

GET /changeset/559098 HTTP/1.0
Host: plugins.trac.wordpress.org
Accept-Encoding: gzip

HTTP/1.1 302 Found
Date: Sat, 01 Feb 2014 12:13:06 GMT
Server: Apache
Location: https://plugins.trac.wordpress.org/changeset/559098
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 242
Connection: close
Content-Type: text/html; charset=iso-8859-1

[gzip-compressed response body, 242 bytes, not shown]

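lxml apparently does not follow redirects (both redirects above point to the https:// version of the same URL), and in the Amazon case you probably need to send a real "User-Agent" header.

You should use another library, such as requests or urllib(2), to download the page content, and then feed that HTML to lxml.html.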
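The advice above can be sketched roughly as follows, using urllib.request (the Python 3 standard-library successor to urllib2) for the download and lxml only for parsing. This is a minimal sketch, not a definitive implementation: the helper names download, parse_html and fetch, the USER_AGENT string, and the 10-second timeout are all assumptions made for illustration and do not come from the original question or answer.

```python
import io
import urllib.request

import lxml.html

# Hypothetical User-Agent value (an assumption); some sites reject
# requests that do not look like they come from a browser.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"


def download(url, timeout=10):
    """Return the raw response body for url as bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    # urlopen follows 301/302 redirects by default, including
    # redirects from http:// to https://.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_html(content):
    """Parse already-downloaded HTML bytes into an lxml ElementTree."""
    parser = lxml.html.HTMLParser(encoding="utf-8")
    # Passing a file-like object means lxml never touches the network.
    return lxml.html.parse(io.BytesIO(content), parser)


def fetch(url):
    """Download and parse url; return None if the download fails."""
    try:
        return parse_html(download(url))
    except OSError as e:  # urllib.error.URLError is a subclass of OSError
        print("ERROR LOADING PAGE: " + str(e))
        return None
```

Keeping the download and the parsing in separate functions means lxml only ever sees bytes that have already been fetched, so redirects, HTTPS and request headers are handled entirely by the HTTP library. With this split, fetch('http://twitter.com/wordpressdotcom') would let urllib follow the redirect before lxml gets involved. The requests library could be swapped in for the download step in the same way.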

Regarding "python - How to prevent the lxml error 'failed to load external entity'", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21496857/
