
python - Can I supply a URL to lxml.etree.parse on Python 3?

Reposted. Author: 太空狗. Updated: 2023-10-29 22:22:19

The documentation says I can:

lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz).

(from http://lxml.de/parsing.html, under "Parsers")

But a quick experiment suggests otherwise:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> from urllib.request import urlopen
>>> with urlopen('https://pypi.python.org/simple') as f:
...     tree = etree.parse(f, parser)
...
>>> tree2 = etree.parse('https://pypi.python.org/simple', parser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015)
OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple"
>>>

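I can use the urlopen approach, but the documentation seems to imply that passing a URL directly is somehow better. Also, if the documentation is inaccurate, I am a little worried about relying on lxml, especially once I need to do anything more complex.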

What is the correct way to parse HTML from a known URL with lxml? And where should I look to find out what is actually documented?

Update: I get the same error if I use the http URL instead of the https URL.

Best answer

The problem is that lxml does not support HTTPS URLs, and http://pypi.python.org/simple redirects to the HTTPS version.

So for any secure site, you need to read the URL yourself:

from lxml import etree
from urllib.request import urlopen

parser = etree.HTMLParser()

# Let urllib handle the HTTPS connection, then pass the open response to lxml.
with urlopen('https://pypi.python.org/simple') as f:
    tree = etree.parse(f, parser)
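
The same approach can be wrapped in a small reusable helper. The following is only a sketch under stated assumptions: the names `fetch` and `parse_html_url`, along with the `timeout` and `opener` parameters, are hypothetical and are not part of lxml or of the original answer. The `opener` argument exists only so the download step can be replaced (for example in tests); by default it is `urllib.request.urlopen`, which handles HTTPS and follows redirects.

```python
import io
from urllib.request import urlopen


def fetch(url, timeout=10.0, opener=urlopen):
    """Download url and return (body_bytes, final_url_after_redirects).

    Hypothetical helper: `opener` defaults to urllib's urlopen.
    """
    with opener(url, timeout=timeout) as response:
        return response.read(), response.geturl()


def parse_html_url(url, timeout=10.0, opener=urlopen):
    """Fetch url with urllib, then hand the bytes to lxml's HTML parser."""
    from lxml import etree  # third-party; imported here so fetch() works without it

    body, final_url = fetch(url, timeout=timeout, opener=opener)
    # base_url tells lxml where the document came from (after any redirect).
    return etree.parse(io.BytesIO(body), etree.HTMLParser(), base_url=final_url)
```

With this in place, `tree = parse_html_url('https://pypi.python.org/simple')` should behave like the snippet above, assuming the site is reachable. Reading the whole body into memory first keeps the sketch simple; for very large documents, passing the open response object to `etree.parse` directly, as in the answer, avoids the extra copy.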

This question, "python - Can I supply a URL to lxml.etree.parse on Python 3?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/26163247/
