
python - lxml in Python, parsing from a URL

Reposted. Author: 太空狗. Updated: 2023-10-29 22:08:19

I am new to lxml. I want to download a web page and extract the data I am interested in from it. My code is:

import urllib2
from lxml import etree

url = "http://www.example.com/"

html = urllib2.urlopen(url)

root = etree.parse(html) # the problem is here

Can anyone explain why this is wrong?

The error is:

Traceback (most recent call last):
File "yatego.py", line 10, in <module>
root = etree.parse(html)
File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79703)
File "parser.pxi", line 1580, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:80012)
File "parser.pxi", line 1463, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:78908)
File "parser.pxi", line 1019, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:75905)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Entity 'mdash' not defined, line 4, column 21

This code:

import requests
import lxml.html

url = "http://www.example.com/"

res = requests.get(url)
doc = lxml.html.parse(res.content)

gives this error:

File "yatego.py", line 11, in <module>
doc = lxml.html.parse(res.content)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 692, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>IANA &mdash; Example domains</title>

And this code:

doc = lxml.html.parse(url)

works fine.

So where is the problem?

Best answer

The key here is the exception:

IOError: Error reading file '<!DOCTYPE html PUBLIC  ...

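You are passing the contents of the file to a function that expects a file path. lxml.html.parse() takes a filename, a URL, or a file-like object to read from, so it treats the string in res.content as a path and fails when it tries to open it. That is also why doc = lxml.html.parse(url) works: a URL "is a" file path.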
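To illustrate the distinction, here is a minimal sketch that gives parse() something it can read from by wrapping the downloaded bytes in a file-like object. The content variable is a hypothetical stand-in for res.content, so the example runs without a network connection.

```python
import io
import lxml.html

# Hypothetical stand-in for res.content: the raw bytes of a downloaded page.
content = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body><h1>Example Domain</h1></body></html>"
)

# parse() reads from a source (path, URL, or file-like object),
# so wrap the bytes in io.BytesIO instead of passing them directly.
doc = lxml.html.parse(io.BytesIO(content))  # returns an ElementTree

# getroot() gives the <html> element; the HTML parser resolves &mdash;.
title = doc.getroot().findtext(".//title")
print(title)
```

This keeps the parse() call from the question, at the cost of an extra wrapper object.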

Does the following work better?

doc = lxml.html.fromstring(res.content)
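As a rough sketch of that suggestion, the example below calls fromstring() on a hypothetical byte string standing in for res.content. It also touches on the first error in the question, which the answer above does not cover: etree.parse() uses the XML parser by default, and XML does not define HTML entities such as &mdash;. Passing an etree.HTMLParser() is one way around that, assuming the page is HTML rather than strict XML.

```python
import io
from lxml import etree
import lxml.html

# Hypothetical stand-in for res.content.
content = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body><h1>Example Domain</h1></body></html>"
)

# fromstring() takes the markup itself and returns the root element.
root = lxml.html.fromstring(content)
heading = root.findtext(".//h1")

# The default XML parser rejects the undefined HTML entity.
try:
    etree.parse(io.BytesIO(content))
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = str(exc)

# The same call succeeds when given an HTML parser explicitly.
tree = etree.parse(io.BytesIO(content), parser=etree.HTMLParser())

print(root.tag, heading)
print(xml_error)
print(tree.getroot().tag)
```

Note that fromstring() returns an element while parse() returns an ElementTree, so code that switches between them may need a getroot() call.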

A similar question about "python - lxml in Python, parsing from a URL" can be found on Stack Overflow: https://stackoverflow.com/questions/9783875/
