I am trying to use elt.itertext() (lxml v3.5.0b1) to iterate over the text content of a subtree, as follows:
import lxml.html.soupparser as soupparser
import requests
doc = requests.get("http://f10.5post.com/forums/showthread.php?t=1142017").content
tree = soupparser.fromstring(doc)
nodes = tree.getchildren()
for elt in nodes:
    for t in elt.itertext():
        print(t)
But I keep getting this error:
File "src/lxml/iterparse.pxi", line 248, in lxml.etree.iterwalk.__init__ (src/lxml/lxml.etree.c:134032)
File "src/lxml/apihelpers.pxi", line 67, in lxml.etree._rootNodeOrRaise (src/lxml/lxml.etree.c:15220)
ValueError: Input object has no element: HtmlComment
Is there a way to skip all HTML comments? Also, what exactly does this error mean?
Thanks
Best answer
This is expected behavior.
>>> from lxml import etree
>>> doc = '''
... <html><!-- PAGENAV POPUP -->
... <div class="vbmenu_popup" id="pagenav_menu" style="display:none">
... <table cellpadding="4" cellspacing="1" border="0">
... <tr>
... <td class="thead" nowrap="nowrap">Go to Page...</td>
... </tr>
... <tr>
... <td class="vbmenu_option" title="nohilite">
... <form action="index.php" method="get" onsubmit="return this.gotopage()" id="pagenav_form">
... <input type="text" class="bginput" id="pagenav_itxt" style="font-size:11px" size="4" />
... <input type="button" class="button" id="pagenav_ibtn" value="Go" />
... </form>
... </td>
... </tr>
... </table>
... </div>
... <!-- / PAGENAV POPUP -->
... </html>'''
>>> root = etree.fromstring(doc)
>>> nodes = root.getchildren()
>>> nodes
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!-- / PAGENAV POPUP -->]
>>> for elt in nodes:
...     for t in elt.itertext():
...         print(t)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "lxml.etree.pyx", line 1406, in lxml.etree._Element.itertext (src/lxml/lxml.etree.c:48845)
File "lxml.etree.pyx", line 2763, in lxml.etree.ElementTextIterator.__cinit__ (src/lxml/lxml.etree.c:64747)
File "iterparse.pxi", line 219, in lxml.etree.iterwalk.__init__ (src/lxml/lxml.etree.c:125303)
File "apihelpers.pxi", line 72, in lxml.etree._rootNodeOrRaise (src/lxml/lxml.etree.c:13689)
ValueError: Input object has no element: lxml.etree._Comment
As shown above, nodes contains:
>>> nodes
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!-- / PAGENAV POPUP -->]
Note: getchildren() is deprecated. You can use list() on the element instead.
>>> list(root)
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!-- / PAGENAV POPUP -->]
nodes is a list containing both elements and comments. The documentation for itertext() describes how it works:
Creates a text iterator. The iterator loops over this element and all subelements, in document order, and returns all inner text.
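A comment is not an element, so calling itertext() on it raises the ValueError. If, on the other hand, I iterate directly over the root element instead of over the list: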
>>> for t in root.itertext():
...     print(t)
...
I get all the text, along with a lot of whitespace. :)
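To drop the whitespace-only chunks that root.itertext() yields, one option is to strip each chunk and keep only the non-empty ones. This is a minimal sketch in Python 3 syntax; the shortened document and the name texts are illustrative assumptions, not part of the original answer.

```python
from lxml import etree

# Shortened stand-in for the document used above (an assumption for illustration).
doc = """<html><!-- PAGENAV POPUP -->
<div>
  <table><tr><td>Go to Page...</td></tr></table>
</div>
<!-- / PAGENAV POPUP -->
</html>"""

root = etree.fromstring(doc)

# Iterate from the root element and keep only chunks that contain
# something other than whitespace.
texts = [t.strip() for t in root.itertext() if t.strip()]
print(texts)
```

Calling itertext() on the root works because the root is an element. The comments are only encountered while walking the tree, never used as the starting point.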
If you still want to iterate over the list of nodes, you can determine what kind of node each item is from its tag:
>>> [item.tag for item in nodes]
[<built-in function Comment>, 'div', <built-in function Comment>]
You can also check the class of each item:
>>> [item.__class__ for item in nodes]
[<type 'lxml.etree._Comment'>, <type 'lxml.etree._Element'>, <type 'lxml.etree._Comment'>]
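Building on the tag check above, the comments can be filtered out before calling itertext(). The following is a sketch in Python 3 syntax rather than the answer's own code. It assumes that real elements carry a string tag while comments carry the Comment factory function, as the output above suggests. The shortened document and the names elements and texts are illustrative.

```python
from lxml import etree

# Shortened stand-in for the document used above (an assumption for illustration).
doc = """<html><!-- PAGENAV POPUP -->
<div><p>Go to Page...</p></div>
<!-- / PAGENAV POPUP -->
</html>"""

root = etree.fromstring(doc)

# Elements have a string tag; comments expose the Comment factory
# function as their tag, so a type check separates the two.
elements = [node for node in root if isinstance(node.tag, str)]

texts = []
for elt in elements:
    for t in elt.itertext():
        if t.strip():  # ignore whitespace-only chunks
            texts.append(t.strip())

print(texts)
```

An equivalent test is `node.tag is not etree.Comment`. The same idea should apply to trees built with lxml.html or soupparser, where comment nodes appear as HtmlComment, but that case is not exercised here.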
Source: a similar question about the lxml error "ValueError: Input object has no element: HtmlComment" from .itertext() in Python can be found on Stack Overflow: https://stackoverflow.com/questions/31059786/