
python - Disable the comment check for '--' in lxml

Reposted. Author: 行者123. Updated: 2023-11-30 23:02:16

Use case:

Parsing https://www.banca-romaneasca.ro/en/tools-and-resources/ with lxml fails.

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'

The error is raised by lxml: https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx#L3017

The offending comment, found in https://www.banca-romaneasca.ro/en/tools-and-resources/:

...
<script type="text/javascript" src="/_res/js/forms.js"></script>

<!-- Google Code for Remarketing Tag -->
<!--------------------------------------------------
Remarketing tags may not be associated with personally identifiable information or placed on pages related to sensitive categories. See more information and instructions on how to setup the tag on: http://google.com/ads/remarketingsetup
--------------------------------------------------->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 958631629;
var google_custom_params = window.google_tag_params;
...

I am looking for a solution, for example:

  • Disable the check below (some magic or a flag in lxml):

    if b'--' in text or text.endswith(b'-'):
        raise ValueError("Comment may not contain '--' or end with '-'")
  • Monkey patching (changing the code, injection, ...)
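A third option, sketched here, is to rewrite the comment text so that it passes the check before it reaches `ElementTree.Comment`. The helper name `coerce_comment` is hypothetical and is not part of lxml or html5lib; where it would be hooked in (for example around the tree builder's comment factory) is left as an assumption. It needs only the standard library:

```python
def coerce_comment(text):
    """Return comment text that contains no '--' and does not end with '-'.

    Hypothetical helper: the name and the replacement strategy are
    assumptions made for illustration, not an lxml or html5lib API.
    """
    # Break up adjacent dashes. Loop because a single pass over "---"
    # yields "- --", which still contains "--".
    while "--" in text:
        text = text.replace("--", "- -")
    # A trailing dash is rejected as well, so pad it with a space.
    if text.endswith("-"):
        text += " "
    return text


# A shortened version of the comment from the page above.
bad = "-" * 10 + "\nRemarketing tags may not be ...\n" + "-" * 9
good = coerce_comment(bad)
```

The trade-off is that the comment text is altered (dash runs become `- - -`), which is usually acceptable when the comments are decorative, as they are here.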

Update 1:

I use html5lib because I want the elements available in HTML5, such as audio, section, video, and so on.

import html5lib
from lxml.html import html5parser, fromstring

context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work

context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)
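If the comments themselves are not needed, one possible workaround for the failing calls above is to strip them from the markup before handing it to the parser. This is a minimal sketch using only the standard library; `strip_comments` and `COMMENT_RE` are hypothetical names, and a regular expression is only an approximation of HTML comment parsing (it would also remove a `<!-- ... -->` sequence that appears inside a script string, for instance).

```python
import re

# Matches "<!--" up to the first following "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(markup):
    """Remove HTML comments from a str of markup (approximate)."""
    return COMMENT_RE.sub("", markup)


sample = (
    "<p>before</p>\n"
    "<!--------\nRemarketing note\n--------->\n"
    "<p>after</p>"
)
cleaned = strip_comments(sample)
```

The cleaned string could then be passed to `html5parser.fromstring` in place of the raw content. If the content is a byte string, the same pattern would have to be compiled from a bytes literal instead.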

Versions:

  • html5lib==0.9999999
  • lxml==3.5.0 (downgrading lxml does not help either)

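Update 2:

This appears to be an improvement/issue in lxml: https://github.com/lxml/lxml/pull/172#issuecomment-169084439

Waiting for feedback from the lxml developers.

Update 3:

Feedback received: this appears to be an html5lib bug that is already fixed in the latest development version on GitHub.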

Best answer

A solution was found, based on @opottone's comment on GitHub:

I tried installing the latest html5parser from GitHub. Now I only get a warning instead of an error.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/34595275/
