python - lxml strip_tags causes AttributeError

Reposted by: 行者123  Updated: 2023-12-01 04:57:25

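I need to clean up an HTML file, for example by removing redundant "span" tags. A span is considered redundant if, according to the CSS file, it has the same font-weight and font-style as its parent node (I converted the CSS file to a dictionary for faster lookup).

The HTML file looks like this: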

<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>

The CSS styles, which I have already stored in a dictionary:

{'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique', 
 'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic', 
 'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic', 
 'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal', 
 'Title': 'font-style: oblique; text-align: center; font-weight: bold', 
 'norm': 'font-style: normal; text-align: center; font-weight: normal'}

So, given that <p Title> and <span id xxxxx>, as well as <p norm> and <span bbbbbb>, have the same font-weight and font-style in the CSS dictionary, I would like to get the following result:

<p class= "Title">blablabla bla prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss  aa  </p>

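In addition, there are some spans that I can remove just by looking at their id: if the id contains "af", I can remove the span without consulting the dictionary.

So, my script contains: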

from lxml import etree
from asteval import Interpreter

tree = etree.parse("filename.html")

aeval = Interpreter()
filedic = open('dic_file', 'rb')
fileread = filedic.read()
new_dic = aeval(fileread)

def no_af(tree):

  for badspan in tree.xpath("//span[contains(@id, 'af')]"):
      badspan.getparent().remove(badspan)

  return tree

def no_normal():
    no_af(tree)

    for span in tree.xpath('.//span'):
        span_id = span.xpath('@id')

        for x in span_id:
            if x in new_dic:
                get_style = x
                parent = span.getparent()
                par_span = parent.xpath('@class')
                if par_span:
                    for ID in par_span:
                        if ID in new_dic:
                            get_par_style = ID
                            if 'font-weight' in new_dic[get_par_style] and 'font-style' in new_dic[get_par_style]:
                                if 'font-weight' in new_dic[get_style] and 'font-style' in new_dic[get_style]:
                                    if new_dic[get_par_style]['font-weight']==new_dic[get_style]['font-weight'] and new_dic[get_par_style]['font-style']==new_dic[get_style]['font-style']:
                                        etree.strip_tags(parent, 'span')

    print etree.tostring(tree, pretty_print =True, method = "html", encoding = "utf-8")

This results in:

AttributeError: 'NoneType' object has no attribute 'xpath'

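I know that it is exactly the line "etree.strip_tags(parent, 'span')" that causes the error, because when I comment it out and print after any of the other lines, everything works.

Also, I am not sure whether etree.strip_tags(parent, 'span') does what I need. What if the parent contains several spans with different formatting? Will this call strip all of them? I really only need to strip one span, the current one, which is obtained at the top of the function in "for span in tree.xpath('.//span'):".

I have been looking at this error all day, and I think I am overlooking something... I desperately need your help!

Best answer

lxml is great, but it provides a fairly low-level "etree" data structure and does not have the broadest set of built-in editing operations. What you need is an "unwrap" operation that you can apply to individual elements, keeping their text, any child elements, and their "tail" in the tree, but not the element itself. Here is such an operation (plus a needed helper function):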

def noneCat(*args):
    """
    Concatenate arguments. Treats None as the empty string, though it returns
    the None object if all the args are None. That might not seem sensible, but
    it works well for managing lxml text components.
    """
    for ritem in args:
        if ritem is not None:
            break
    else:
        # Executed only if loop terminates through normal exhaustion, not via break
        return None

    # Otherwise, grab their string representations (empty string for None)
    return ''.join((str(v) if v is not None else "") for v in args)


def unwrap(e):
    """
    Unwrap the element. The element is deleted and all of its children
    are pasted in its place.
    """
    parent = e.getparent()
    prev = e.getprevious()

    kids = list(e)
    siblings = list(parent)

    # parent inherits children, if any
    sibnum = siblings.index(e)
    if kids:
        parent[sibnum:sibnum+1] = kids
    else:
        parent.remove(e)

    # prev node or parent inherits text
    if prev is not None:
        prev.tail = noneCat(prev.tail, e.text)
    else:
        parent.text = noneCat(parent.text, e.text)

    # last child, prev node, or parent inherits tail
    if kids:
        last_child = kids[-1]
        last_child.tail = noneCat(last_child.tail, e.tail)
    elif prev is not None:
        prev.tail = noneCat(prev.tail, e.tail)
    else:
        parent.text = noneCat(parent.text, e.tail)
    return e
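
For trees parsed with lxml.html (as in the example further down), elements also provide a built-in drop_tag() method that appears to do the same job as unwrap: it removes the element but merges its text, children, and tail into the parent. The following is a minimal sketch rather than a drop-in replacement; the input is the first paragraph from the question, and the variable names are only illustrative.

```python
import lxml.html

# First paragraph from the question, parsed as a fragment (illustrative input)
p = lxml.html.fragment_fromstring(
    '<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr '
    '<span id="yyyyy"> jj </span> </p>'
)

# Pick the span to drop by its id. drop_tag() is defined on lxml.html
# elements only, not on plain lxml.etree elements.
span = p.xpath('.//span[@id="xxxxx"]')[0]
span.drop_tag()

# Serialize to text so the result can be inspected
result = lxml.html.tostring(p, encoding="unicode")
print(result)
```

Note that this only works because the tree was built by lxml.html; a tree built with etree.parse(), as in the question, would still need something like unwrap.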

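Now, you have already done part of the work of decomposing the CSS and determining whether one CSS selector (span#id) makes a specification redundant with respect to another selector (p.class). Let's extend that and wrap it into a function: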

cssdict = { 'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
            'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
            'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
            'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
            'Title': 'font-style: oblique; text-align: center; font-weight: bold',
            'norm': 'font-style: normal; text-align: center; font-weight: normal'
          }

RELEVANT = ['font-weight', 'font-style']

def parse_css_spec(s):
    """
    Decompose CSS style spec into a dictionary of its components.
    """
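    # Skip empty parts so that a trailing ';' does not break the unpacking below
    parts = [ p.strip() for p in s.split(';') if p.strip() ]
    # Split on the first ':' only, so values that contain ':' stay intact
    attpairs = [ p.split(':', 1) for p in parts ]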
    attpairs = [ (k.strip(), v.strip()) for k,v in attpairs ]
    return dict(attpairs)

cssparts = { k: parse_css_spec(v) for k,v in cssdict.items() }
# pprint(cssparts)

def redundant_span(span_css_name, parent_css_name, consider=RELEVANT):
    """
    Determine if a given span is redundant with respect to its parent,
    considering specific attribute names. If the span's attribute
    values are the same as the parent's, consider it redundant.
    """
    span_spec = cssparts[span_css_name]
    parent_spec = cssparts[parent_css_name]
    for k in consider:
        # Any differences => not redundant
        if span_spec[k] != parent_spec[k]:
            return False
    # Everything matches => is redundant
    return True
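
As a quick sanity check, the comparison can be exercised on the dictionary from the question without touching any HTML. The sketch below is self-contained, so it restates the parsing step in compact form. The helper names parse_spec and is_redundant are made up for this illustration, and treating an unknown id, an unknown class, or a missing property as "not redundant" is an assumption, not something the question specifies.

```python
cssdict = {
    'xxxxx': 'font-weight: bold; font-size: 8.0pt; font-style: oblique',
    'yyyyy': 'font-weight: normal; font-size: 9.0pt; font-style: italic',
    'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
    'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
    'Title': 'font-style: oblique; text-align: center; font-weight: bold',
    'norm': 'font-style: normal; text-align: center; font-weight: normal',
}

RELEVANT = ('font-weight', 'font-style')


def parse_spec(spec):
    """Split 'a: b; c: d' into {'a': 'b', 'c': 'd'}, skipping empty parts."""
    pairs = (part.split(':', 1) for part in spec.split(';') if part.strip())
    return {k.strip(): v.strip() for k, v in pairs}


cssparts = {name: parse_spec(spec) for name, spec in cssdict.items()}


def is_redundant(span_id, parent_class, consider=RELEVANT):
    """Hypothetical tolerant variant: unknown names count as not redundant."""
    span_spec = cssparts.get(span_id)
    parent_spec = cssparts.get(parent_class)
    if span_spec is None or parent_spec is None:
        return False
    # Redundant only if every considered property is set and equal on both
    return all(
        k in span_spec and span_spec[k] == parent_spec.get(k)
        for k in consider
    )


# The four span/parent pairs from the question, plus one unknown id
pairs = [('xxxxx', 'Title'), ('yyyyy', 'Title'),
         ('aaaa', 'norm'), ('bbbbbb', 'norm'), ('missing', 'Title')]
checks = {pair: is_redundant(*pair) for pair in pairs}
for pair, verdict in checks.items():
    print(pair, verdict)
```

Returning False for unknown names keeps the main loop from raising KeyError on a span whose id has no entry in the dictionary, at the cost of silently leaving such spans in place.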

OK, preparations done; time for the main show:

import lxml.html
from lxml.html import tostring

source = """
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
"""

h = lxml.html.document_fromstring(source)

print("<!-- before -->")
print(tostring(h, pretty_print=True, encoding="unicode"))
print()

# Unwrap every span whose relevant CSS properties match those of its parent
for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        unwrap(span)

print("<!-- after -->")
print(tostring(h, pretty_print=True, encoding="unicode"))

Output:

<!-- before -->
<html><body>
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
</body></html>


<!-- after -->
<html><body>
<p class="Title">blablabla bla prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss  aa  </p>
</body></html>

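Update

On second thought, you don't need unwrap. I used it because it is handy in my toolbox. You can do without it by using a mark-and-sweep approach and etree.strip_tags, like this: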

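from lxml import etree

# Mark: retag each redundant span instead of editing the tree inside the loop
for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        span.tag = "JUNK"
# Sweep: strip all marked elements in one pass, keeping their text and tails
etree.strip_tags(h, "JUNK")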
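
Why the sweep is deferred until after the loop can be seen in a small experiment. Calling etree.strip_tags(parent, 'span') strips every span under that parent, not just the current one, and element objects already collected by xpath() are left detached from the tree, so a later getparent() on them returns None. That appears to be where the "'NoneType' object has no attribute 'xpath'" error in the question comes from. A minimal sketch, using the question's first paragraph as illustrative input:

```python
from lxml import etree
import lxml.html

p = lxml.html.fragment_fromstring(
    '<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr '
    '<span id="yyyyy"> jj </span> </p>'
)

# Collect the spans up front, as the loop in the question does
first, second = p.xpath('.//span')

# What the question's loop body does for the first span: this call
# strips ALL spans under the parent, not only `first`
etree.strip_tags(first.getparent(), 'span')

# Number of spans still attached to the tree
remaining_spans = len(p.xpath('.//span'))
# Parent of the span that the loop would visit next
second_parent = second.getparent()

print(remaining_spans, second_parent)
print(lxml.html.tostring(p, encoding="unicode"))
```

Retagging only the redundant spans and stripping the marker tag once at the end avoids both problems: spans with different formatting keep their tag, and no element is detached while the loop still needs it.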

Regarding python - lxml strip_tags causes AttributeError, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27067379/
