
python - lxml strip_tags causes AttributeError

Reposted. Author: 行者123. Last updated: 2023-12-01 04:57:25

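I need to clean up an HTML file, for example by removing redundant "span" tags. A "span" is considered redundant if, according to the CSS file, it has the same font-weight and font-style as its parent node (I converted the CSS file to a dictionary for faster lookup).

The HTML file looks like this: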

<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>

The CSS styles I have already stored in a dictionary:

{'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique', 
'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
'Title': 'font-style: oblique; text-align: center; font-weight: bold',
'norm': 'font-style: normal; text-align: center; font-weight: normal'}

So, given that <p Title><span id xxxxx> and <p norm><span bbbbbb> have the same font-weight and font-style in the CSS dictionary, I want to get the following result:

<p class= "Title">blablabla bla prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss  aa  </p>

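In addition, there are some spans that I can delete just by looking at their id: if the id contains "af", I delete the span without consulting the dictionary.

So, my script contains: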

from lxml import etree
from asteval import Interpreter

tree = etree.parse("filename.html")

aeval = Interpreter()
filedic = open('dic_file', 'rb')
fileread = filedic.read()
new_dic = aeval(fileread)

def no_af(tree):

    for badspan in tree.xpath("//span[contains(@id, 'af')]"):
        badspan.getparent().remove(badspan)

    return tree

def no_normal():
    no_af(tree)

    for span in tree.xpath('.//span'):
        span_id = span.xpath('@id')

        for x in span_id:
            if x in new_dic:
                get_style = x
                parent = span.getparent()
                par_span = parent.xpath('@class')
                if par_span:
                    for ID in par_span:
                        if ID in new_dic:

                            get_par_style = ID
                            if 'font-weight' in new_dic[get_par_style] and 'font-style' in new_dic[get_par_style]:

                                if 'font-weight' in new_dic[get_style] and 'font-style' in new_dic[get_style]:

                                    if new_dic[get_par_style]['font-weight']==new_dic[get_style]['font-weight'] and new_dic[get_par_style]['font-style']==new_dic[get_style]['font-style']:

                                        etree.strip_tags(parent, 'span')

    print(etree.tostring(tree, pretty_print =True, method = "html", encoding = "utf-8"))

This results in:

AttributeError: 'NoneType' object has no attribute 'xpath'

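And I know that it is exactly the line "etree.strip_tags(parent, 'span')" that causes the error, because when I comment it out and print after any other line, everything works.

Also, I am not sure that etree.strip_tags(parent, 'span') does what I need. What if the parent contains several spans with different formatting? Will this command strip all of them? I actually need to strip only one span, the current one, which is obtained at the start of the function in "for span in tree.xpath('.//span'):".

I have been working on this error all day and I think I am overlooking something... I desperately need your help!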

Best answer

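lxml is great, but it provides fairly low-level "etree" data structures and does not have the broadest set of built-in editing operations. What you need is an "unwrap" operation that you can apply to individual elements: it keeps the element's text, any child elements, and its "tail" in the tree, but not the element itself. Here is such an operation (plus a needed helper function):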

def noneCat(*args):
    """
    Concatenate arguments. Treats None as the empty string, though it returns
    the None object if all the args are None. That might not seem sensible, but
    it works well for managing lxml text components.
    """
    for ritem in args:
        if ritem is not None:
            break
    else:
        # Executed only if loop terminates through normal exhaustion, not via break
        return None

    # Otherwise, grab their string representations (empty string for None)
    return ''.join((str(v) if v is not None else "") for v in args)


def unwrap(e):
    """
    Unwrap the element. The element is deleted and all of its children
    are pasted in its place.
    """
    parent = e.getparent()
    prev = e.getprevious()

    kids = list(e)
    siblings = list(parent)

    # parent inherits children, if any
    sibnum = siblings.index(e)
    if kids:
        parent[sibnum:sibnum+1] = kids
    else:
        parent.remove(e)

    # prev node or parent inherits text
    if prev is not None:
        prev.tail = noneCat(prev.tail, e.text)
    else:
        parent.text = noneCat(parent.text, e.text)

    # last child, prev node, or parent inherits tail
    if kids:
        last_child = kids[-1]
        last_child.tail = noneCat(last_child.tail, e.tail)
    elif prev is not None:
        prev.tail = noneCat(prev.tail, e.tail)
    else:
        parent.text = noneCat(parent.text, e.tail)
    return e
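
For comparison, elements parsed with lxml.html (as opposed to plain etree) carry a built-in drop_tag() method that performs essentially the same unwrapping as the hand-written unwrap above. The following is only a minimal sketch, assuming Python 3 and that the document is parsed through lxml.html; the fragment is the first paragraph from the question.

```python
from lxml import html

# First paragraph from the question, parsed as a single HTML fragment.
fragment = ('<p class="Title">blablabla <span id="xxxxx">bla</span> '
            'prprpr <span id="yyyyy"> jj </span> </p>')
p = html.fragment_fromstring(fragment)

# Pick one specific span and drop only its tag; its text and tail
# are merged into the surrounding content of the parent.
span = p.xpath('.//span[@id="xxxxx"]')[0]
span.drop_tag()

result = html.tostring(p, encoding="unicode")
print(result)
```

Only the selected span disappears; the other span in the same paragraph is left alone, which is the per-element behaviour the question asks for.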

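Now, you have already done part of the work of decomposing the CSS and determining whether one CSS selector (span#id) is what you want to consider a redundant specification with respect to another selector (p.class). Let's extend that and wrap it into a function: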

cssdict = { 'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
            'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
            'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
            'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
            'Title': 'font-style: oblique; text-align: center; font-weight: bold',
            'norm': 'font-style: normal; text-align: center; font-weight: normal'
          }

RELEVANT = ['font-weight', 'font-style']

def parse_css_spec(s):
    """
    Decompose CSS style spec into a dictionary of its components.
    """
    # Skip empty parts so a trailing ';' does not break the unpacking below
    parts = [ p.strip() for p in s.split(';') if p.strip() ]
    # Split on the first ':' only, so values that contain ':' stay intact
    attpairs = [ p.split(':', 1) for p in parts ]
    attpairs = [ (k.strip(), v.strip()) for k,v in attpairs ]
    return dict(attpairs)

cssparts = { k: parse_css_spec(v) for k,v in cssdict.items() }
# pprint(cssparts)

def redundant_span(span_css_name, parent_css_name, consider=RELEVANT):
    """
    Determine if a given span is redundant with respect to its parent,
    considering specific attribute names. If the span's attributes
    values are the same as the parent's, consider it redundant.
    """
    span_spec = cssparts[span_css_name]
    parent_spec = cssparts[parent_css_name]
    for k in consider:
        # Any differences => not redundant
        if span_spec[k] != parent_spec[k]:
            return False
    # Everything matches => is redundant
    return True
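
Note that redundant_span looks names up directly, so a span id or parent class that is missing from the dictionary, or a style that lacks one of the considered properties, raises a KeyError. If the real HTML may contain such cases, a more tolerant variant could look like the sketch below. The names is_redundant and styles are hypothetical, introduced only for this illustration, and treating unknown styles as "not redundant" is an assumption about the desired behaviour.

```python
def parse_css_spec(spec):
    """Split 'name: value; name: value' into a dict (same idea as above)."""
    pairs = (part.split(":", 1) for part in spec.split(";") if part.strip())
    return {name.strip(): value.strip() for name, value in pairs}

# Tiny stand-in for the cssparts dictionary (hypothetical sample data).
styles = {
    "xxxxx": parse_css_spec("font-weight: bold; font-size: 8.0pt; font-style: oblique"),
    "Title": parse_css_spec("font-style: oblique; text-align: center; font-weight: bold"),
    "nostyle": parse_css_spec("font-size: 9.0pt"),
}

def is_redundant(span_name, parent_name, consider=("font-weight", "font-style")):
    """Tolerant variant: unknown names or missing properties count as not redundant."""
    span_spec = styles.get(span_name)
    parent_spec = styles.get(parent_name)
    if span_spec is None or parent_spec is None:
        return False
    for prop in consider:
        # A property missing on either side cannot be compared safely
        if prop not in span_spec or prop not in parent_spec:
            return False
        if span_spec[prop] != parent_spec[prop]:
            return False
    return True

print(is_redundant("xxxxx", "Title"))    # True
print(is_redundant("unknown", "Title"))  # False
print(is_redundant("nostyle", "Title"))  # False
```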

OK, with the groundwork done, it's showtime:

import lxml.html
from lxml.html import tostring

source = """
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
"""

h = lxml.html.document_fromstring(source)

print("<!-- before -->")
print(tostring(h, pretty_print=True, encoding="unicode"))
print()

# Unwrap every span whose relevant styles match those of its parent
for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        unwrap(span)

print("<!-- after -->")
print(tostring(h, pretty_print=True, encoding="unicode"))

This yields:

<!-- before -->
<html><body>
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
</body></html>


<!-- after -->
<html><body>
<p class="Title">blablabla bla prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss aa </p>
</body></html>
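
The question also removes spans whose id contains "af" with getparent().remove(...). In lxml an element's tail text belongs to the element, so remove() discards the text that follows the span as well. If that text should survive, lxml.html elements offer drop_tree(), which deletes the element and its content but joins the tail to the preceding text. A minimal sketch, assuming Python 3 and lxml.html parsing; the sample markup is made up for illustration:

```python
from lxml import html

markup = '<p>x <span id="af1">gone</span> tail <span id="ok">keep</span></p>'

# Variant 1: plain remove(), as in the question. The tail " tail " goes too.
p_lossy = html.fragment_fromstring(markup)
for bad in p_lossy.xpath('.//span[contains(@id, "af")]'):
    bad.getparent().remove(bad)
lossy = html.tostring(p_lossy, encoding="unicode")

# Variant 2: drop_tree() removes the span and its content but keeps the tail.
p_kept = html.fragment_fromstring(markup)
for bad in p_kept.xpath('.//span[contains(@id, "af")]'):
    bad.drop_tree()
kept = html.tostring(p_kept, encoding="unicode")

print(lossy)
print(kept)
```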

Update

On second thought, you don't need unwrap. I used it because it was handy in my toolbox. You can avoid it by using a mark-and-sweep approach with etree.strip_tags, like so:

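from lxml import etree

# Mark: retag each redundant span instead of removing it inside the loop
for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        span.tag = "JUNK"

# Sweep: strip all marked elements in a single pass after the loop
etree.strip_tags(h, "JUNK")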
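
The reason the sweep is deferred until after the loop, and a likely explanation for the AttributeError in the question, is that etree.strip_tags(parent, 'span') strips every span under parent, not just the current one. Spans that were already collected by xpath() become detached, so a later getparent() call on them returns None. The sketch below illustrates this under the assumption of Python 3 and lxml.html parsing; the sample markup and the lowercase "junk" tag name are made up for the example.

```python
from lxml import etree, html

markup = '<p class="norm">a <span id="s1">b</span> c <span id="s2">d</span></p>'

# Naive approach: strip_tags on the parent affects ALL spans inside it.
root = html.fragment_fromstring(markup)
spans = root.xpath('.//span')
etree.strip_tags(root, 'span')
# Every previously collected span is now detached from the tree.
detached = [s.getparent() is None for s in spans]
naive = html.tostring(root, encoding="unicode")

# Mark-and-sweep: retag only the span to drop, then strip that tag name.
root2 = html.fragment_fromstring(markup)
for s in root2.xpath('.//span[@id="s2"]'):
    s.tag = "junk"
etree.strip_tags(root2, "junk")
swept = html.tostring(root2, encoding="unicode")

print(detached)
print(naive)
print(swept)
```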

This page on "python - lxml strip_tags causes AttributeError" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/27067379/
