python - BeautifulSoup: split a tag (containing other tags) in two at a string

Reposted. Author: 行者123. Updated: 2023-12-05 08:04:47

I'm massaging some HTML dictionary data into XML so it can be imported into some dictionary software.

The original HTML looks like this:

<div class="entry">
<span class="headword">word</span>
<span class="pos">part of speech</span>
<span class="definition">sense1; sense2
<span class="example">(example2.1; example2.2)</span>
; sense3 <span class="example">(example3.1; example3.2)</span>
</span>
</div>

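Edit: in reality the input classes do not match the output XML tag names exactly. I only made them match in my example to make the relationship clear. I need to replace specific classes with specific XML tags, but the names differ.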
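One way to handle a class-to-tag mapping where the names differ is a plain dictionary plus an in-place rename. The sketch below is only an illustration: the output names `hw` and `gram` are hypothetical, since the question does not give the real ones, and `rename_by_class` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from input class names to output tag names;
# the real target names are not stated in the question.
CLASS_TO_TAG = {"entry": "entry", "headword": "hw", "pos": "gram"}

def rename_by_class(soup, mapping):
    """Rename each tag whose first class is in `mapping`, then drop the class."""
    for tag in soup.find_all(class_=True):  # only tags that have a class attribute
        first_class = tag["class"][0]
        if first_class in mapping:
            tag.name = mapping[first_class]
            del tag["class"]
    return soup

html = '<div class="entry"><span class="headword">word</span><span class="pos">noun</span></div>'
soup = rename_by_class(BeautifulSoup(html, "html.parser"), CLASS_TO_TAG)
print(soup)  # <entry><hw>word</hw><gram>noun</gram></entry>
```

Tags whose class is not in the mapping are left untouched, so the 1:1 replacements can be done first and the harder cases handled separately.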

The ideal end result would look like this:

<entry>
<headword>word</headword>
<pos>part of speech</pos>
<sense>
<definition>sense1</definition>
</sense>
<sense>
<definition>sense2</definition>
<example>example2.1</example>
<example>example2.2</example>
</sense>
<sense>
<definition>sense3</definition>
<example>example3.1</example>
<example>example3.2</example>
</sense>
</entry>

The current state of my soup (with the direct substitutions done) is:

<entry>
<headword>word</headword>
<pos>part of speech</pos>
<definition>sense1; sense2
<example>example2.1</example>
<example>example2.2</example>
; sense3
<example>example3.1</example>
<example>example3.2</example>
</definition>
</entry>

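The divisions that map 1:1 are easy, and wrapping each definition plus its examples in a sense tag should be easy too. The problem is that the original uses ; indiscriminately to separate both senses and examples. That means I need to split the example tags first and then split the definition tags at each ; (in effect, replacing ; with </example>\n<example> or </definition>\n<definition>). Since I started writing this question I have worked out how to do this for the examples, because they contain only strings. The definitions, however, may themselves contain <example> tags, so I can't just use split(): the contents come back as a list, and 'list' object has no attribute 'split'.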

Is there an easier way to split a tag that contains other tags, or do I have to iterate over the result set and recreate all the tags?

import re

# assumes `soup` is the BeautifulSoup object shown above
tags = soup.find_all("example")
for tag in tags:
    tag.string = re.sub(r"[()]", "", tag.string)  # remove parentheses
    egs = tag.string.split("; ")  # or str(tag.contents).split("; ") ?
    if len(egs) > 1:
        # insert in reverse so the new tags end up in the original order
        for eg in reversed(egs[1:]):
            new = soup.new_tag("example")
            new.string = eg
            tag.insert_after(new)
        tag.string = egs[0]  # orig tag becomes 1st seg only
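To illustrate why the same approach breaks on a definition tag, here is a small sketch using a shortened version of the current soup. With mixed content, `.string` is `None` and `.contents` is a list, so neither can be split directly. The text nodes among the direct children are string-like, though, and can be split one at a time.

```python
from bs4 import BeautifulSoup, NavigableString

xml = "<definition>sense1; sense2<example>example2.1</example>; sense3</definition>"
definition = BeautifulSoup(xml, "html.parser").definition

print(definition.string)          # None, because the tag has more than one child
print(type(definition.contents))  # <class 'list'>

# Split only the text nodes that are direct children; child tags are skipped.
pieces = []
for child in definition.contents:
    if isinstance(child, NavigableString):
        pieces.extend(p.strip() for p in child.split(";") if p.strip())

print(pieces)  # ['sense1', 'sense2', 'sense3']
```

This only collects the sense texts. Deciding which example tags belong to which sense still requires walking the children in order.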

Best answer

You can inspect each element of soup.contents and build the structure by recursing into the non-string elements of soup.contents:

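from bs4 import BeautifulSoup, NavigableString
import re

# Requires Python 3.8+ (uses the := operator).
def to_xml(d):
    # r: list of (strings, child tags) groups; s: current split strings; k: tags following s
    r, s, k = [], None, []
    for i in filter(lambda x: x != '\n', d.contents):
        if isinstance(i, NavigableString):
            if s is not None:
                r.append((s, k))
            # strip outer parentheses, split on '; ', trim non-word characters, drop empty pieces
            s = [j for i in re.sub(r'^\(|\)$', '', i).split('; ') if (j := re.sub(r'^\W+|\W+$', '', i))]
            k = []
        else:
            k.append(i)
    r.append((s, k))
    for a, b in r:
        if a is not None:
            if len(a) == 1 and not b:
                # a single string with no child tags: emit one tag named after the class
                yield f'<{(c:=" ".join(d["class"]))}>{a[0]}</{c}>\n'
            elif not b:
                # The wrapper name is the text minus trailing digits and dots
                # (sense1 -> sense), which relies on the naming in this sample data.
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format(c, c1, i, c1, c) if (c:=re.sub(r'[\d+\.]+$', '', i)) != (c1:=" ".join(d["class"])) else f"<{c}>{i}</{c}>" for i in a]
            else:
                # child tags belong to the last string of the group
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', i)), (c1:=" ".join(d["class"])), i, c1, c) for i in a[:-1]]
                yield "<{}>\n<{}>{}</{}>\n{}\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', a[-1])), (c1:=' '.join(d['class'])), a[-1], c1, '\n'.join(j for k in b for j in to_xml(k)), c)
        else:
            # no text at this level: wrap the converted children in a tag named after the class
            yield '<{}>{}</{}>'.format((c1:=" ".join(d["class"])), "\n".join(j for k in b for j in to_xml(k)), c1)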


s = """
<div class="entry">
<span class="headword">word</span>
<span class="pos">part of speech</span>
<span class="definition">sense1; sense2
<span class="example">(example2.1; example2.2)</span>
; sense3 <span class="example">(example3.1; example3.2)</span>
</span>
</div>
"""
r = BeautifulSoup(''.join(to_xml(BeautifulSoup(s, 'html.parser').div)), 'html.parser')
print(r)

Output:

<entry>
<headword>word</headword>
<pos>part of speech</pos>
<sense>
<definition>sense1</definition>
</sense>
<sense>
<definition>sense2</definition>
<example>example2.1</example>
<example>example2.2</example>
</sense>
<sense>
<definition>sense3</definition>
<example>example3.1</example>
<example>example3.2</example>
</sense>
</entry>
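The generator above takes its output tag names from the class names and from the sample text itself (sense1 becomes sense), which may not carry over to real data. As a rough alternative, the sketch below builds a new tree explicitly. It assumes every ; in a definition's own text starts a new sense and that each example span belongs to the sense opened just before it. `TAG_FOR_CLASS`, `split_items` and `convert_entry` are names made up for this illustration, and the mapping is the identity only because the question does not give the real target names.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical class-to-tag mapping; identity here, adjust for the real data.
TAG_FOR_CLASS = {"headword": "headword", "pos": "pos"}

def split_items(text):
    """Drop surrounding parentheses, split on ';', discard empty pieces."""
    text = re.sub(r"^\s*\(|\)\s*$", "", text)
    return [part.strip() for part in text.split(";") if part.strip()]

def convert_entry(div):
    """Build a new <entry> tree from one <div class="entry">."""
    out = BeautifulSoup("", "html.parser")
    entry = out.new_tag("entry")
    for span in div.find_all("span", recursive=False):
        cls = span["class"][0]
        if cls in TAG_FOR_CLASS:
            tag = out.new_tag(TAG_FOR_CLASS[cls])
            tag.string = span.get_text(strip=True)
            entry.append(tag)
        elif cls == "definition":
            sense = None
            for child in span.contents:
                if isinstance(child, NavigableString):
                    # Each ';'-separated piece of text opens a new <sense>.
                    for text in split_items(child):
                        sense = out.new_tag("sense")
                        definition = out.new_tag("definition")
                        definition.string = text
                        sense.append(definition)
                        entry.append(sense)
                elif sense is not None:
                    # Examples attach to the most recently opened sense.
                    for text in split_items(child.get_text()):
                        example = out.new_tag("example")
                        example.string = text
                        sense.append(example)
    return entry

html = """
<div class="entry">
<span class="headword">word</span>
<span class="pos">part of speech</span>
<span class="definition">sense1; sense2
<span class="example">(example2.1; example2.2)</span>
; sense3 <span class="example">(example3.1; example3.2)</span>
</span>
</div>
"""
entry = convert_entry(BeautifulSoup(html, "html.parser").div)
print(entry.prettify())
```

Building a fresh tree avoids modifying the soup while iterating over it, at the cost of listing every class that should be carried over.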

Source: this question and answer come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/67557465/
