
python - Replacing all HTML tag attributes with a regular expression

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 15:37:13

I am trying to figure out how to add an id=ID_<number> attribute to every tag in an HTML snippet and remove the tag's other attributes.

For example:

<div class="...">...</div>

becomes:

<div id="DIV_1">...</div>

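DIV is the tag name in upper case, and _1 is the tag's position in the ordering. So if this <div> were the second tag, it would get the id DIV_2. The ordering is depth-first (DFS), so if <div id="DIV_2">..</div> has a child, as in <div id="DIV_2"><ul class=".." style="..">...</ul></div>, the ul tag gets the id UL_3.

I tried to find all the tags, then strip their attributes and add the ids one by one.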
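Depth-first order here coincides with the order in which opening tags appear in the text, because a parent's opening tag always precedes those of its children. A small sketch of that numbering, using the standard library's html.parser; the TagOrder class and the sample markup are hypothetical and not part of the question:

```python
from html.parser import HTMLParser


class TagOrder(HTMLParser):
    """Collects TAGNAME_<n> ids in the order opening tags are encountered."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case, so upper-case them here.
        self.ids.append('{}_{}'.format(tag.upper(), len(self.ids) + 1))


parser = TagOrder()
parser.feed('<div class="a"><div class="b"><ul class="c" style="d">'
            '<li>x</li></ul></div><p>y</p></div>')
print(parser.ids)  # ['DIV_1', 'DIV_2', 'UL_3', 'LI_4', 'P_5']
```

The ul nested inside the second div receives UL_3, matching the example in the question.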

re.findall(r'<([a-z][a-z0-9]*)\b[^>]*>',snippet)

This finds all the tags. My idea was something like this:

for i, tag in enumerate(tags, start=1):
    remove_all_attributes_from_tag
    get the name of the tag and set the attribute id = "{}_{}".format(tag_name.upper(), i)

But I don't know how to continue from there.

Snippet:

<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>

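Best answer

First, replace all attributes of every tag with an id skeleton that contains a unique placeholder. Then, in a loop, replace the placeholders one at a time with a running number.

Code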
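The second step depends on two features of re.subn: the count argument, which limits a call to the leftmost match, and the returned substitution count. A minimal illustration on a hypothetical two-tag string (not taken from the answer):

```python
import re

text = '<a id="DIV_!id!"><b id="DIV_!id!">'

# count=1 limits each call to the first (leftmost) placeholder.
text, n = re.subn('!id!', '1', text, count=1)
print(text, n)  # <a id="DIV_1"><b id="DIV_!id!"> 1

text, n = re.subn('!id!', '2', text, count=1)
print(text, n)  # <a id="DIV_1"><b id="DIV_2"> 1

# With no placeholder left, the string is unchanged and the count is 0.
text, n = re.subn('!id!', '3', text, count=1)
print(text, n)  # <a id="DIV_1"><b id="DIV_2"> 0
```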

import re

html_orig = '<div id="wtab" class="pd_cont" style="display: table;"><div class="pd_colmn"><h4>Display</h4><span>5.20-inch</span></div><div class="pd_colmn"><h4>Processor</h4><span>2GHz octa-core</span></div><div class="pd_colmn"><h4>Front Camera</h4><span>8-megapixel</span></div><div class="pd_colmn"><h4>Resolution</h4><span>1080x1920 pixels</span></div><div class="pd_colmn"><h4>RAM</h4><span>3GB</span></div><div class="pd_colmn"><h4>OS</h4><span>Android 6.0</span></div><div class="pd_colmn"><h4>Storage</h4><span>32GB</span></div><div class="pd_colmn"><h4>Rear Camera</h4><span>16-megapixel</span></div><div class="pd_colmn"><h4>Battery Capacity</h4><span>2650mAh</span></div></div>'

# Step 1: replace the attributes of every opening tag with an id placeholder.
# Raw strings keep the regex escapes intact; the quote after !id! closes the attribute value.
html_edit = re.sub(r'(<[\w\d]+)(\s?[\w\d\s=;"_:]*)(>)',
                   r'\g<1> id="DIV_!id!"\g<3>', html_orig)

# Step 2: replace the placeholders one at a time with a running number.
i = 1
while True:
    sub = re.subn('!id!', str(i), html_edit, count=1)
    if sub[1] == 0:  # no placeholder left
        break
    html_edit = sub[0]
    i += 1

re.subn() returns a tuple of the new string and the number of substitutions made, which provides the loop's exit condition. Every tag gets the fixed prefix DIV here, regardless of its name.

Result

'<div id="DIV_1"><div id="DIV_2"><h4 id="DIV_3">Display</h4><span id="DIV_4">5.20-inch</span></div><div id="DIV_5"><h4 id="DIV_6">Processor</h4><span id="DIV_7">2GHz octa-core</span></div><div id="DIV_8"><h4 id="DIV_9">Front Camera</h4><span id="DIV_10">8-megapixel</span></div><div id="DIV_11"><h4 id="DIV_12">Resolution</h4><span id="DIV_13">1080x1920 pixels</span></div><div id="DIV_14"><h4 id="DIV_15">RAM</h4><span id="DIV_16">3GB</span></div><div id="DIV_17"><h4 id="DIV_18">OS</h4><span id="DIV_19">Android 6.0</span></div><div id="DIV_20"><h4 id="DIV_21">Storage</h4><span id="DIV_22">32GB</span></div><div id="DIV_23"><h4 id="DIV_24">Rear Camera</h4><span id="DIV_25">16-megapixel</span></div><div id="DIV_26"><h4 id="DIV_27">Battery Capacity</h4><span id="DIV_28">2650mAh</span></div></div>'
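The question asks for the upper-cased tag name as the prefix (DIV_1, H4_3, ...), whereas the result above uses DIV for every tag. One possible variant, offered as a sketch and not as the answer's method, does both steps in a single pass by giving re.sub a replacement function. The names TAG_RE and renumber_tags and the sample input are hypothetical. The pattern assumes simple markup: an attribute value containing >, comments, or script content would be mishandled, and the slash of a self-closing tag such as <br/> is dropped.

```python
import re
from itertools import count

# Opening tags only: "<", a tag name, optional attributes, ">".
# Closing tags are skipped because "/" cannot start the tag name.
TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>')


def renumber_tags(snippet):
    """Replace every opening tag's attributes with id="<TAGNAME>_<n>"."""
    counter = count(1)

    def repl(match):
        name = match.group(1)
        return '<{0} id="{1}_{2}">'.format(name, name.upper(), next(counter))

    return TAG_RE.sub(repl, snippet)


print(renumber_tags('<div class="a"><ul class="b" style="c"><li>x</li></ul></div>'))
# <div id="DIV_1"><ul id="UL_2"><li id="LI_3">x</li></ul></div>
```

Because re.sub visits matches from left to right, the counter follows document order, which is the depth-first order the question describes. For arbitrary real-world HTML, a proper parser is likely the safer choice.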

Regarding python - replacing all HTML tag attributes with a regular expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42437845/
