python - Removing attributes that contain colons with BeautifulSoup in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:59:16

I sometimes come across HTML with odd attributes, such as fb:share:layout:

<a class="addthis_button_facebook_share" fb:share:layout="button_count" style="height:20px;"></a>

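I'm not quite sure what they are called (itemscopes? namespaces?).

I currently parse HTML with beautifulsoup4 in Python. I'd like to know whether there is a way to remove or rename all attributes that contain these colons.

Thanks

Edit: Thanks for your answers. This is how I ended up implementing it: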

    for tag in soup.find_all(True):
        attrs = dict(tag.attrs)
        for attr in attrs:
            if ":" in attr:
                del tag.attrs[attr]
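
The question also asks about renaming, which the snippet above does not cover. A minimal sketch of both options, assuming bs4 with the built-in `html.parser` backend (other parsers may treat colon-containing names differently). The helper names `strip_colon_attrs` and `rename_colon_attrs` and the `-` replacement character are illustrative choices, not part of the original thread:

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = ('<a class="addthis_button_facebook_share" '
        'fb:share:layout="button_count" style="height:20px;"></a>')


def strip_colon_attrs(soup):
    """Delete every attribute whose name contains a colon."""
    for tag in soup.find_all(True):
        # Iterate over a copy of the names so deleting is safe.
        for name in list(tag.attrs):
            if ":" in name:
                del tag.attrs[name]
    return soup


def rename_colon_attrs(soup, replacement="-"):
    """Rename attributes such as fb:share:layout to fb-share-layout.

    The replacement character is an assumption; pick whatever
    suits the downstream consumer of the HTML.
    """
    for tag in soup.find_all(True):
        for name in list(tag.attrs):
            if ":" in name:
                new_name = name.replace(":", replacement)
                tag.attrs[new_name] = tag.attrs.pop(name)
    return soup


stripped = strip_colon_attrs(BeautifulSoup(html, "html.parser"))
renamed = rename_colon_attrs(BeautifulSoup(html, "html.parser"))
print(stripped)
print(renamed)
```

Note that the attribute value `height:20px;` also contains a colon but is left alone, because only attribute names are checked.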

Best answer

Try this.

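    from bs4 import BeautifulSoup

    def _remove_attrs(soup):
        # Removes every attribute from every tag, not only those containing a colon.
        # Select the tags that have at least one attribute.
        tag_list = soup.find_all(lambda tag: len(tag.attrs) > 0)
        for t in tag_list:
            # tag.attrs is a dict in bs4; iterate over a copy of the names
            # so that deleting entries is safe.
            for attr in list(t.attrs):
                del t[attr]
        return soup


    def example():
        doc = '<html><head><title>test</title></head><body id="foo"><p class="whatever">junk</p><div style="background: yellow;">blah</div></body></html>'
        print('Before:\n%s' % doc)
        soup = BeautifulSoup(doc, 'html.parser')
        clean_soup = _remove_attrs(soup)
        print('After:\n%s' % clean_soup)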

You can also try the approaches below for additional reference.

Remove all HTML attributes with BeautifulSoup except some tags( ...)

    from bs4 import BeautifulSoup

    # remove all attributes
    def _remove_all_attrs(soup):
        for tag in soup.find_all(True):
            tag.attrs = {}
        return soup

    # remove all attributes except some tags
    def _remove_all_attrs_except(soup):
        whitelist = ['a','img']
        for tag in soup.find_all(True):
            if tag.name not in whitelist:
                tag.attrs = {}
        return soup

    # remove all attributes except some tags(only saving ['href','src'] attr)
    def _remove_all_attrs_except_saving(soup):
        whitelist = ['a','img']
        for tag in soup.find_all(True):
            if tag.name not in whitelist:
                tag.attrs = {}
            else:
                attrs = dict(tag.attrs)
                for attr in attrs:
                    if attr not in ['src','href']:
                        del tag.attrs[attr]
        return soup
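
The whitelist idea above also disposes of colon-containing attributes as a side effect, since they are never on the list. A compact sketch of the same idea with a per-tag allowlist, assuming bs4 with `html.parser`. The `ALLOWED` mapping, the `keep_only_allowed` name, and the sample markup are hypothetical. Unlike the helper above, which keeps both `src` and `href` on either tag, this variant keeps only the attribute listed for each tag:

```python
from bs4 import BeautifulSoup

# Hypothetical per-tag allowlist: tag name -> attribute names to keep.
ALLOWED = {"a": {"href"}, "img": {"src"}}


def keep_only_allowed(soup, allowed=ALLOWED):
    """Drop every attribute that is not allowlisted for its tag."""
    for tag in soup.find_all(True):
        keep = allowed.get(tag.name, set())
        # Rebuild the attribute dict instead of deleting in place.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return soup


# Illustrative input, not from the original thread.
doc = ('<p class="x"><a href="/home" fb:share:layout="button_count" '
       'style="color:red">home</a><img src="a.png" alt="a"></p>')
cleaned = keep_only_allowed(BeautifulSoup(doc, "html.parser"))
print(cleaned)
```

An allowlist is usually the safer choice when the input HTML is untrusted, because unknown attributes are dropped by default rather than having to be anticipated one by one.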

Hope this helps.

Regarding "python - Removing attributes that contain colons with BeautifulSoup in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55808316/
