
python - Best way to 'clean up' HTML text

Reposted. Author: 太空狗. Updated: 2023-10-30 01:28:24

I have the following text:

"It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth,
nature, diversity, and history &#8211; all inside the prison of
your mind! Where else can you..."

What I want to do is remove the HTML tags and encode the result as unicode. I am currently doing:

def remove_tags(text):
    return TAG_RE.sub('', text)

This only strips the tags. How would I correctly encode the above for database storage?
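To make the problem concrete, here is a minimal sketch of the regex approach from the question. The question does not show how `TAG_RE` is defined, so the pattern below is an assumption (a common choice), and the sample string is a shortened piece of the text above. It is written for Python 3.

```python
import re

# Assumption: TAG_RE is not shown in the question; this simple pattern
# matches anything that looks like an HTML tag.
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    """Strip tag-like substrings; character references are left untouched."""
    return TAG_RE.sub('', text)

sample = "<i>Wonder Showzen</i> tackles history &#8211; all inside"
stripped = remove_tags(sample)
print(stripped)
```

The `<i>` tags are gone, but the `&#8211;` reference is still in the string as literal text, which is the part the question is asking about.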

Best answer

You can try passing your text through an HTML parser. Here is an example using BeautifulSoup:

from bs4 import BeautifulSoup

text = '''It's the show your only friend and pastor have been talking about!
<i>Wonder Showzen</i> is a hilarious glimpse into the black
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth,
nature, diversity, and history &#8211; all inside the prison of
your mind! Where else can you...'''

# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(text, 'html.parser')

>>> soup.text
u"It's the show your only friend and pastor have been talking about! \nWonder Showzen is a hilarious glimpse into the black \nheart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, \nnature, diversity, and history \u2013 all inside the prison of \nyour mind! Where else can you..."

You now have a unicode string in which the HTML entities have been converted to unicode characters, i.e. &#8211; has been converted to \u2013.

This also removes the HTML tags.
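If installing BeautifulSoup is not an option, roughly the same result can be had from the Python 3 standard library. The sketch below is a swapped-in alternative, not the answer's method: it subclasses `html.parser.HTMLParser`, and the names `TextExtractor` and `clean_html` are hypothetical. The whitespace collapsing at the end is an extra assumption about what "clean" should mean; drop it to keep the original line breaks.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes; tags are ignored and character references decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &#8211; and friends
        # before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Assumption: collapse runs of whitespace (including newlines) to one space.
    return ' '.join(''.join(parser.parts).split())

cleaned = clean_html(
    "MTV2's<i> Wonder Showzen</i> tackles history &#8211; all\ninside the prison"
)
print(cleaned)
```

In Python 3 the returned `str` is already unicode, so it can be passed to a database driver as a bound parameter, assuming the connection and column use a unicode-capable encoding such as UTF-8.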

This post is based on a similar question on Stack Overflow about the best way to 'clean up' HTML text in Python: https://stackoverflow.com/questions/32131901/
