
python - Selecting <p> tags separated by \n

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:56:31

I'm using Python to parse and clean HTML documents, but they are badly formatted. For example:

<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>

I want to convert <p>\n<p> into <p>, but I can't seem to target an arbitrary amount of whitespace between the <p> tags.

What I have tried so far:

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
html = re.sub(re.compile("<p>\\n+<p>", "<p>", html))

However, this fails.

Best answer

Use the following approach:

import re

html = "<p>\n<p>\n    Python initially inherited its parsing from C.  While this has been\ngenerally useful, there are some remnants which have been less useful\nfor Python, and should be eliminated.</p>\n</p>"
# First alternative: a doubled opening <p>; second: a doubled closing </p>.
html = re.sub(r'<p>[\n\s]+<p>[\n\s]*|<(\/)p>[\n\s]+<\/p>[\n\s]*', r"<\1p>", html)

print(html)

Output:

<p>Python initially inherited its parsing from C.  While this has been
generally useful, there are some remnants which have been less useful
for Python, and should be eliminated.</p>

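The replacement r"<\1p>" reuses the closing-tag slash / captured by the first group, <(\/)p>, when that alternative matches. When the opening-tag alternative matches instead, the group is unmatched, so the replacement is simply <p> (Python 3.5 and later substitute an unmatched group with an empty string).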
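
To make the role of the capture group easier to see, here is a minimal sketch of the same idea wrapped in a function. It assumes Python 3.5 or later. The helper name `collapse_nested_p` and the constant `NESTED_P` are made up for illustration and are not part of the original answer. The character class is shortened to `\s`, since `\s` already matches `\n`.

```python
import re

# Hypothetical helper, not from the original answer: collapses a doubled
# opening or closing <p> tag whose two halves are separated only by whitespace.
NESTED_P = re.compile(r"<p>\s+<p>\s*|<(/)p>\s+</p>\s*")

def collapse_nested_p(html):
    # Group 1 holds "/" only when the closing-tag alternative matched;
    # otherwise it is unmatched and substitutes as "" (Python 3.5+).
    return NESTED_P.sub(r"<\1p>", html)

print(collapse_nested_p("<p>\n<p>\n    text</p>\n</p>"))   # <p>text</p>
print(collapse_nested_p("<p> \n\t <p>text</p>  \n </p>"))  # <p>text</p>
```

As for the attempt in the question: it passes the replacement and the input string to `re.compile` instead of `re.sub`, which raises a `TypeError`, and `\n+` only matches newlines, so it would not cover spaces or indentation between the tags.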

Regarding "python - Selecting <p> tags separated by \n", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42179469/
