
python - Parsing links and strings from unstructured HTML data

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:18:22

I have an HTML string that looks like this:

        <p>
Type: <a href="wee.html">Tough</a><br />

Main Type:
<a href='abnormal.html'>Abnormal</a> <br />


Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />

Movement type: <a href=forward.html">Forward</a><br />

Level: <a href="beginner.html">Beginner</a><br />
Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/> <span style="float:left;">Your Rating:&nbsp;</span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">Login to rate</a></div><br />

</p>

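In other words, it is somewhat unstructured. I would like to first detect the labels Type, Main Type, and so on, together with their links (and the link text). I tried detecting the words with a regular expression, but that got me nowhere. How can I handle such unreliable data?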

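Best answer

If the categories (Type, Force, and so on) are known in advance, it is probably easiest to prepare a list of them up front and look each one up in turn.

Code: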

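from bs4 import BeautifulSoup as bsoup
import re

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning.
with open("test.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

categories = ["Type:", "Main Type:", "Wheel:", "Movement type:", "Level:", "Sport:", "Force:"]
for category in categories:
    # Find the text node holding the label; the node right after it is the first <a>.
    # string= needs bs4 4.4 or later; older releases spell this argument text=.
    f = soup.find(string=re.compile(category)).next_sibling
    string = f.get_text()
    ref = f.get("href")
    print("%s %s (%s)" % (category, string, ref))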

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.2s]

Let me know if this helps.
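For anyone who wants to try the lookup without saving a file first, here is a minimal sketch that parses an inline string instead. The shortened HTML, the helper name first_link_after, and the made-up label Colour: are assumptions for illustration only. It also anchors the pattern at the start of the text node so that Type: cannot accidentally match inside Main Type:, and it returns None instead of raising when a label or its link is missing.

```python
import re
from bs4 import BeautifulSoup, Tag

# Shortened copy of the markup from the question, one link per label.
HTML = """<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a> <br />
Level: <a href="beginner.html">Beginner</a><br />
Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/>
</p>"""


def first_link_after(soup, label):
    """Return (text, href) of the <a> directly after the label, or None."""
    # ^\s* lets the label be preceded only by whitespace inside its text node.
    pattern = re.compile(r"^\s*" + re.escape(label))
    node = soup.find(string=pattern)
    if node is None:
        return None  # label not present in the document
    link = node.next_sibling
    if not (isinstance(link, Tag) and link.name == "a"):
        return None  # label present, but not followed by a link
    return link.get_text(strip=True), link.get("href")


soup = BeautifulSoup(HTML, "html.parser")
# "Colour:" is a hypothetical label that does not occur in the markup.
labels = ["Type:", "Main Type:", "Level:", "Sport:", "Force:", "Colour:"]
results = {label: first_link_after(soup, label) for label in labels}
for label, found in results.items():
    print(label, found)
```

Returning None for missing labels keeps the loop going on pages where some categories are absent, which seems likely with data this irregular.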

Edit:

This version handles Wheel correctly when the label is followed by more than one element.

Code:

from bs4 import BeautifulSoup as bsoup
import re

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning.
with open("unstructured.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

categories = ["Type:", "Main Type:", "Wheel:", "Movement type:", "Level:", "Sport:", "Force:"]
for category in categories:
    # string= needs bs4 4.4 or later; older releases spell this argument text=.
    f = soup.find(string=re.compile(category)).next_sibling
    if category != "Wheel:":
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))
    else:
        # Wheel: is followed by several links; collect them until a non-<a> tag shows up.
        wheel_list = []
        while f is not None and f.name == "a":
            content = f.contents[0]
            res = f.get("href")
            wheel_list.append("%s (%s)" % (content, res))
            f = f.find_next()  # next tag in document order
        ref = ", ".join(wheel_list)
        print("%s %s" % (category, ref))

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html), Squared (squared.html), Triangled (triangle.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.3s]

Let us know if this helps.
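If other categories can also carry several links, the Wheel special case could be dropped by collecting every consecutive link for every label. The following is only a sketch under that assumption: the helper name links_after and the shortened inline HTML are made up for illustration, and it treats the first non-link tag (here a br) as the end of a category.

```python
import re
from bs4 import BeautifulSoup, Tag

# Shortened copy of the markup from the question, including the multi-link label.
HTML = """<p>
Type: <a href="wee.html">Tough</a><br />
Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />
Level: <a href="beginner.html">Beginner</a><br />
</p>"""


def links_after(soup, label):
    """Collect (text, href) for each <a> between the label and the next non-link tag."""
    node = soup.find(string=re.compile(r"^\s*" + re.escape(label)))
    links = []
    if node is None:
        return links  # unknown label: report no links rather than raising
    for sibling in node.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip text separators such as "," and whitespace
        if sibling.name != "a":
            break  # a <br> (or any other tag) ends this category
        links.append((sibling.get_text(strip=True), sibling.get("href")))
    return links


soup = BeautifulSoup(HTML, "html.parser")
for label in ["Type:", "Wheel:", "Level:"]:
    found = links_after(soup, label)
    print(label, ", ".join("%s (%s)" % (text, href) for text, href in found))
```

Iterating over next_siblings rather than calling find_next() keeps the search inside the same parent element, so a link nested somewhere later in the page is not picked up by mistake.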

This post is based on the Stack Overflow question "python - Parsing links and strings from unstructured HTML data": https://stackoverflow.com/questions/22885295/
