
python - How do I extract the data I need from an HTML file?

Reposted. Author: 行者123. Last updated: 2023-11-28 20:00:18

Here is my HTML:

p_tags = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
<font class="test-proof">Also</font> bar<br />
<font class="test-proof">foo style</font> hand <br />
<font class="test-proof">bar style</font> ball<br />
<font class="test-proof">foo position</font> bak<br />
<br class="bar" />
</p>'''

Here is my Python code, using Beautiful Soup:

def get_info(p_tags):
    """Returns brief information."""

    head_list = []
    detail_list = []
    # This works fine
    for head in p_tags.findAll('font', 'test-proof'):
        head_list.append(head.contents[0])

    # Some problem with this?
    for index in xrange(2, 30, 4):
        detail_list.append(p_tags.contents[index])

    return dict([(l, detail_list[head_list.index(l)]) for l in head_list])

I get the correct head_list from the HTML, but detail_list does not work.

head_list = [u'Full name', u'Born', u'Current age', u'Major teams', u'Also', u'foo style', u'bar style', u'foo position']

I want something like this:

{
    'Full name': 'Foobar',
    'Born': 'July 7, 1923, foo, bar',
    'Current age': '27 years 226 days',
    'Major teams': 'Japan, Jakarta, bazz, foo, foobazz',
    'Also': 'bar',
    'foo style': 'hand',
    'bar style': 'ball',
    'foo position': 'bak'
}

Any help would be greatly appreciated. Thanks in advance.
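One plausible reason the fixed stride in `xrange(2, 30, 4)` misses values can be sketched with the standard library's `html.parser` instead of Beautiful Soup. The `ChildLister` class below is a hypothetical helper, not part of either library, and the child list that Beautiful Soup actually builds may differ in detail. The sketch assumes whitespace-only text between tags is kept as a child node. Under that assumption, the `<span>` elements in the "Major teams" row add extra children, so the values stop sitting four nodes apart.

```python
from html.parser import HTMLParser

P_TAGS = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
<font class="test-proof">Also</font> bar<br />
<font class="test-proof">foo style</font> hand <br />
<font class="test-proof">bar style</font> ball<br />
<font class="test-proof">foo position</font> bak<br />
<br class="bar" />
</p>'''


class ChildLister(HTMLParser):
    """Hypothetical helper: record the direct children of the outer <p>."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # 0 = outside <p>, 1 = directly inside it
        self.children = []   # "<tag>" markers and raw text nodes, in order

    def handle_starttag(self, tag, attrs):
        if self.depth == 1:
            self.children.append("<%s>" % tag)
        self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are children but never nest.
        if self.depth == 1:
            self.children.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:
            self.children.append(data)


lister = ChildLister()
lister.feed(P_TAGS)
lister.close()

# Same indices as the question's loop (range replaces Python 2's xrange).
indices = list(range(2, 30, 4))
picked = [lister.children[i] for i in indices]
for i, node in zip(indices, picked):
    print(i, repr(node))
```

In this sketch the first three picks are the expected values, the next three are single-space text nodes sitting between the `<span>` tags, and the last one is a `<font>` element. The loop also yields only seven items for eight labels, so pairing by position cannot work for this markup.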

Best answer

I started answering this before I realized you were using Beautiful Soup, but here is a parser, written with the HTMLParser library, that I believe works for your example string:

# Python 2 code (the HTMLParser module was renamed html.parser in Python 3).
from HTMLParser import HTMLParser

results = {}

class myParse(HTMLParser):

    def __init__(self):
        self.state = ""
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # A <font class="test-proof"> tag starts a new key.
        attrs = dict(attrs)
        if tag == "font" and attrs.has_key("class") and attrs['class'] == "test-proof":
            self.state = "getKey"

    def handle_endtag(self, tag):
        # Everything after the closing </font> belongs to the value.
        if self.state == "getKey" and tag == "font":
            self.state = "getValue"

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.state == "getKey":
            self.resultsKey = data
        elif self.state == "getValue":
            # Values split across several tags are joined with spaces.
            if results.has_key(self.resultsKey):
                results[self.resultsKey] += " " + data
            else:
                results[self.resultsKey] = data


if __name__ == "__main__":
    p_tags = """<p class="foo-body"> <font class="test-proof">Full name</font> Foobar<br /> <font class="test-proof">Born</font> July 7, 1923, foo, bar<br /> <font class="test-proof">Current age</font> 27 years 226 days<br /> <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br /> <font class="test-proof">Also</font> bar<br /> <font class="test-proof">foo style</font> hand <br /> <font class="test-proof">bar style</font> ball<br /> <font class="test-proof">foo position</font> bak<br /> <br class="bar" /></p>"""
    parser = myParse()
    parser.feed(p_tags)
    print results

This gives the result:

{'foo position': 'bak', 
'Major teams': 'Japan, Jakarta, bazz, foo, foobazz',
'Also': 'bar',
'Current age': '27 years 226 days',
'Born': 'July 7, 1923, foo, bar',
'foo style': 'hand',
'bar style': 'ball',
'Full name': 'Foobar'}
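
The parser above targets Python 2 (the `HTMLParser` module, `has_key`, and the `print` statement). What follows is a minimal sketch of the same key/value state machine for Python 3, not a definitive port. The names `FieldParser` and `parse_fields` are made up for illustration, and the results are kept on the parser instance rather than in a module-level dictionary.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect label/value pairs from <font class="test-proof"> rows."""

    def __init__(self):
        super().__init__()
        self.state = ""     # "", "getKey" or "getValue"
        self.key = None     # label whose value is currently being read
        self.results = {}   # stored per instance instead of as a global

    def handle_starttag(self, tag, attrs):
        # A <font class="test-proof"> tag starts a new key.
        if tag == "font" and dict(attrs).get("class") == "test-proof":
            self.state = "getKey"

    def handle_endtag(self, tag):
        # Text after the closing </font> belongs to the value.
        if self.state == "getKey" and tag == "font":
            self.state = "getValue"

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.state == "getKey":
            self.key = data
        elif self.state == "getValue":
            # Values split across <span> tags are joined with spaces.
            if self.key in self.results:
                self.results[self.key] += " " + data
            else:
                self.results[self.key] = data


def parse_fields(html):
    """Hypothetical convenience wrapper: feed the HTML, return the dict."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.results


P_TAGS = '''<p class="foo-body">
<font class="test-proof">Full name</font> Foobar<br />
<font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
<font class="test-proof">Current age</font> 27 years 226 days<br />
<font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
<font class="test-proof">Also</font> bar<br />
<font class="test-proof">foo style</font> hand <br />
<font class="test-proof">bar style</font> ball<br />
<font class="test-proof">foo position</font> bak<br />
<br class="bar" />
</p>'''

info = parse_fields(P_TAGS)
print(info)
```

Because Python 3.7+ dictionaries preserve insertion order, the printed keys follow document order here, whereas the Python 2 output above is in arbitrary order. The key/value pairs themselves are the same.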

Regarding "python - How do I extract the data I need from an HTML file?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/560936/
