
python - BeautifulSoup: grouping multiple tables inside a main table

Reposted. Author: 行者123. Updated: 2023-12-01 08:12:20

I am using BeautifulSoup to parse an HTML document with the following structure:

<table>
<tr>
<th>Thread</th>
<td> (555EEE555)<br/>
<table>
<tr>
<th>Participants</th>
<td>John Doe<br/>Jane Doe<br/>
</td>
</tr>
</table><br/><br/>
<table>
<tr>
<th>Author</th>
<td>John Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-16 19:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Test message with some body text<br/>
</td>
</tr>
</table><br/>
<table>
<tr>
<th>Author</th>
<td>Jane Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-17 08:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Second test message with some body text<br/>
</td>
</tr>
</table><br/>
</td>
</tr>
</table>

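This message structure repeats throughout the document. I need to parse the individual messages by grouping the Author, Sent and Body tables together. This is the code I have so far: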

from bs4 import BeautifulSoup

with open(path) as g:
    soup = BeautifulSoup(g, 'html.parser')

table_parent = soup.find('td')

for idx, i in enumerate(table_parent.find_all('table', recursive=False)):
    for x in i.find_all('table'):
        print('key: %s | data: %s' % (x.th.get_text(), x.td.get_text()))

It prints the following:

key: Current Participants | data: John DoeJane Doe
key: Author | data: John Doe
key: Sent | data: 2017-10-16 19:03:23 UTC
key: Body | data: Test message with some body text

How can I write code that loops over the whole document and groups each Author, Sent and Body set correctly, so that every individual message is parsed?

Best answer

I am assuming that there is always one main table acting as the parent.

You should be able to do this:

from bs4 import BeautifulSoup as soup

html = """<table>
<tr>
<th>Thread</th>
<td> (555EEE555)<br/>
<table>
<tr>
<th>Participants</th>
<td>John Doe<br/>Jane Doe<br/>
</td>
</tr>
</table><br/><br/>
<table>
<tr>
<th>Author</th>
<td>John Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-16 19:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Test message with some body text<br/>
</td>
</tr>
</table><br/>
<table>
<tr>
<th>Author</th>
<td>Jane Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-17 08:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Second test message with some body text<br/>
</td>
</tr>
</table><br/>
</td>
</tr>
</table>"""

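def _get_obj():
    # Fresh, empty record for one message.
    r = {
        'Author': '',
        'Sent': '',
        'Body': ''
    }
    return r

page = soup(html, 'html.parser')

main_table = page.find('table')
result = []
r = _get_obj()

# find_all() is recursive, so this visits every nested table in document order.
# string= is the current name of the older text= argument.
for t in main_table.find_all('table'):
    if t.find('th', string='Author'):
        r['Author'] = t.find('td').get_text()
    if t.find('th', string='Sent'):
        r['Sent'] = t.find('td').get_text()
    if t.find('th', string='Body'):
        r['Body'] = t.find('td').get_text()
        # Body is the last table of a message: store the record and start a new one.
        result.append(r)
        r = _get_obj()

print(result)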

Output:

[
{'Author': 'John Doe', 'Sent': '2017-10-16 19:03:23 UTC\n', 'Body': 'Test message with some body text\n'},
{'Author': 'Jane Doe', 'Sent': '2017-10-17 08:03:23 UTC\n', 'Body': 'Second test message with some body text\n'}
]
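
The values above still carry the trailing newline from the markup, and a record is only stored once a Body table is seen. A variation that may be easier to reuse is sketched below. It walks every th in the document, starts a new record whenever an Author header appears, and strips whitespace with get_text(strip=True). The names group_messages, FIELDS and sample are illustrative and not part of the original answer, and the sketch assumes that each message begins with its Author table.

```python
from bs4 import BeautifulSoup

# Header labels taken from the question's HTML.
FIELDS = ("Author", "Sent", "Body")


def group_messages(html):
    """Return one dict per message, in document order (hypothetical helper)."""
    page = BeautifulSoup(html, "html.parser")
    messages = []
    current = None
    for th in page.find_all("th"):
        label = th.get_text(strip=True)
        if label not in FIELDS:
            continue  # skips Thread, Participants and any other header
        td = th.find_next_sibling("td")
        value = td.get_text(" ", strip=True) if td is not None else ""
        # Assumption: an Author header marks the start of a new message.
        if label == "Author" or current is None:
            current = dict.fromkeys(FIELDS, "")
            messages.append(current)
        current[label] = value
    return messages


sample = """
<table><tr><th>Thread</th><td> (555EEE555)<br/>
<table><tr><th>Participants</th><td>John Doe<br/>Jane Doe<br/>
</td></tr></table><br/><br/>
<table><tr><th>Author</th><td>John Doe<br/></td></tr></table>
<table><tr><th>Sent</th><td>2017-10-16 19:03:23 UTC<br/>
</td></tr></table>
<table><tr><th>Body</th><td>Test message with some body text<br/>
</td></tr></table><br/>
<table><tr><th>Author</th><td>Jane Doe<br/></td></tr></table>
<table><tr><th>Sent</th><td>2017-10-17 08:03:23 UTC<br/>
</td></tr></table>
<table><tr><th>Body</th><td>Second test message with some body text<br/>
</td></tr></table><br/>
</td></tr></table>
"""

messages = group_messages(sample)
for message in messages:
    print(message)
```

Because the loop runs over the whole parsed page rather than a single main table, it should also pick up messages from further Thread tables if the structure repeats later in the document.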

Regarding python - BeautifulSoup: grouping multiple tables inside a main table, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55165632/
