
Python web scraping - HTML parsing

Reposted. Author: 行者123. Last updated: 2023-12-01 01:18:21

I am trying to extract the system status messages from the Nasdaq website. Here is part of the page source:

</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>

I would like output like this:

System Status Messages
11:56:46 Systems are operating normally

This is how I fetch the page content:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])

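This returns a lot of unwanted content. What is the best way to clean it up, especially the row that contains the actual system message? At the moment it looks like this: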

<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>

Thanks!

Best answer

You can iterate over the td tags:

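from bs4 import BeautifulSoup as soup
# content is assumed to hold the page HTML, e.g. content = urlopen(url).read()
s = soup(content, 'html.parser')
# Keep the text of the first and last <td>: the time and the status message
_start, *_, _end = [i.text for i in s.find_all('td')]
results = f'{s.h2.text}\n{_start} {_end}'
print(results)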

Output:

System Status Messages
11:56:46 ET Systems are operating normally
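
The unpacking above keeps the first and last td found anywhere in the parsed HTML, so it relies on the table holding a single status row. As a rough sketch, assuming the markup shown in the question (a div with id divSSTAT whose data rows have the time in the first cell and the status in the last), the search can be limited to that div and applied row by row. The function name status_lines and the trimmed sample string are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the page source in the question.
content = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td>11:56:46 ET</td><td>NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
"""

def status_lines(html):
    """Return the heading followed by one 'time status' line per data row."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="divSSTAT")
    if container is None:
        return []  # the status block is missing from this page
    # The <h2> heading sits just before the div in the question's markup.
    heading = container.find_previous("h2")
    lines = [heading.get_text(strip=True)] if heading else []
    for row in container.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows contain only <th> cells
        time_text = cells[0].get_text(strip=True)
        status_text = cells[-1].get_text(strip=True)
        lines.append(f"{time_text} {status_text}")
    return lines

print("\n".join(status_lines(content)))
```

If the live page adds more status rows, each one becomes its own line instead of being collapsed into the first and last cell of the page.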

If you do not want ET in the output, you can use re.sub:

import re
...
results = f'{s.h2.text}\n{re.sub(" [A-Z]+", "", _start)} {_end}'

Output:

System Status Messages
11:56:46 Systems are operating normally
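
The pattern " [A-Z]+" removes any space followed by capital letters wherever it occurs in the string. A slightly stricter variant, sketched here under the assumption that the zone abbreviation is two to four capital letters at the very end of the time cell, anchors the match to the end of the string. The helper name strip_timezone is hypothetical.

```python
import re

def strip_timezone(timestamp):
    """Drop a trailing uppercase zone abbreviation such as ' ET'."""
    # \s+ matches the separating whitespace, $ pins the match to the end.
    return re.sub(r"\s+[A-Z]{2,4}$", "", timestamp)

print(strip_timezone("11:56:46 ET"))
```

A value with no trailing abbreviation is returned unchanged, so the helper is safe to apply to every row.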

For more on Python web scraping - HTML parsing, see the similar question on Stack Overflow: https://stackoverflow.com/questions/54096209/
