
Python code to get a table's HTML data from the page source


I'm new to Python and I'm trying to scrape a website. I can log in to the site and get the HTML page, but I don't need the whole page; I only need the hyperlinks inside a specific table.

I wrote the following code, but it gets all the hyperlinks on the page.

soup = BeautifulSoup(the_page)
for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_Table1'}):
    for link in soup.findAll('a'):
        print link.get('href')

Can anyone help me see where I'm going wrong?

Below is the HTML of the table:

<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
<tr id="ctl00_Main_lvMyAccount_Tr1">
<td id="ctl00_Main_lvMyAccount_Td1">
<table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
<tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
<th id="ctl00_Main_lvMyAccount_Th1"></th>
<th id="ctl00_Main_lvMyAccount_Th2">

<a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
</font>

</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th4">
<a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th5">
<a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th6">
<a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th3"></th>
</tr>


<tr>
<td>

Thanks in advance.

Best answer

Well, here is the correct way to do it:

soup = BeautifulSoup(the_page)
for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_Table1'}):
    for link in table.findAll('a'):  # search for links only in the table
        print link['href']  # get the href attribute

Also, you can skip the outer loop, since there is only one match for the given id:

soup = BeautifulSoup(the_page)
table = soup.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'):  # search for links only in the table
    print link['href']  # get the href attribute

Update: noticed what @DSM said; fixed the missing quotes in the table assignment.
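
For readers on current versions, here is a minimal Python 3 / bs4 sketch of the same approach. It assumes the_page already holds the fetched HTML; the html.parser choice and the href=True filter are my additions, not part of the original answer:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# assumption: the_page is the HTML string fetched after logging in
soup = BeautifulSoup(the_page, 'html.parser')

# Limit the search to the one table with the given id, then
# collect only <a> tags that actually carry an href attribute.
table = soup.find('table', id='ctl00_Main_lvMyAccount_Table1')
if table is not None:
    for link in table.find_all('a', href=True):
        print(link['href'])

Note that in the sample HTML the href values are javascript:__doPostBack(...) calls rather than plain URLs, so what you extract are postback targets, not links you can follow directly.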

Regarding Python code to get a table's HTML data from the page source, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19953593/
