python - How to find elements that match specific criteria in Beautiful Soup

Reposted. Author: 行者123. Updated: 2023-12-01 05:36:12

I'm learning and experimenting with Python (2.7) and Beautiful Soup (3.2.0). I already got some help here with my first problem (Beautiful Soup throws `IndexError`).

Here is the Python code so far:

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

# From the source HTML page, search and store all <div class="date">...</div> and its content
datesDiv = soup.findAll('div', { "class" : "date" })
# Loop through the tag and store only the needed information, being the actual date
dates = [tag.contents[0] for tag in datesDiv]

# From the source HTML page, search and store all <span class="time">...</span> and its content
timesSpan = soup.findAll('span', { "class" : "time" })
# Loop through the tag and store only the needed information, being the actual times
times = [tag.contents[0] for tag in timesSpan]

# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
# if tag.contents[1] != 'Dutch KNVB Beker' - Do a direct test if output is needed or not
hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']

# From the source HTML page, search and store all <td class="away">..</td> and its content
# [1:] at the end skips the first match found
awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]

# From the source HTML page, search and store all <a class="broadcast" href="...">..</a> and its content
broadcastsA = soup.findAll('a', { "class" : "broadcast" })
# Loop through the tag and store only the needed information, being the broadcast URL, where we can find the streams
broadcasts = [tag['href'] for tag in broadcastsA]

The problem I have is that the resulting lists are not the same length:

len(dates)       # 9, should be 6
len(times)       # 18, should be 12
len(hometeams)   # 6, is correct
len(awayteams)   # 6, is correct
len(broadcasts)  # 9, should be 6

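The cause is the search I use to build the dates list: soup.findAll('div', { "class" : "date" }). This obviously gives me every <div> element with class="date". But I only need the date elements from rows that also contain a <td> with class="away".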

Here is part of the HTML I'm scraping:

<tr class="odd">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
<a href="/competition.php?part=sports&amp;competitionid=92&amp;discipline=football">Dutch Cup</a>
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380054900">Tuesday, September 24</div> <!-- This date is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380054900">22:35</span> - <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380058500">23:35</span> <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
</td>
<td class="home" colspan="3">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>Dutch KNVB Beker<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221554&amp;part=sports">Live</a> <!-- This href is not needed, because within this <tr> there is no <td class="away"> -->
</td>
</tr>
<tr class="even">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
<a href="/competition.php?part=sports&amp;competitionid=92&amp;discipline=football">Dutch Cup</a>
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380127500">Wednesday, September 25</div> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380127500">18:45</span> - <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380134700">20:45</span> <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
<td class="home">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>PSV<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3"/>
</td>
<td>vs.</td>
<td class="away">
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428"/>Stormvogels Telstar<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221555&amp;part=sports">Live</a> <!-- This href we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
</tr>

Best answer

How about rethinking the way you scrape the data? You have a table of matches, so just iterate over its rows:

for tr in soup.findAll('tr', {'class': ['odd', 'even']}):
    home_team = tr.find('td', {'class': 'home'}).text
    # Skip the competition row, which has no away team
    if home_team == 'Dutch KNVB Beker':
        continue

    away_team = tr.find('td', {'class': 'away'}).text
    # Join the start and end times of the match, e.g. "18:45 - 20:45"
    date = ' - '.join([span.text for span in tr.findAll('span', {'class': 'time'})])
    broadcast = tr.find('a', {'class': 'broadcast'})['href']

    print home_team, away_team, date, broadcast

This prints 5 lines:

RKC Waalwijk Heracles 20:45 - 22:45 /broadcast.php?matchid=221553&part=sports
PSV Stormvogels Telstar 18:45 - 20:45 /broadcast.php?matchid=221555&part=sports
Ajax FC Volendam 20:45 - 22:45 /broadcast.php?matchid=221556&part=sports
SC Heerenveen FC Twente 18:45 - 20:45 /broadcast.php?matchid=221558&part=sports
Feyenoord FC Dordrecht 20:45 - 22:45 /broadcast.php?matchid=221559&part=sports

You can then collect the results into a list of dictionaries.
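As a rough sketch of that last step, the row-by-row loop could build the list of dictionaries directly. This version makes several assumptions that are not in the original post: it targets Python 3 with the `bs4` package (the question uses Python 2.7 and Beautiful Soup 3), the names `SAMPLE_HTML` and `collect_matches` and the dictionary keys are hypothetical, and the sample markup is a trimmed copy of the two rows quoted in the question. It also skips a row when it has no <td class="away">, which is the condition the question asks for, rather than comparing the home team name.

```python
from bs4 import BeautifulSoup

# Trimmed sample modeled on the two rows quoted in the question (assumption:
# the real page has the same class names on every match row).
SAMPLE_HTML = """
<table>
<tr class="odd">
  <td><div class="date" rel="1380054900">Tuesday, September 24</div>
      <span class="time">22:35</span> - <span class="time">23:35</span></td>
  <td class="home" colspan="3">Dutch KNVB Beker</td>
  <td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221554&amp;part=sports">Live</a></td>
</tr>
<tr class="even">
  <td><div class="date" rel="1380127500">Wednesday, September 25</div>
      <span class="time">18:45</span> - <span class="time">20:45</span></td>
  <td class="home">PSV</td>
  <td>vs.</td>
  <td class="away">Stormvogels Telstar</td>
  <td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221555&amp;part=sports">Live</a></td>
</tr>
</table>
"""


def collect_matches(html):
    """Return one dict per table row that contains a <td class="away">."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for tr in soup.find_all("tr", class_=["odd", "even"]):
        away = tr.find("td", class_="away")
        if away is None:
            continue  # row without an away team, so not a complete match record
        home = tr.find("td", class_="home")
        date = tr.find("div", class_="date")
        link = tr.find("a", class_="broadcast")
        matches.append({
            "date": date.get_text(strip=True) if date else None,
            "times": [s.get_text(strip=True) for s in tr.find_all("span", class_="time")],
            "home": home.get_text(strip=True) if home else None,
            "away": away.get_text(strip=True),
            "broadcast": link["href"] if link else None,
        })
    return matches


matches = collect_matches(SAMPLE_HTML)
for match in matches:
    print(match)
```

Because every field is read from the same <tr>, the values for one match stay together, and the length mismatch between separate lists described in the question cannot occur.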

For "python - How to find elements that match specific criteria in Beautiful Soup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18990848/
