python - Splitting html from scraped data (Python+BeautifulSoup4)

Reposted. Author: 行者123. Updated: 2023-12-01 07:52:33

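I'm having trouble scraping the text inside a tag without also getting all the other HTML data. Here is my Python code. The text I want to scrape is not inside a span with a class; it sits on its own inside the <a> tag. Here is an example of where the text is placed.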

<a href="/counterstrike/rankings/team-details/32537">
<span class="ranking">49</span>
<span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
TEXT-I-WANT-TO-SCRAPE
<span class="elo">1103</span>
</a>

If I use ".text.encode('utf8').lstrip().rstrip()", I still get data like this:

print(textt)
'49\n \n\n\n TEXT-I-WANT-TO-SCRAPE \n \n 1103'

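My question is: how do I get only the text that sits directly inside the tag?

Scraping the elo and the ranking is no problem, because they are contained in spans with specific classes.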
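The situation described above can be reproduced offline. This is only a minimal sketch: it assumes a copy of the sample markup from the question (with the quote on the title attribute closed) and the built-in "html.parser" backend, and the names SAMPLE, link and everything are made up for the illustration. It shows that the span lookups return clean values, while .text on the <a> tag joins the strings of every descendant.

```python
from bs4 import BeautifulSoup

# Copy of the sample markup from the question (assumption: the title quote is closed).
SAMPLE = """<a href="/counterstrike/rankings/team-details/32537">
<span class="ranking">49</span>
<span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
TEXT-I-WANT-TO-SCRAPE
<span class="elo">1103</span>
</a>"""

soup = BeautifulSoup(SAMPLE, "html.parser")
link = soup.find("a")

# The spans have their own classes, so these lookups are clean.
ranking = link.find("span", {"class": "ranking"}).text.strip()
elo = link.find("span", {"class": "elo"}).text.strip()

# .text on the <a> tag concatenates the strings of all descendants,
# so the ranking and the elo come along with the team name.
everything = link.text.strip()

print(ranking, elo)
print(repr(everything))
```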

def get_matches():
    matches = get_parsed_page("https://www.gosugamers.net/counterstrike/rankings")
    rankings = matches.find("ul", {"class": "ranking-list"})
    matchdays = rankings.find_all("li")

    for match in matchdays:
        matchDetails = match.find_all("a")

        for getMatch in matchDetails:
            elo = match.find("span", {"class": "elo"}).text.encode('utf8').lstrip().rstrip()
            ranking = match.find("span", {"class": "ranking"}).text.encode('utf8').lstrip().rstrip()
            textt = match.find("a").text.encode('utf8').lstrip().rstrip()

            print(ranking,elo,textt)

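Best regards

Best answer

Use next_element to step from a tag to the element that follows it and read its text. Try the code below. It uses a regular expression to find only the links whose href points at a team-details page.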

from bs4 import BeautifulSoup
import requests
import re

# Download the rankings page and parse it
data = requests.get("https://www.gosugamers.net/counterstrike/rankings").text
soup = BeautifulSoup(data, 'html.parser')

# Keep only the team links: their href contains the team-details path
for a in soup.find_all('a', href=re.compile("/counterstrike/rankings/team-details")):
    ranking = a.find('span', class_='ranking').text.replace('\n', '').strip()
    # Four hops from the ranking span: its own text, the whitespace after it,
    # the flag span, and finally the bare team name
    name = a.find('span', class_='ranking').next_element.next_element.next_element.next_element.replace('\n', '').strip()
    elo = a.find('span', class_='elo').text.replace('\n', '').strip()
    print(ranking, name, elo)

Output:

1 Astralis 1505
2 Team Liquid 1469
3 ENCE eSports 1402
4 Vitality 1365
5 AVANGAR 1326
6 Natus Vincere 1298
7 Ninjas in Pyjamas 1294
8 fnatic 1292
9 MiBR 1269
10 FURIA 1264
11 mousesports 1258
12 Renegades 1252
13 NRG eSports 1248
14 ORDER 1240
15 Grayhound Gaming 1237
16 Valiance 1235
17 Windigo 1228
18 FaZe Clan 1222
19 North 1220
20 G2 Esports 1213
21 OpTic Gaming 1201
22 MVP PK 1196
23 Heroic 1183
24 Chiefs eSports Club 1177
25 3DMAX.CS 1173
26 HellRaisers 1168
27 Rogue 1167
28 BIG 1165
29 forZe 1165
30 Ghost Gaming 1159
31 Swole Patrol 1154
32 TyLoo 1151
33 Red Reserve 1142
34 Isurus Gaming 1142
35 Team Kinguin 1136
36 Tainted Minds 1135
37 Movistar Riders 1134
38 NoChance 1134
39 DETONA Gaming 1132
40 Space Soldiers 1120
41 Bravado Gaming 1117
42 BPro Gaming 1116
43 Cloud9 1116
44 GamerLegion 1113
45 CyberZen 1111
46 Epsilon 1111
47 CLG Red 1107
48 Luminosity Gaming 1107
49 devils.one 1103
50 Sprout 1096
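The chain of four next_element hops depends on the exact order of the nodes inside the link, so it may break if the markup changes. As a possible alternative (not the answer's method), the sketch below keeps only the strings that are direct children of the <a> tag. It runs offline against a copy of the sample markup from the question, with the title quote closed; SAMPLE, direct_text and parse_rows are hypothetical names introduced for this illustration.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Copy of the sample markup from the question (assumption: the title quote is closed).
SAMPLE = """<a href="/counterstrike/rankings/team-details/32537">
<span class="ranking">49</span>
<span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
TEXT-I-WANT-TO-SCRAPE
<span class="elo">1103</span>
</a>"""


def direct_text(tag):
    """Join the strings that are direct children of tag, ignoring nested tags."""
    parts = (child.strip() for child in tag.children
             if isinstance(child, NavigableString))
    # Whitespace-only strings between the spans strip down to "" and are dropped.
    return " ".join(part for part in parts if part)


def parse_rows(html):
    """Return (ranking, name, elo) tuples for every team-details link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.find_all("a", href=re.compile("/counterstrike/rankings/team-details")):
        ranking = a.find("span", class_="ranking").text.strip()
        elo = a.find("span", class_="elo").text.strip()
        rows.append((ranking, direct_text(a), elo))
    return rows


for ranking, name, elo in parse_rows(SAMPLE):
    print(ranking, name, elo)
```

Because direct_text only looks at the tag's own children, the text inside the ranking and elo spans never reaches the name, however many spans come before or after it.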

Regarding python - Splitting html from scraped data (Python+BeautifulSoup4), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56130493/
