
python - Scraping page links with selenium always returns a limited number of links


I want to scrape all the match links from this page, "https://m.aiscore.com/basketball/20210610", but I only get a limited number of matches.
I tried this code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)

driver.maximize_window()
driver.implicitly_wait(60)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

soup = BeautifulSoup(driver.page_source, 'html.parser')

links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links)  # always returns 16
driver.quit()
When I run the code, I always get only 16 match links, but the page has 35 matches.
I need to get all the match links on the page.
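
A quick way to confirm that the page lazy-loads matches on scroll is to count the rendered links before and after a single scroll. This is a minimal sketch, reusing the selector and driver setup from the question (the chromedriver path is a placeholder):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

driver.get('https://m.aiscore.com/basketball/20210610')
time.sleep(2)  # let the initial batch of matches render

# Count links before scrolling
before = len(BeautifulSoup(driver.page_source, 'html.parser').select('.w100.flex a'))

# Scroll once, then count again after the new content has had time to load
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)
after = len(BeautifulSoup(driver.page_source, 'html.parser').select('.w100.flex a'))

print(before, after)  # a larger second number confirms content loads on scroll
driver.quit()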

Best Answer

Since the site loads content as you scroll, I scroll one screen at a time until the height we need to scroll to is greater than the page's total scroll height.
I used a set to store the match links, to avoid adding links that are already there.
When I ran this, I was able to find all the links. Hope this works for you too.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)
# Wait till the webpage is loaded
time.sleep(2)

# Seconds to wait after each scroll
scroll_wait = 1

# Gets the screen height
screen_height = driver.execute_script("return window.screen.height;")
driver.implicitly_wait(60)

# Number of scrolls. Initially 1
ScrollNumber = 1

# Set to store all the match links
ans = set()

while True:
    # Scroll down one screen at a time
    driver.execute_script(f"window.scrollTo(0, {screen_height * ScrollNumber})")
    ScrollNumber += 1

    # Wait for some time after each scroll
    time.sleep(scroll_wait)

    # Updating the scroll_height after each scroll
    scroll_height = driver.execute_script("return document.body.scrollHeight;")

    # Fetching the data that we need - links to matches
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for j in soup.select('.w100 .flex a'):
        if j['href'] not in ans:
            ans.add(j['href'])

    # Break when the height we need to scroll to is larger than the scroll height
    if screen_height * ScrollNumber > scroll_height:
        break


print(f'Links found: {len(ans)}')
driver.quit()
Output:

Links found: 61
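
As a design note, the loop above stops once the scroll target exceeds the page height it has seen so far. A slightly more robust variant, offered here as a sketch rather than as part of the original answer, keeps scrolling to the bottom until document.body.scrollHeight stops growing between iterations, which also works when each load changes the page height unevenly:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

driver.get('https://m.aiscore.com/basketball/20210610')
time.sleep(2)  # let the initial batch of matches render

last_height = driver.execute_script("return document.body.scrollHeight;")
while True:
    # Jump to the current bottom to trigger the next load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)  # give the newly triggered content time to load
    new_height = driver.execute_script("return document.body.scrollHeight;")
    if new_height == last_height:
        # No new content was appended, so we have reached the real bottom
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'html.parser')
links = {a['href'] for a in soup.select('.w100 .flex a')}
print(f'Links found: {len(links)}')
driver.quit()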

Regarding "python - Scraping page links with selenium always returns a limited number of links", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67944248/
