
python - How do I scrape a website with infinite scrolling?

Reposted. Author: 行者123. Last updated: 2023-12-04 14:51:20

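The code below is what I have so far, but it only extracts data for the first 25 items, which are the items shown on the page before you scroll down to load more: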

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

start_time = time.time()
s = requests.Session()

#Get URL and extract content
response = s.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')
soup = BeautifulSoup(response.text, 'html.parser')

items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]
job_links = [i["href"].strip('\n ') for i in items.find_all('a', {'class': 'base-card__full-link'})]

a = pd.DataFrame({'Job Titles': job_titles})
b = pd.DataFrame({'Job Companies': job_companies})
c = pd.DataFrame({'Job Locations': job_locations})

value_counts1 = a['Job Titles'].value_counts()
value_counts2 = b['Job Companies'].value_counts()
value_counts3 = c['Job Locations'].value_counts()

l1 = [f"{key} - {value_counts1[key]}" for key in value_counts1.keys()]
l2 = [f"{key} - {value_counts2[key]}" for key in value_counts2.keys()]
l3 = [f"{key} - {value_counts3[key]}" for key in value_counts3.keys()]

data = l1, l2, l3
df = pd.DataFrame(
data, index=['Job Titles', 'Job Companies', 'Job Locations'])

df = df.T

print(df)
print("--- %s seconds ---" % (time.time() - start_time))

I want to extract data beyond the first 25 items. Is there an efficient way to do this?

Best answer

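Inspect the page to find the container that holds the data you want. You can then scrape the infinite-scroll page with the Selenium WebDriver, using window.scrollTo() to scroll down so that the page loads further items.

For more, see:

crawl site that has infinite scrolling using python

or web-scraping-infinite-scrolling-with-selenium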
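
The scrolling approach described in the answer could be sketched roughly as follows. This is a minimal sketch, not the answerer's own code: the helper names scroll_to_bottom and scrape_job_titles, the pause and max_rounds parameters, and the choice of Chrome are assumptions made for illustration. The CSS selector reuses the class names from the question's code and assumes the rendered page still uses them. The page may also require a button click or a login to load further results, which this sketch does not handle.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    `driver` is any object with an `execute_script` method, such as a
    Selenium WebDriver. Returns the number of scrolls performed.
    `pause` and `max_rounds` are illustrative defaults, not tuned values.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page requests the next batch of items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new items time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # height unchanged: assume nothing more was loaded
        last_height = new_height
    return rounds


def scrape_job_titles(url):
    """Hypothetical helper: open `url`, scroll to the end, return job titles."""
    # Imported here so scroll_to_bottom can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome is available locally
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        # Class names taken from the question's BeautifulSoup code.
        cards = driver.find_elements(
            By.CSS_SELECTOR,
            "ul.jobs-search__results-list h3.base-search-card__title",
        )
        return [card.text.strip() for card in cards]
    finally:
        driver.quit()
```

Calling scrape_job_titles with the search URL from the question would return whatever titles are present after scrolling stops. Alternatively, driver.page_source can be passed to BeautifulSoup after scroll_to_bottom returns, so the rest of the question's parsing code can stay as it is.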

Regarding "python - How do I scrape a website with infinite scrolling?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/69046183/
