
python - Scraping the first 100 job results from Indeed with BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-11-30 21:57:57

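I am new to web scraping in Python. I want to scrape the first 100 job results from Indeed, but I can only get the first page of results, i.e. the top 10. I am using BeautifulSoup. Here is my code; can anyone help me solve this?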

import urllib2
from bs4 import BeautifulSoup
import json

URL = "https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru%2C+Karnataka"
soup = BeautifulSoup(urllib2.urlopen(URL).read(), 'html.parser')

results = soup.find_all('div', attrs={'class': 'jobsearch-SerpJobCard'})

for x in results:
    company = x.find('span', attrs={"class":"company"})
    print 'company:', company.text.strip()

    job = x.find('a', attrs={'data-tn-element': "jobTitle"})
    print 'job:', job.text.strip()

Best answer

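Change the start value in the URL in steps of 10, one step per page of results. You can build each URL in a loop that increments this value: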

https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru%2C+Karnataka&start=0

https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=10
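
The URL pattern above can also be generated rather than typed by hand. The following is a minimal sketch using only the standard library. It assumes that start is a result offset and that each page holds 10 results, as the question describes. The names page_urls, total and per_page are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.co.in/jobs"


def page_urls(query, location, total=100, per_page=10):
    """Build one search URL per results page.

    Assumption: `start` is the offset of the first result on the page,
    so the offsets are 0, per_page, 2 * per_page, ... below `total`.
    """
    return [
        BASE_URL + "?" + urlencode({"q": query, "l": location, "start": start})
        for start in range(0, total, per_page)
    ]


if __name__ == "__main__":
    for u in page_urls("software developer", "Bengaluru, Karnataka"):
        print(u)
```

urlencode takes care of escaping the spaces and the comma, so the query and location can be passed as ordinary strings.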

For example:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

results = []
url = 'https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start={}'

with requests.Session() as s:
    # start is the result offset; assuming 10 results per page,
    # 0, 10, ..., 90 covers the first 100 results
    for start in range(0, 100, 10):
        res = s.get(url.format(start))
        soup = bs(res.content, 'lxml')
        titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
        companies = [item.text.strip() for item in soup.select('.company')]
        # pair each title with its company and add the pairs to one flat list
        results.extend(zip(titles, companies))

df = pd.DataFrame(results)
df.to_json(r'C:\Users\User\Desktop\data.json')
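
One caveat with collecting titles and companies in two separate lists: if a job card has no company element, zip pairs the remaining entries with the wrong titles. The sketch below reads both fields from the same card instead, reusing the selectors from the question and the answer. The function name parse_cards and the sample markup are made up for illustration, and the selectors may no longer match the live site.

```python
from bs4 import BeautifulSoup


def parse_cards(html):
    """Return one (title, company) tuple per job card on a results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.jobsearch-SerpJobCard"):
        title = card.select_one("[data-tn-element=jobTitle]")
        company = card.select_one(".company")
        # A missing element becomes None instead of raising AttributeError
        rows.append((
            title.get_text(strip=True) if title else None,
            company.get_text(strip=True) if company else None,
        ))
    return rows


# Hypothetical sample markup so the sketch can run without a network call
SAMPLE_HTML = """
<div class="jobsearch-SerpJobCard">
  <a data-tn-element="jobTitle">Software Developer</a>
  <span class="company">Acme</span>
</div>
<div class="jobsearch-SerpJobCard">
  <a data-tn-element="jobTitle">Backend Engineer</a>
</div>
"""

if __name__ == "__main__":
    print(parse_cards(SAMPLE_HTML))
```

In the loop above, results.extend(parse_cards(res.content)) could take the place of the two list comprehensions and the zip call.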

About python - Scraping the first 100 job results from Indeed with BeautifulSoup: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55097699/
