
python - Return the number of scraped results

Repost. Author: 太空宇宙. Updated: 2023-11-04 04:54:26

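I am trying to scrape INDEED.COM. I need Python to return the number of results for a search on the job "data scientist" in the city "Milan". I think this can be done either by extracting the result count displayed on the page, or by counting the search results myself (which is what I attempted in parts 1 and 2 below). This is the first time I have ever used Python, and I need it for a project in which this simple search is the starting point of a business idea. Can you help me write code that returns the number of results? Thank you very much for your help!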

## imports
import requests
import bs4
from bs4 import BeautifulSoup
from urllib.request import urlopen  # needed by the urlopen() calls further down
import pandas as pd
import time

##tell python what I am looking for
URL="""https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20"""
page = requests.get(URL)
soup = BeautifulSoup(page.text,"html.parser")
#print(soup.prettify())

## extract the job title (didn't work)
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

output = extract_job_title_from_result(soup)
print(output)

### 1) count the results
query = "data"
location = "lombardia"
URL_for_count = "https://it.indeed.com/offerte-lavoro?q={}&l={}".format(query, location)
soup_for_count = BeautifulSoup(urlopen(URL_for_count).read(), 'html.parser')
results_number = soup_for_count.find("div", attrs={"id": "searchCount"}).text
number_of_results = int(results_number.split(sep=' ')[-1].replace(',', ''))


### 2) repeat the search over the different pages of Indeed, to get ALL of the results
## number of results shown per page = 10
results = []
i = number_of_results // 10
for page_number in range(i + 1):
    URL_for_results = "https://it.indeed.com/offerte-lavoro?q={}&l={}&start={}".format(query, location, 10 * page_number)
    soup_for_results = BeautifulSoup(urlopen(URL_for_results).read(), 'html.parser')
    # accumulate the listings of every page instead of overwriting them
    results.extend(soup_for_results.find_all('div', attrs={'data-tn-component': 'organicJob'}))
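
The count extraction in part 1 depends on the exact wording of the `searchCount` element, which is not shown here. As a minimal sketch, assuming the total is the last number in that text and may use `.` or `,` as a thousands separator, a helper could pull it out with a regular expression. The name `parse_result_count` and the sample strings are hypothetical.

```python
import re

def parse_result_count(text):
    """Return the last integer found in text, ignoring '.' and ',' separators.

    Assumption: the total is the last number in the string, mirroring the
    split(...)[-1] approach above. Returns None if no number is found.
    """
    # Match digit runs optionally grouped in thousands (e.g. 1.234 or 1,234).
    numbers = re.findall(r"\d+(?:[.,]\d{3})*", text)
    if not numbers:
        return None
    return int(re.sub(r"[.,]", "", numbers[-1]))

# Hypothetical sample strings, not copied from the real page.
print(parse_result_count("Pagina 3 di 1.234 risultati"))  # 1234
print(parse_result_count("Page 3 of 1,234 jobs"))         # 1234
```

Returning `None` rather than raising lets the caller decide what to do when the page layout changes and no count is present.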

Best answer

You can use the find_all method in BeautifulSoup:

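# Python 2 code; in Python 3, use urllib.request.urlopen instead of urllib.urlopen
from bs4 import BeautifulSoup as soup
import urllib
data = str(urllib.urlopen('https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20').read())
listing = soup(data, 'lxml')
# one <h2> per job title; [1:-1] drops the first and last character of its text
jobs = [i.text[1:-1] for i in listing.find_all('h2')]
print(jobs)
print("number of jobs is: {}".format(len(jobs)))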

Output:

[u'Data Scientist', u'Data Scientist', u'Junior Data Analyst', u'Oracle Data Integrator Junior', u'Junior Data Warehouse', u'Data Scientist/Biostatistician', u'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', u'Data Scientist with Machine Learning', u'DATA SCIENTIST- MACHINE LEARNING EXPERT', u'7224 Internal Audit - Quantitative Analyst']

number of jobs is: 10
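
The answer's snippet targets Python 2 and needs network access. As a rough, dependency-free sketch of the same idea (count the `<h2>` headings), the following uses Python 3's standard-library `html.parser` in place of BeautifulSoup. The class name `HeadingCollector` and the sample markup are hypothetical, and the real page structure may differ.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the stripped text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_h2 = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside an <h2>, including nested <a> text.
        if self._in_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.titles.append("".join(self._parts).strip())
            self._in_h2 = False

# Hypothetical markup standing in for a downloaded results page.
sample_html = """
<div class="row"><h2>
<a data-tn-element="jobTitle">Data Scientist</a>
</h2></div>
<div class="row"><h2>
<a data-tn-element="jobTitle">Junior Data Analyst</a>
</h2></div>
"""

collector = HeadingCollector()
collector.feed(sample_html)
print(collector.titles)
print("number of jobs is: {}".format(len(collector.titles)))
```

Here `strip()` plays the role of the answer's `[1:-1]` slice, without assuming exactly one character of padding on each side.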

Edit: to get the data for the first six pages:

# one list of <h2> titles per page; start = 0, 10, ..., 50 selects the first six pages
final_data = [[b.text[1:-1] for b in soup(str(urllib.urlopen("https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start={}".format(10*i)).read()), "lxml").find_all('h2')] for i in range(6)]
lengths = list(map(len, final_data))
print(final_data)
print(lengths)
print(sum(lengths))

Output:

[[u'Data Scientist \u2013 Social Media Intelligence', u'DATA ANALYST', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Analyst', u'Data Entry Specialist', u'Impiegato Data Entry', u'Data Scientist'], [u'Junior Data Scientist', u'DATA ANALYST JR \u2013 Milano', u'STAGE JUNIOR DATA ANALYST / DATA SCIENTIST BIG DATA', u'Machine Learning Scientist', u'Data Analyst', u'Data Analyst (Econometric modeling) Sede di Milano', u'Neolaureati in statistica, matematica, ingegneria-Data Scien...', u'Data Scientist', u'Data Scientist', u'Data Scientist'], [u'Data Scientist', u'Data Scientist', u'Junior Data Analyst', u'Oracle Data Integrator Junior', u'Junior Data Warehouse', u'Data Scientist/Biostatistician', u'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', u'Data Scientist with Machine Learning', u'DATA SCIENTIST- MACHINE LEARNING EXPERT', u'7224 Internal Audit - Quantitative Analyst'], [u'Collaboratori Data Entry', u'Data Scientist', u'DATA ENTRY', u'Consumer Data Scientist', u'DATA ANALYST', u'JUNIOR - RISK ADVISORY - TECHNOLOGY & DATA RISK - PRODUCTS &...', u'Data Manager Ematologia', u'Data Scientist', u'Esperto Tecnologie Big Data \u2013 Text Analysis \u2013 Data Mining', u'Data Entry'], [u'People Data Analyst', u'Data Integration Analyst \u2013 TIBCO', u'ORACLE BI - Big Data Analytics', u'Data Strategist', u'Data Governance Specialist', u'Big Data Specialist', u'Oracle Data Integrator Specialist', u'Innovation Analyst', u'Data Scientist', u'Big Data Engineer'], [u'JUNIOR BIG DATA ENGINEER', u'Junior Payment Analyst', u'Esperti BIG DATa e DWH', u'Data Warehouse Manager', u'Data Analyst', u'Big Data Engineer', u'data entry part time', u'Big Data & Datawarehouse Architect Location: Milano', u'Biomedical Signal/Image Processing Data Analyst', u'IT Big Data Engineer']]
[10, 10, 10, 10, 10, 10]
60
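
The paging logic can also be separated from the downloading. The sketch below, written for Python 3, only builds the per-page URLs from the `start` offset and sums per-page title counts. The helper names `page_urls` and `total_count` are hypothetical, and the 10-results-per-page default is taken from the output above, not from any documented Indeed behaviour.

```python
from urllib.parse import urlencode

BASE_URL = "https://it.indeed.com/offerte-lavoro"

def page_urls(query, location, pages, per_page=10):
    """Build one search URL per results page using the `start` offset."""
    return [
        BASE_URL + "?" + urlencode({"q": query, "l": location, "start": per_page * i})
        for i in range(pages)
    ]

def total_count(pages_of_titles):
    """Sum the number of titles collected on each page."""
    return sum(len(titles) for titles in pages_of_titles)

urls = page_urls("data", "lombardia", pages=6)
print(urls[2])  # https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20
```

Note that `total_count` treats every listing as distinct, so a job that appears on two pages is counted twice.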

For more on python - Return the number of scraped results, see the similar question on Stack Overflow: https://stackoverflow.com/questions/47394281/
