
python - Problems with data scraping using BeautifulSoup4


So basically I'm trying to scrape a job site; my goal is to retrieve the job title, company, salary and location. I plan to write the results to a CSV file so I can do some plotting on them. My current code is:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.cvbankas.lt/?miestas=Vilnius&padalinys%5B0%5D=76&page=1'

#Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#HTML parser
page_soup = soup(page_html, 'html.parser')
# grabs each job ad wrapper
containers = page_soup.findAll('div',{'class':'list_a_wrapper'})

container = containers[0]
print(container.h3)

Which returns:

<h3 class="list_h3" lang="en">Senior Talent Manager</h3>

If I ask for container.h3['class'] it returns ['list_h3'], and if I ask for container.h3['lang'] I get en, but I can't retrieve Senior Talent Manager.
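As an aside: indexing a tag like a dictionary returns only its HTML attributes; the text inside the tag is exposed through BeautifulSoup's .get_text() method (or the .text property). A minimal check against the container above:

print(container.h3['class'])              # ['list_h3'] - the class attribute
print(container.h3['lang'])               # 'en' - the lang attribute
print(container.h3.get_text(strip=True))  # 'Senior Talent Manager' - the text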

Here is the HTML code of the job ad:

<div class="list_a_wrapper">
<div class="list_cell">
<h3 class="list_h3" lang="en">Senior Talent Manager</h3>
<span class="heading_secondary">
<span class="dib mt5">UAB „Omnisend“</span></span>
</div>
<div class="list_cell jobadlist_list_cell_salary">
<span class="salary_c">
<span class="salary_bl salary_bl_gross">
<span class="salary_inner">
<span class="salary_text">
<span class="salary_amount">2300-3300</span>
<span class="salary_period">€/mėn.</span>
</span>
<span class="salary_calculation">Neatskaičius mokesčių</span>
</span>
</span>
<div class="salary_calculate_bl js_salary_calculate_a" data-href="https://www.cvbankas.lt/perskaiciuoti-skelbimo-atlyginima-6732785">
<div class="button_action">Skaičiuoti »</div>
<div class="salary_calculate_text">Į rankas per mėn.</div>
</div>
</span> </div>
<div class="list_cell list_ads_c_last">
<span class="txt_list_1" lang="lt"><span class="list_city">Vilniuje</span></span>
<span class="txt_list_2">prieš 4 d.</span>
</div>
</div>

So which method is best for scraping the title in the h3, the dib mt5, salary_amount, salary_calculation and list_city?

Best Answer

This script will get the job title, company, salary and location from the page:

import requests
from bs4 import BeautifulSoup


url = 'https://www.cvbankas.lt/?miestas=Vilnius&padalinys%5B0%5D=76&page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for h3 in soup.select('h3.list_h3'):
    job_title = h3.get_text(strip=True)
    company = h3.find_next(class_="heading_secondary").get_text(strip=True)
    salary = h3.find_next(class_="salary_amount").get_text(strip=True)
    location = h3.find_next(class_="list_city").get_text(strip=True)
    print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))

Prints:

UAB „Omnisend“                                     2300-3300       Vilniuje        Senior Talent Manager
UAB „BALTIC VIRTUAL ASSISTANTS“                    Nuo 2700        Vilniuje        SENIOR .NET C# DEVELOPER
UAB „Lexita“                                       1200-2500       Vilniuje        IT PROJEKTŲ VADOVAS (-Ė)
UAB „Nordcode technology“                          1200-2000       Vilniuje        PHP developer (mid-level)
UAB „Nordcurrent Group“                            Nuo 2300        Vilniuje        SENIOR VAIZDO ŽAIDIMŲ TESTUOTOJAS
UAB „Inlusion Netforms“                            1500-3500       Vilniuje        Senior C++ Programmer to work with Unreal (UE4) game engine
UAB „Solitera“                                     1200-2800       Vilniuje        Java(Spring Boot) Developer
UAB „Metso Lithuania“                              Nuo 1300        Vilniuje        BI DATA ANALYST
UAB „Atticae“                                      1000-1500       Vilniuje        PHP programuotojas (-a)
UAB „EIS Group Lietuva“                            2000-7000       Vilniuje        SYSTEM ARCHITECT
UAB GF Bankas                                      Nuo 1200        Vilniuje        HelpDesk specialistas (-ė)
Tesonet                                            1000-3000       Vilniuje        Swift Developer (Security Product)
UAB „Mark ID“                                      1000-3000       Vilniuje        Full Stack programuotojas

...and so on.
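A note on how this works: find_next() walks forward through the document from each h3, so the company, salary and location are picked up from the elements that follow that particular heading, in the order shown in the HTML above. This first version assumes every ad lists a salary; the CSV version below guards against ads where the salary element is missing.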

EDIT: To save to csv, you can use this script:

import requests
import pandas as pd
from bs4 import BeautifulSoup


all_data = []
for page in range(1, 9):
    url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=' + str(page)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    for h3 in soup.select('h3.list_h3'):
        job_title = h3.get_text(strip=True)
        company = h3.find_next(class_="heading_secondary").get_text(strip=True)
        # some ads don't list a salary, so guard against a missing element
        salary = h3.find_next(class_="salary_amount")
        salary = salary.get_text(strip=True) if salary else '-'
        location = h3.find_next(class_="list_city").get_text(strip=True)
        print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))

        all_data.append({
            'Job Title': job_title,
            'Company': company,
            'Salary': salary,
            'Location': location
        })

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
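If you would rather not depend on pandas just for the export, a minimal sketch using the standard-library csv module writes the same all_data list:

import csv

# Write the collected rows to data.csv using only the standard library
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Job Title', 'Company', 'Salary', 'Location'])
    writer.writeheader()
    writer.writerows(all_data)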

The saved data.csv, opened in LibreOffice:

[screenshot of data.csv in LibreOffice]

Regarding python - Problems with data scraping using BeautifulSoup4, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63074397/
