
python - Scraping specific data from td tags

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:37:18

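I need to scrape the following data:

  • The name of the company that is hiring
  • The company's location
  • The position the ad is for

This is the site I want to scrape: link. I am able to get the td data, but I need to start from a specific td tag (that is, from this tr tag)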

<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
<td align="right" valign="top" class="title"><span class="rank"></span></td> <td></td><td class="title"><a href="https://mino-games.workable.com/j/69BCF95C8F" class="storylink" rel="nofollow">Mino Games (YC W11) Is Hiring Game Developers in Montreal</a><span class="sitebit comhead"> (<a href="from?site=workable.com"><span class="sitestr">workable.com</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="age"><a href="item?id=20463814">11 hours ago</a></span> </td></tr>

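and then move on through the remaining tags, storing the company name, location, and position in separate variables as I go. I know this is a lot to ask, but I would really appreciate any help you can offer.

Here is what I have tried: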

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'

plain_html_text = requests.get(url);

soup = BeautifulSoup(plain_html_text.text, "html.parser")

table_body = soup.find('tbody')
rows = soup.find('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print (cols)

Best answer

What you are asking for is not a simple problem, but this script can help you get started:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'

plain_html_text = requests.get(url);

soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)

    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'

    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'

    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'

    rows.append([company, position, location])

print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))

Prints:

                     Company                                                          Position                                          Location      
Scale AI                                          engineers to accelerate the development of AI                                   -
Mino Games (YC W11)                               Game Developers                                                                 Montreal
BuildZoom (YC W13)                                – Help us un-break construction                                                 -
Bitmovin (YC S15)                                 a Video Solutions Architect/Software Engineer                                   Brazil
Streak – CRM for Gmail (YC S11)                                                                                                   Vancouver
ZeroCater (YC W11)                                a Director of Engineer                                                          SF
UpCodes (YC S17)                                  engineers to automate compliance for architects                                 -
Tech Nonprofit Upsolve (YC W19)                   a Software Engineer                                                             -
Gitlab (YC W15)                                   an Engineering Manager, Ecosystem                                               -
Saleswhale (YC S16)                               Our First U.S. Strategic Account Executive                                      -
Jerry (YC S17)                                    for a Director of Ops and Growth                                                -
Sourceress (YC S17)                               Product and ML Engineers (Remote OK, No Prior ML OK)                            -
GiveCampus (YC S15)                               a Product Designer who cares about education                                    -
Iris Automation                                   an Account Executive for B2B Flying Vehicle Software                            -
LogDNA (YC W15)                                   Software Engineers – DevOps Monitoring at Scale                                 -
Flexport                                          software engineers to work on our trucking apps                                 Chicago
Mux                                               an ML engineer to help train our machines to deliver better video               -
The Muse (YC W12)                                 a Product Director for Growth                                                   -
OneSignal                                         an SRE to scale our bare-metal infrastructure                                   -
Atomwise (YC W15)                                 a Senior Systems/Cloud Engineer                                                 -
Demodesk (YC W19)                                 Software Engineers                                                              Munich
Gusto                                             for Android and iOS developers to build our native mobile app                   -
Fond (YC W12)                                     an Engineering Manager                                                          Portland
ReadMe (YC W15)                                   – Help us make APIs easy to use                                                 -
Keeper (YC W19)                                   a lead engineer – help save gig workers money on taxes                          -
Asseta (YC S13)                                   a technical lead                                                                -
Tesorio (YC S15)                                  Engineering Managers, Senior Engineers                                          -
Standard Cognition (YC S17)                       – Work on vision systems                                                        Rust
Curebase (YC S18)                                 first sales hire – distributed clinical research                                -
Mashgin (YC W15)                                  a Fullstack SWE Interested                                                      Computer Vision/AI
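
To experiment with the patterns without hitting the site, the three regular expressions from the script can be wrapped in a single function and run on plain strings. This is only a minimal sketch and not part of the original answer: the names parse_title and HIRING are hypothetical, and, unlike the script, it also returns '-' when a pattern matches but captures nothing (the script would leave that column blank). The first title below is the one from the tr row quoted in the question; the others are made-up inputs.

```python
import re

# Keyword alternatives copied from the answer's patterns.
HIRING = r'(?:is hiring|is looking|seeking|hiring)'


def parse_title(title):
    """Split a job title into (company, position, location).

    A part that cannot be found, or that is empty, is returned as '-'.
    """
    def first(pattern):
        found = re.findall(pattern, title, flags=re.I)
        if not found:
            return '-'
        # Assumption of this sketch: an empty capture counts as missing.
        return found[0].strip() or '-'

    company = first(r'^(.*?)' + HIRING)                 # text before the keyword
    position = first(HIRING + r'(.*?)(?=\bin\b|$)')     # keyword up to the first standalone "in"
    location = first(r'(?:\bin\b)(.*)')                 # everything after the first standalone "in"
    return company, position, location


print(parse_title('Mino Games (YC W11) Is Hiring Game Developers in Montreal'))
print(parse_title('Acme Is Hiring in Paris'))   # hypothetical title with no position
print(parse_title('Acme'))                      # hypothetical title with no keyword at all
```

Because the location pattern keys on the first standalone "in", titles such as "... Interested in Computer Vision/AI" are still split at that word, as the last row of the output shows; the heuristic is a starting point rather than a reliable parser.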

Regarding "python - Scraping specific data from td tags", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57090405/
