
python - The correct div class combination for soup.select()


I'm working on some scraping code and it keeps returning errors, which I'm hoping someone else can help with.

First I run this snippet:

import pandas as pd
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup as BShtml  # needed for the BShtml() calls below

base = "http://www.reed.co.uk/jobs"

url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"
r = requests.get(url).content
soup = BShtml(r, "html.parser")

df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
df

I ran the snippet above on the first page of today's job postings. I then pull out the URLs at the bottom of the page so I can work out how many pages exist at that point in time. The regular expressions below take care of that for me:

df['partone'] = df['links'].str.extract('([a-z][a-z][a-z][a-z][a-z][a-z]=[0-9][0-9].)', expand=True)
df['maxlink'] = df['partone'].str.extract('([0-9][0-9][0-9])', expand=True)
pagenum = df['maxlink'][4]
pagenum = pd.to_numeric(pagenum, errors='ignore')
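(As an aside, a possibly cleaner way to get the page count would be to read the pageno query parameter out of each pagination link instead of relying on positional regexes. This is only a sketch: max_page is a made-up helper, and it assumes every pagination link actually carries a pageno parameter.)

# Sketch only: pull the page number straight from the "pageno" query parameter
# of each pagination link (assumes the links actually carry a pageno parameter).
from urllib.parse import urlparse, parse_qs

def max_page(links):
    nums = []
    for link in links:
        params = parse_qs(urlparse(link).query)
        if "pageno" in params:
            nums.append(int(params["pageno"][0]))
    return max(nums) if nums else 1

# pagenum = max_page(df["links"])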

Note the third line of the regex snippet: the page count always sits in the second-to-last of the five URLs in this list. I'm sure there's a more elegant way of doing this (the sketch above is one option), but it gets the job done. I then feed the number taken from the URL into a loop:

result_set = []

loopbasepref = 'http://www.reed.co.uk/jobs?cached=True&pageno='
loopbasesuf = '&datecreatedoffset=Today&pagesize=100'
for pnum in range(1, pagenum):
    url = loopbasepref + str(pnum) + loopbasesuf
    r = requests.get(url).content
    soup = BShtml(r, "html.parser")
    df2 = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div", class_="results col-xs-12 col-md-10")])
    result_set.append(df2)
    print(df2)

This is where I hit the error. What I'm trying to do is loop over every page that lists jobs, starting at page 1 and going up to page N where N = pagenum, extract the URLs that link to each individual job page, and store them in a dataframe. I've tried various combinations of soup.select("div", class_="") but every time I get the error: TypeError: select() got an unexpected keyword argument 'class_'.

If anyone has any thoughts on this and can see a good way forward, I'd really appreciate it!

Cheers

Chris

Best Answer

You can simply keep looping until there is no next page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://www.reed.co.uk"
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"

def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    # get the urls from the first page
    yield [urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')]
    nxt = soup.find("a", title="Go to next page")
    # title="Go to next page" is missing when there are no more pages
    while nxt:
        # wash/repeat until no more pages
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield [urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')]
        nxt = soup.find("a", title="Go to next page")

Just iterate over the generator function to get the URLs from every page:

for u in all_urls():
    print(u)

I also used a[href^="/jobs"] in the selector because there are other anchors that would otherwise match, so this makes sure we only pull out the job paths.

In your own code, the correct way to use the selector would be:

soup.select("div.results.col-xs-12.col-md-10")

Your syntax is the one used with find or find_all, where CSS classes are passed via class_=...:

soup.find_all("div", class_="results col-xs-12 col-md-10")

But that isn't the right selector here anyway.
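To make the difference concrete, here is a small self-contained sketch on a made-up HTML fragment (the fragment and the variable names are purely illustrative):

from bs4 import BeautifulSoup

# toy fragment, invented for illustration
html = """
<div class="results col-xs-12 col-md-10">
  <div class="details"><h3 class="title"><a href="/jobs/example-job/1">Example job</a></h3></div>
  <div class="details"><h3 class="title"><a href="/account/login">Not a job</a></h3></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select(): chain the classes with dots, no class_ keyword
via_select = soup.select("div.results.col-xs-12.col-md-10")

# find_all(): the same classes go into the class_ keyword argument
via_find_all = soup.find_all("div", class_="results col-xs-12 col-md-10")

print(len(via_select), len(via_find_all))  # 1 1 -- both match the same div

# the attribute-prefix selector keeps only links whose href starts with /jobs
print([a["href"] for a in soup.select('a[href^="/jobs"]')])  # ['/jobs/example-job/1']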

I'm not sure why you are creating multiple dfs, but if that is what you want:

def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')],
                       columns=["Links"])
    nxt = soup.find("a", title="Go to next page")
    while nxt:
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')],
                           columns=["Links"])
        nxt = soup.find("a", title="Go to next page")


dfs = list(all_urls())

That will give you a list of dfs:

In [4]: dfs = list(all_urls())

In [5]: dfs[0].head(10)
Out[5]:
Links
0 http://www.reed.co.uk/jobs/tufting-manager/308...
1 http://www.reed.co.uk/jobs/financial-services-...
2 http://www.reed.co.uk/jobs/head-of-finance-mul...
3 http://www.reed.co.uk/jobs/class-1-drivers-req...
4 http://www.reed.co.uk/jobs/freelance-middlewei...
5 http://www.reed.co.uk/jobs/sage-200-consultant...
6 http://www.reed.co.uk/jobs/bereavement-support...
7 http://www.reed.co.uk/jobs/property-letting-ma...
8 http://www.reed.co.uk/jobs/graduate-recruitmen...
9 http://www.reed.co.uk/jobs/solutions-delivery-...

But if you only want a single one, use your original code with itertools.chain:

from itertools import chain
df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_urls())))

That will give you all of the links in one df:

In [7]: from itertools import chain
   ...: df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_urls())))
   ...:

In [8]: df.size
Out[8]: 675
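If you would rather stay inside pandas than reach for itertools, concatenating the per-page frames also works; this sketch assumes the DataFrame-yielding version of all_urls() shown above:

import pandas as pd

# stack every per-page DataFrame yielded by all_urls() into a single frame;
# ignore_index=True renumbers the rows 0..N-1 across all pages
df = pd.concat(all_urls(), ignore_index=True)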

Regarding python - the correct div class combination for soup.select(), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40057058/
