
python - urllib does not return the requested content

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:56:20

I have two pages that I want to scrape: url_1 and url_2.

The only difference between them is that url_1 is the first page of results on a domain and url_2 is the third page on the same domain.
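For concreteness, the two URLs given later in the answer differ only in the `pagina` query parameter (empty for the first page, `3` for the third). A minimal sketch of building them with the standard library follows; `page_url` is a hypothetical helper name, and it is an assumption that other page numbers follow the same pattern.

```python
from urllib.parse import urlencode

# Base search URL, taken from the URLs in the answer.
BASE = "https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk"


def page_url(page=None):
    """Build the search URL; page=None leaves `pagina` empty, as in url_1."""
    params = [
        ("searchtype", "generic"),
        ("zoekterm", ""),
        ("pagina", "" if page is None else str(page)),
        ("filterSectoren", "BVE"),
    ]
    return BASE + "?" + urlencode(params)


url_1 = page_url()   # first page
url_2 = page_url(3)  # third page
```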

I am using urllib to read the URLs:

from urllib.request import urlopen
html_1 = urlopen(url_1).read()
html_2 = urlopen(url_2).read()

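Unfortunately, html_2 and html_1 have the same content. After reading around, I found that this may happen because the server treats me as a bot. So I used the requests module and Beautiful Soup to fetch and parse the pages: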

import requests
from bs4 import BeautifulSoup
session = requests.Session()
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}

req_1 = session.get(url_1, headers=headers)
bsObj_1 = BeautifulSoup(req_1.text)
req_2 = session.get(url_2, headers=headers)
bsObj_2 = BeautifulSoup(req_2.text)

The content is still the same. How can I fix it?
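One way to confirm that two responses really are byte-for-byte identical, rather than comparing them by eye, is to compare digests of the bodies. This is a minimal sketch; `body_digest` and `same_body` are hypothetical helper names, not part of the original question.

```python
import hashlib


def body_digest(body):
    """Return the SHA-256 hex digest of a response body given as bytes."""
    return hashlib.sha256(body).hexdigest()


def same_body(body_a, body_b):
    """True if the two response bodies are byte-for-byte identical."""
    return body_digest(body_a) == body_digest(body_b)
```

With the question's variables this would be called as `same_body(html_1, html_2)`, or as `same_body(req_1.content, req_2.content)` for the requests version.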

Best answer

Try this:

import requests
from bs4 import BeautifulSoup
import time

url_1 = 'https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk?searchtype=generic&zoekterm=&pagina=&filterSectoren=BVE'
url_2 = 'https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk?searchtype=generic&zoekterm=&pagina=3&filterSectoren=BVE'

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}

with requests.Session() as s:
    s.headers.update(headers)
    s.get('https://www.zoekscholen.onderwijsinspectie.nl/')
    req_1 = s.get(url_1)
    soup1 = BeautifulSoup(req_1.text, "lxml")
    print(soup1.find("div", {"id": "mainResults"}).find_all("h2")[0].text)
    time.sleep(1)
    req_2 = s.get(url_2)
    soup2 = BeautifulSoup(req_2.text, "lxml")
    print(soup2.find("div", {"id": "mainResults"}).find_all("h2")[0].text)

Output:

Resultaten 1 - 20 van 165

Resultaten 41 - 60 van 165
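The answer does not say why it works. One plausible reading, and it is only an assumption, is that the first request to the home page lets the server set cookies that the session then resends. Under that assumption, the same pattern can be sketched with only the standard library, since the question started with urllib. `make_opener` and `fetch_all` are hypothetical helper names, and this has not been checked against the site.

```python
import time
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_opener(headers):
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replaces the default "User-agent: Python-urllib/x.y" header.
    opener.addheaders = list(headers.items())
    return opener, jar


def fetch_all(opener, home_url, urls, delay=1.0):
    """Visit home_url once, then fetch each URL with the same opener."""
    with opener.open(home_url) as resp:
        resp.read()  # priming request, mirroring the answer's first s.get()
    bodies = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between page requests, as in the answer
        with opener.open(url) as resp:
            bodies.append(resp.read())
    return bodies
```

A call would look like `opener, jar = make_opener(headers)` followed by `html_1, html_2 = fetch_all(opener, 'https://www.zoekscholen.onderwijsinspectie.nl/', [url_1, url_2])`. Inspecting `jar` afterwards shows whether the server set any cookies at all.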

For more on python - urllib not returning the requested content, see the similar question on Stack Overflow: https://stackoverflow.com/questions/49573758/
