
python - Access denied when web scraping CrunchBase with a User-Agent header

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:00:54

I am trying to scrape Crunchbase to find the total funding amount for certain companies. Here is a link to an example.

At first I tried using Beautiful Soup on its own, but I kept getting this error:

Access to this page has been denied because we believe you are using automation tools to browse the\nwebsite.

I then looked up how to make the request look like a browser visit and changed my code, but I still get the same error. What am I doing wrong?

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
           'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

Best answer

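Overall, your code looks fine. The site you are trying to scrape appears to require a more complete set of headers than the single User-Agent you are sending. The following code should solve your problem: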

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content)
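
Printing response.content makes it easy to overlook that the block page came back instead of the company page. The following is a minimal sketch of a check for that case, not part of the original answer. The names BLOCK_MARKER, BROWSER_HEADERS, looks_blocked and fetch are hypothetical helpers. The marker text is taken from the error quoted in the question, and the idea that a block may also show up as HTTP 403 is an assumption about how the site responds.

```python
# Marker text copied from the error message quoted in the question.
BLOCK_MARKER = "Access to this page has been denied"

# Header values copied from the answer; the constant name is hypothetical.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}


def looks_blocked(status_code, body):
    """Return True if a response looks like the bot-detection page.

    Assumption: a block appears either as HTTP 403 or as the denial
    text from the question somewhere in the response body.
    """
    return status_code == 403 or BLOCK_MARKER in body


def fetch(url, headers=BROWSER_HEADERS, timeout=10):
    """Return the page HTML, or raise RuntimeError if the request was blocked."""
    # Imported here so the helpers above can be used without requests installed.
    import requests

    response = requests.get(url, headers=headers, timeout=timeout)
    if looks_blocked(response.status_code, response.text):
        raise RuntimeError(f"Request was blocked (HTTP {response.status_code})")
    response.raise_for_status()  # surface any other HTTP error
    return response.text
```

Calling fetch(url) with the URL from the question either returns HTML that can be passed to BS(html, "html.parser") or fails loudly, so a blocked request is not mistaken for real page content.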

Regarding "python - Access denied when web scraping CrunchBase with a User-Agent header", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55749558/
