python-3.x - Web scraping: Page exists but getting 404 using requests/urllib

Reposted by: 行者123 · Updated: 2023-12-02 20:37:00

I am trying to scrape the following page: http://usbcdirectory.com/listing/1-us-black-chambers

I am using Python 3.5.0.

Here is my code:

import urllib.request

urllib.request.urlopen('http://usbcdirectory.com/listing/1-us-black-chambers')

With the code above, I get a 404 Not Found error. However, the page exists when I open it in a browser.
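For context, urllib.request.urlopen raises urllib.error.HTTPError on a 4xx or 5xx response instead of returning it, so the status has to be read from the exception. The following is a minimal sketch of that pattern; the helper name fetch_status and its opener parameter are hypothetical and only exist to keep the example easy to exercise without a network connection.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return (status_code, reason) for url without raising on HTTP errors.

    `opener` is a hypothetical hook for illustration; by default the real
    urllib.request.urlopen is used.
    """
    try:
        with opener(url) as response:
            return response.status, response.reason
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the status code
        # and reason phrase are still available on the exception object.
        return err.code, err.reason


# Example call (performs a real network request when uncommented):
# print(fetch_status('http://usbcdirectory.com/listing/1-us-black-chambers'))
```

Reading the code this way makes it possible to tell a genuine HTTP 404 from a connection or DNS failure, which would surface as a different exception.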

I searched for a solution to this problem, and here is what I tried:

Switching from urllib to requests: I did this, and the response still has a 404 status code:

>>> requests.get('http://usbcdirectory.com/listing/1-us-black-chambers')
<Response [404]>

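I checked that my link is correct.

I tried to find out whether the page is generated with JavaScript. I believe it is not.

What is going on with this page? Are they blocking scraping somehow, or is there a problem with the URL?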

Best answer

As you guessed, they may be blocking your request. You can pass custom headers so that your request looks like it comes from a real browser:

import requests

url = 'http://usbcdirectory.com/listing/1-us-black-chambers'
# Send an Accept header, as a browser would
headers = {'Accept': 'text/html'}
response = requests.get(url, headers=headers)
# Print the HTTP status code of the response
print(response.status_code)
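Since the question started with urllib, the same idea can be sketched with the standard library alone. This is an illustration under assumptions, not part of the original answer: the helper name build_request is hypothetical, and the User-Agent value is an illustrative placeholder that the answer does not mention. Whether these headers are enough depends on how the server filters requests.

```python
import urllib.request

URL = 'http://usbcdirectory.com/listing/1-us-black-chambers'


def build_request(url):
    """Build a urllib Request that carries browser-like headers."""
    headers = {
        'Accept': 'text/html',  # same header as in the requests example
        # Assumption: illustrative User-Agent, not from the original answer.
        'User-Agent': 'Mozilla/5.0',
    }
    return urllib.request.Request(url, headers=headers)


request = build_request(URL)
# To send it, pass the object to urllib.request.urlopen(request); that call
# raises urllib.error.HTTPError if the server still answers with an error.
```

Note that urllib stores header names in capitalized form, so the User-Agent header is looked up as request.get_header('User-agent').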

Regarding "python-3.x - Web scraping: Page exists but getting 404 using requests/urllib", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46843293/
