python - requests-html 找不到页面元素-6ren

python - requests-html 找不到页面元素

转载作者：行者123 更新时间：2023-12-01 07:37:40

24

4

所以我尝试导航到此网址:https://www.instacart.com/store/wegmans/search_v3/horizon%201%25并使用 item-name item-row 类从 div 中抓取数据。但有两个主要问题，第一是 instacart.com 需要登录才能访问该网址，第二是大部分页面是使用 JavaScript 生成的。

我相信我已经解决了第一个问题，因为我的 session.post(...) 得到了 200 响应代码。我也非常确定 r.html.render() 应该通过在抓取 javascript 生成的 html 之前渲染它来解决第二个问题。不幸的是，我的代码中的最后一行只返回一个空列表，尽管事实上 selenium 获取这个元素没有问题。有谁知道为什么这不起作用？

from requests_html import HTMLSession
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
session = HTMLSession()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": token}
response = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(response)
r = session.get("https://www.instacart.com/store/wegmans/search_v3/horizon%201%25", headers=headers)
r.html.render()
print(r.html.xpath("//div[@class='item-name item-row']"))

最佳答案

使用 requests 模块和 BeautifulSoup 登录后，您可以使用我在评论中建议的链接来解析 json 中可用的所需数据。以下脚本应为您提供名称、数量、价格以及相关产品的链接。使用下面的脚本您只能获得 21 个产品。此 json 内容中有一个分页选项。您可以通过使用该分页来获取所有产品。

import json
import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.instacart.com/store/'
data_url = "https://www.instacart.com/v3/retailers/159/module_data/dynamic_item_lists/cart_starters/storefront_canonical?origin_source_type=store_root_department&tracking.page_view_id=b974d56d-eaa4-4ce2-9474-ada4723fc7dc&source=web&cache_key=df535d-6863-f-1cd&per=30"

data = {"user": {"email": "alexanderjbusch@gmail.com", "password": "password"},
        "authenticity_token": ""}
headers = {
    'user-agent':'Mozilla/5.0',
    'x-requested-with': 'XMLHttpRequest'
}
with requests.Session() as s:

    res = s.get('https://www.instacart.com/',headers={'user-agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')
    token = soup.select_one("[name='csrf-token']").get('content')

    data["authenticity_token"] = token

    s.post("https://www.instacart.com/accounts/login",json=data,headers=headers)
    resp = s.get(data_url, headers=headers)

    for item in resp.json()['module_data']['items']:
        name = item['name']
        quantity = item['size']
        price = item['pricing']['price']
        product_page = baseurl + item['click_action']['data']['container']['path']
        print(f'{name}\n{quantity}\n{price}\n{product_page}\n')

部分输出:

SB Whole Milk
1 gal
$3.90
https://www.instacart.com/store/items/item_147511418

Banana
At $0.69/lb
$0.26
https://www.instacart.com/store/items/item_147559922

Yellow Onion
At $1.14/lb
$0.82
https://www.instacart.com/store/items/item_147560764

关于python - requests-html 找不到页面元素，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/56909378/

24

4

0

文章推荐： java - 如何使用Java中的定时器来运行Job指定的时间？

文章推荐： java - 图像图标不显示

文章推荐： java - Java中的PreparedStatement插入

python - requests.request ('POST' 和 request.post 之间的区别
这两个句子有什么区别: res = requests.request('POST', url) 和 res = requests.request.post(url) 最佳答案它们几乎是一样的:htt
FaceBook API : Get the Request Object for a request Id - logged into the account that sent the request. 使用 "Requests Dialog"API
我正在使用“请求对话框”来创建 Facebook 请求。为了让用户收到请求，我需要使用图形 API 访问 Request 对象。我已经尝试了大多数看起来合适的权限设置(read_requests 和
python - http.client.HTTPConnection.request 与 urllib.request.Request
urllib.request和http.client都是python标准库。前者相关方法的文档是 here后者，here (我使用的是3.5) 有谁知道为什么标准库中有两种方法看起来做同样的事情，或者
Python 扭曲错误 : "Request.write called on a request after Request.finish was called"
我是 Twisted 的新手，我不明白为什么在运行我的脚本时会出现此错误。\ 基本上，该脚本由 2 个页面组成，第一个页面是一个 HTML 表单，它调用自身执行一个阻塞方法并显示结果。当请求同时发送到
javascript - request.body 与 request.params 与 request.query
我有一个客户端 JS 文件，其中包含: agent = require('superagent'); request = agent.get(url); 然后我有类似的东西 request.get(u
javascript - 在 Rails 应用程序中提前输入 : Append JSON request to only one specific request instead of appending JSON request to every request via prefetch
提前输入功能可以正常工作。但问题是，提前输入功能会在每个数据请求上发出 JSON 请求，而实际上只应针对一个特定请求发生。我有以下 Controller : #controllers/agencie
request - 如何在中间件和处理程序中读取 Iron Request？
我正在使用 Rust 开发一个小型 API，我不确定如何在两个地方访问来自 Iron 的 Request。 Authentication 中间件为 token 读取一次Request，如果路径被允许(
cnzz统计代码引起的Bad Request - Request Too Long的原因分析
问题起因今天一位网友向我们反馈，用Chrome打开某些博客文章时，会出现"Bad Request - Request Too Long. HTTP Error 400. The siz
java - 领英 OAuth : "signature_invalid" response when requesting a POST HTTP request (for request token)
当我从 LinkedIn 向 https://api.linkedin.com/uas/oauth/requestToken 请求请求 token 时，出现以下错误: oauth_problem=si
android - Request(okhttp3.Request.Builder) 在 okhttp3.Request 中有私有(private)访问权限
我只是想使用 okhttp 下载一些字节数据，但在我完成代码之前，我遇到了一个问题，android studio 报告了一个错误，说“Request(okhttp3.Request.Builder)
node.js - 如何修复 Windows 10 中的 "npm WARN deprecated request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142"错误？
我正在使用 Windows 10。我想在我的系统上使用 Angular 4。当我运行 node -v 和 npm -v 时，它会显示版本。但是当我执行语句 npm install -g @angula
rust - 无法编译 Iron 示例 : expected struct `iron::request::Request` , 找到结构 `iron::Request`
我正在尝试让一个简单的 Iron 示例起作用: extern crate iron; extern crate router; use iron::prelude::*; use iron::stat
python - Flask request.form 包含数据，但 request.data 为空且 request.get_json() 返回错误
我正在尝试使用嵌套字典“动态”创建一个数据输入表单(目前，我使用具有 3 个值的数组，但将来数组中的元素数量可能会有所不同)。这似乎工作正常，并且表单“正确”渲染了 html 模板(正确 = 我看到了
ASP.NET:使用 Request ["param"] 与使用 Request.QueryString ["param"] 或 Request.Form ["param"]
从 ASP.NET 中的代码隐藏访问表单或查询字符串值时，使用的优缺点是什么，例如: // short way string p = Request["param"]; 代替: // long way
ios - 如何处理这个 : There are five api requests running parallelly and 2nd request is dependent on 4th request's response
我遇到了一个问题，我想知道更好的解决方法。有五个 api 请求并行运行，第二个请求依赖于第四个请求的响应，但所有 5 个请求都已在运行。什么是更好的方法？需要建议。提前致谢。最佳答案调度地面工
python - urllib.request.Request 说参数无效
我收到以下错误:TypeError:序列项 0:预期字节、字节数组或具有缓冲区接口(interface)的对象、找到元组我检查了Python文档，urllib.request.Request的参数似
python - urllib.request.Request 超时参数错误
当我向函数添加超时参数时，我的代码总是进入异常并打印出“我失败了”。当我删除超时参数时，代码会正常工作，并进入 try 子句。关于超时参数如何在 urllib.request 函数中工作的任何信息？
php - preg_match html代码
我使用 cURL 向服务器发送请求这是链接:Server Side script for cURL request我用 file_get_contents('php://input'); 读取发送的数
java - org.apache.solr.common.SolrException : Bad Request Bad Request request: http://localhost:8080/solr/update? wt=javabin&version=2
请大家帮帮我我正在尝试使用 NUTCH 抓取网站，但它给我错误“java.io.IOException: Job failed!” 我正在运行此命令“bin/nutch solrindex http:
AngularJS 错误 : Unexpected request (No more requests expected)
在我的 AngularJS 应用程序中，我无法弄清楚如何对 then promise 的执行更改 location.url 进行单元测试。我有一个函数，登录，调用服务，身份验证服务 .它返回 pro

首页

博学

6Ren·AI

商城

python - requests-html 找不到页面元素