html - Scraping hidden elements with BeautifulSoup


Reposted by: 行者123. Updated: 2023-12-05 03:12:13


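I am trying to scrape data from a website for my project. The problem is that the output I get does not contain the tags I see in my browser's developer tools. Below is a snapshot of the DOM I want to scrape data from: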

<div class="bigContainer">
      <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
        <div class="fl">
          <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
          <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
          <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
              <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
           </grid-item>

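I can get the div tag with class "bigContainer", but I cannot scrape the tags inside it. For example, when I try to get the grid-item tag I get an empty list, which suggests there is no such tag. Why is that? Please help!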

Best answer

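The grid items are rendered in the browser by the AngularJS JavaScript framework, so the HTML is not static and the tags you are looking for are absent from the raw page source. You can instead use the underlying web API to extract the grid item details.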
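To illustrate why the search comes back empty, here is a minimal sketch. The `static_html` string is a hypothetical stand-in for what the server might send before AngularJS runs; the real page source was not captured in the question, so treat the markup as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML served before AngularJS runs:
# the container exists, but the grid items have not been rendered yet.
static_html = """
<div class="bigContainer">
  <!-- ngIf: products.grid_layout.length > 0 -->
</div>
"""

soup = BeautifulSoup(static_html, "html.parser")
container = soup.find("div", class_="bigContainer")
grid_items = soup.find_all("grid-item")

print(container is not None)  # the container div is found
print(grid_items)             # no grid-item tags exist in the static markup
```

BeautifulSoup only parses the text it is given and never executes JavaScript, so elements created client-side are not visible to it.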

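One option is to fetch the rendered page with selenium, but it is also very easy to identify the web API with the browser's developer tools.

Edit: I used the Firebug add-on in Firefox to view the GET requests listed in the "Network" tab.

The GET request made by the page is: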

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2

It returns a JS callback script (a JSONP response) whose body is almost entirely JSON data.
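Unwrapping such a callback script can be sketched as follows. The `sample` string is a shortened, hypothetical response body, and `unwrap_jsonp` is an illustrative helper name rather than part of any library.

```python
import json


def unwrap_jsonp(body: str) -> dict:
    """Return the JSON payload wrapped in a JSONP call such as cb({...});"""
    start = body.index("(") + 1  # first "(" opens the callback call
    end = body.rindex(")")       # last ")" closes it
    return json.loads(body[start:end])


# Hypothetical, shortened response body; the real one is much larger.
sample = 'angular.callbacks._3({"grid_layout": [{"name": "Samsung Galaxy Z1 (Black)"}]});'
payload = unwrap_jsonp(sample)
print(payload["grid_layout"][0]["name"])
```

Slicing between the first "(" and the last ")" keeps the payload intact even when a string value inside the JSON happens to contain parentheses.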

The JSON it returns contains the details of the grid items.

Each grid item is described as a JSON object like the following:

{
        "product_id": 23491960,
        "complex_product_id": 7287171,
        "name": "Samsung Galaxy Z1 (Black)",
        "short_desc": "",
        "bullet_points": {
            "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
        },
        "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "url_type": "product",
        "promo_text": null,
        "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
        "vertical_id": 18,
        "vertical_label": "Mobile",
        "offer_price": 5090,
        "actual_price": 5799,
        "merchant_name": "SMARTBUY",
        "authorised_merchant": false,
        "stock": true,
        "brand": "Samsung",
        "tag": "+5% Cashback",
        "product_tag": "+5% Cashback",
        "shippable": true,
        "created_at": "2015-09-17T08:28:25.000Z",
        "updated_at": "2015-12-29T05:55:29.000Z",
        "img_width": 400,
        "img_height": 400,
        "discount": "12"
    }
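As a small worked example of reading these fields, the sketch below uses a trimmed copy of the object above. That `discount` is the percentage difference between `actual_price` and `offer_price` is an assumption; it happens to match this sample, and `percent_off` is a hypothetical helper.

```python
# Trimmed copy of the grid item shown above (price-related fields only).
item = {
    "name": "Samsung Galaxy Z1 (Black)",
    "offer_price": 5090,
    "actual_price": 5799,
    "discount": "12",  # note: the API returns this as a string
}


def percent_off(actual: float, offer: float) -> int:
    """Percentage saved relative to the actual price, rounded to an integer."""
    return round((actual - offer) / actual * 100)


# Assumption: "discount" is derived this way; it matches for this sample.
computed = percent_off(item["actual_price"], item["offer_price"])
print(item["name"], "-", computed, "% off")
print(computed == int(item["discount"]))
```

Here (5799 - 5090) / 5799 is roughly 0.122, which rounds to the 12 reported in the `discount` field.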

So you can get the details without even using BeautifulSoup, as follows.

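import json

import requests

# JSONP endpoint found in the browser's Network tab
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error

# The body looks like angular.callbacks._3({...}); so keep only the JSON
# between the first "(" and the last ")". Splitting on ");" could cut the
# payload short if that sequence appears inside a string value.
text = response.text
json_response = text[text.index("(") + 1 : text.rindex(")")]
data = json.loads(json_response)
grid_data = data["grid_layout"]

# Print a few fields of every grid item
for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")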

You will get output like this:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================

Hope this helps.

Regarding "html - Scraping hidden elements with BeautifulSoup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34546766/
