html - Scraping hidden elements with BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-12-05 03:12:13

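I am trying to scrape data from a website for my project. The problem is that the output I get does not contain the tags I can see in my browser's developer tools. Here is a snapshot of the DOM I want to scrape: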

<div class="bigContainer">
<!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
<div class="fl">
<!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
<div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
<grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
<a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
</grid-item>

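I can get the div tag with class "bigContainer", but I cannot scrape the tags inside it. For example, when I try to get the grid-item tags I get an empty list, which suggests that no such tag exists. Why does this happen? Please help!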

Best answer

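The grid items are rendered by the AngularJS JavaScript framework, so the HTML is not static and the tags are missing from what the server sends. You can use the underlying web API to extract the grid item details instead.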
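To make the "HTML is not static" point concrete, here is a minimal sketch using only the standard library's `html.parser`. Both HTML strings are hypothetical stand-ins (they are not the real page): one for what the server might send before any JavaScript runs, and one for the DOM after AngularJS has rendered the grid.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every start tag seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical server response: the container exists, but AngularJS has not
# filled it in yet, so a parser has nothing to find inside it.
static_html = '<div class="bigContainer"></div>'

# Hypothetical DOM after the browser has run the page's JavaScript.
rendered_html = (
    '<div class="bigContainer"><div class="GridItems">'
    '<grid-item product="product"><a href="/shop/p/example">item</a></grid-item>'
    '</div></div>'
)

static_counts = count_tags(static_html)
rendered_counts = count_tags(rendered_html)
print("grid-item tags in static HTML:", static_counts["grid-item"])
print("grid-item tags in rendered DOM:", rendered_counts["grid-item"])
```

A plain HTTP client only ever receives something like the first string, which is why a search for grid-item comes back empty even though the developer tools show the tag.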

One option is to use selenium to get the rendered data, but identifying the web API with the browser's developer tools is very simple.

Edit: I used the Firebug add-on in Firefox to view the GET requests issued by the page in the Network tab.

The GET request made by the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2

It returns a callback JS script (a JSONP response) that is almost entirely JSON data.

The JSON it returns contains the details of the grid items.
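Because the response is JSON wrapped in a callback call, the wrapper has to be removed before `json.loads` can parse it. The following is a rough sketch of one way to do that; the helper name `unwrap_jsonp` and the sample string are made up for illustration, and the regular expression assumes the body has the simple form `callbackName(...);`.

```python
import json
import re

# Matches "some.callback.name( <payload> )" with an optional trailing ";".
# The greedy (.*) runs to the LAST ")" so parentheses inside JSON strings
# do not cut the payload short.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Strip a JSONP callback wrapper and return the decoded JSON payload."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("response does not look like a JSONP callback")
    return json.loads(match.group(1))


# Made-up sample in the same shape as the real response. The "tag" value is
# hypothetical and contains ");" on purpose.
sample = (
    'angular.callbacks._3({"grid_layout": [{"name": "Samsung Galaxy Z1 (Black)", '
    '"offer_price": 5090, "tag": "demo);"}]});'
)

data = unwrap_jsonp(sample)
print(data["grid_layout"][0]["name"])
```

Matching up to the last closing parenthesis is a deliberate choice: splitting the text on `);` would truncate the payload as soon as any string value contained that sequence.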

Each grid item is described as a JSON object like this:

{
    "product_id": 23491960,
    "complex_product_id": 7287171,
    "name": "Samsung Galaxy Z1 (Black)",
    "short_desc": "",
    "bullet_points": {
        "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
    },
    "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
    "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
    "url_type": "product",
    "promo_text": null,
    "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
    "vertical_id": 18,
    "vertical_label": "Mobile",
    "offer_price": 5090,
    "actual_price": 5799,
    "merchant_name": "SMARTBUY",
    "authorised_merchant": false,
    "stock": true,
    "brand": "Samsung",
    "tag": "+5% Cashback",
    "product_tag": "+5% Cashback",
    "shippable": true,
    "created_at": "2015-09-17T08:28:25.000Z",
    "updated_at": "2015-12-29T05:55:29.000Z",
    "img_width": 400,
    "img_height": 400,
    "discount": "12"
}
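As an illustration of working with one such object, the sketch below loads a trimmed copy of the sample item into a small dataclass. The `GridItem` class and its `discount_percent` method are hypothetical names, not part of any API. The assumption that `discount` is the whole-number percentage saved against `actual_price` is only an observation from this one sample.

```python
import json
from dataclasses import dataclass


@dataclass
class GridItem:
    """A few fields of one grid item (hypothetical helper type)."""

    brand: str
    name: str
    offer_price: int
    actual_price: int

    @classmethod
    def from_json(cls, obj):
        # Pick only the keys needed here; the real object has many more.
        return cls(
            brand=obj["brand"],
            name=obj["name"],
            offer_price=obj["offer_price"],
            actual_price=obj["actual_price"],
        )

    def discount_percent(self):
        # Assumption: percentage off the actual price, truncated to an integer.
        return (self.actual_price - self.offer_price) * 100 // self.actual_price


# Trimmed copy of the sample grid item, keeping the original values.
raw = (
    '{"name": "Samsung Galaxy Z1 (Black)", "brand": "Samsung", '
    '"offer_price": 5090, "actual_price": 5799, "discount": "12"}'
)

obj = json.loads(raw)
item = GridItem.from_json(obj)
print(item.brand, "|", item.name, "| Rs", item.offer_price)
print("Computed discount:", item.discount_percent(), "| Reported:", obj["discount"])
```

For this sample the computed value agrees with the reported `"discount": "12"`, though whether the service truncates or rounds is not something the sample can settle.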

So you can get the details without even using BeautifulSoup, as follows.

import json

import requests

# JSONP endpoint found in the browser's Network tab
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url)

# Strip the JSONP wrapper "angular.callbacks._3( ... );" to leave plain JSON.
# Slicing between the first "(" and the last ")" stays correct even if a
# product string happens to contain ");".
text = response.text
jsonResponse = text[text.index("(") + 1 : text.rindex(")")]
data = json.loads(jsonResponse)
grid_data = data["grid_layout"]

for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")

You will get output like this:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================
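The request URL carries its options in the query string, so it can be easier to read and adjust when built from a dictionary. This is only a sketch: the function name `build_catalog_url` is hypothetical, and the idea that `page_count` selects the page and `items_per_page` sets the page size is an assumption based on the parameter names, not something the answer confirms.

```python
from urllib.parse import urlencode

# Path copied from the GET request shown in the answer.
BASE_URL = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones"


def build_catalog_url(page=1, items_per_page=30, callback="angular.callbacks._3"):
    """Rebuild the catalog request URL from its query parameters.

    Assumption: page_count is the page index and items_per_page the page size.
    """
    params = {
        "page_count": page,
        "items_per_page": items_per_page,
        "resolution": "960x720",
        "quality": "high",
        "sort_popular": 1,
        "cat_tree": 1,
        "callback": callback,
        "channel": "web",
        "version": 2,
    }
    return BASE_URL + "?" + urlencode(params)


first_page_url = build_catalog_url()
second_page_url = build_catalog_url(page=2)
print(first_page_url)
print(second_page_url)
```

With the default arguments this reproduces the original request URL character for character; any other page number is untested guesswork about how the service behaves.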

Hope this helps.

Regarding html - scraping hidden elements with BeautifulSoup, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34546766/
