Python,bs4 : Tags in inspection are nowhere to be found when parsing

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:12:13

I've run into an unexpected problem. I'm using Python 3.5 and BeautifulSoup, and I want to parse the following link:

url = 'https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s'
import requests, bs4
res = requests.get(url)
res.raise_for_status()
DicoSoup = bs4.BeautifulSoup(res.text, "lxml")

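I'm interested in retrieving the links to the images in the listing. When I inspect the site's HTML, I see that they are img tags inside span tags with class "item_imagePic", which in turn sit under a div tag with class "thumbnails".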

However, when I select the div tag, the span tags are nowhere to be found:

div = DicoSoup.select("div.thumbnails")

div
Out[54]:
[<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5">
<ul>
<li class="thumb selected trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_0"></li>
<li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_1"> </li>
<li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_2"></li>
</ul>
</div>]

When I inspect the HTML in the browser, this is what I see:

<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5" style="width: 596px;">
<ul style="">

<li id="thumb_0" class="thumb selected trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

<li id="thumb_1" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

<li id="thumb_2" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

</ul>
</div>

How is this possible? What do I need to do to select them?

I have already tried:

div = DicoSoup.select_one("div.thumbnails span.item_imagePic")
div = DicoSoup.select_one("div.thumbnails ul li span.item_imagePic")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic")
span = DicoSoup.find('span', {'class': 'item_imagePic'})
span = DicoSoup.find('span',id="thumb_0")
div = DicoSoup.select("div.thumbnails img")
div = DicoSoup.select("div.thumbnails span img")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic img")

They all return objects of type "NoneType".

Thanks

Best answer

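As I commented, the thumbnails are generated dynamically with JavaScript, and requests only downloads the static HTML without running it, so the span and img tags never appear in what BeautifulSoup parses. You can, however, grab the script and parse the paths out of it: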

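import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

# The image paths sit in the first <script> after the thumbnails div.
script = soup.select_one("div.thumbnails").find_next("script")
print(script.text.strip())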

This gives you:

var images = new Array(), images_thumbs = new Array();
images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";
images[0] = "//img0.leboncoin.fr/images/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";

images_thumbs[1] = "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";
images[1] = "//img1.leboncoin.fr/images/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";

images_thumbs[2] = "//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg";
images[2] = "//img2.leboncoin.fr/images/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg";
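
To see offline why the selectors come up empty, here is a minimal sketch that parses a trimmed-down stand-in for the static HTML. The `STATIC_HTML` string and its `example.invalid` paths are assumptions made up for illustration, not the real page, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests receives: the <li> items are
# empty and the image paths sit in a <script> that follows the div.
STATIC_HTML = """
<div class="thumbnails" data-alt="demo">
<ul>
<li class="thumb selected trackable" id="thumb_0"></li>
<li class="thumb trackable" id="thumb_1"></li>
</ul>
</div>
<script>
var images = new Array(), images_thumbs = new Array();
images_thumbs[0] = "//img0.example.invalid/thumbs/a.jpg";
images[0] = "//img0.example.invalid/images/a.jpg";
</script>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# select() returns an empty list when nothing matches,
# while select_one() and find() return None.
spans = soup.select("div.thumbnails span.item_imagePic")
first_span = soup.select_one("div.thumbnails span.item_imagePic")

# find_next() walks forward in document order from the div and
# returns the first <script> element it meets.
script = soup.select_one("div.thumbnails").find_next("script")
has_paths = "images_thumbs" in script.string

print(len(spans), first_span, has_paths)
```

The empty results are not a selector mistake: the tags simply are not in the markup being parsed, whereas the script that would create them is.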

To get the image links:

import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

script = soup.select_one("div.thumbnails").find_next("script").text

# Raw string, so the regex escapes are not read as string escapes.
print(re.findall(r'images_thumbs\[\d+\]\s+=\s+"(.*?)";', script))

Or just split the lines and strip:

[s.split("=", 1)[1].strip('"; ') for s in script.splitlines() if s.strip().startswith("images_thumbs")]

Both give you:

[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
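
The script also carries the full-size paths in `images`. As a rough sketch, assuming the script keeps the `name[index] = "path";` shape shown earlier, both arrays can be collected in one pass. The `collect_image_paths` helper and the shortened `example.invalid` paths are hypothetical, for illustration only.

```python
import re

# Sample script text in the shape shown earlier; the paths are
# shortened placeholders, not real URLs.
SCRIPT_TEXT = '''
var images = new Array(), images_thumbs = new Array();
images_thumbs[0] = "//img0.example.invalid/thumbs/a.jpg";
images[0] = "//img0.example.invalid/images/a.jpg";

images_thumbs[1] = "//img1.example.invalid/thumbs/b.jpg";
images[1] = "//img1.example.invalid/images/b.jpg";
'''

# Captures the array name, the numeric index, then the quoted path.
ASSIGNMENT = re.compile(r'\b(images(?:_thumbs)?)\[(\d+)\]\s*=\s*"([^"]*)"')


def collect_image_paths(script_text):
    """Group paths by array name, ordered by index (hypothetical helper)."""
    found = {"images": {}, "images_thumbs": {}}
    for name, index, path in ASSIGNMENT.findall(script_text):
        found[name][int(index)] = path
    return {name: [by_index[i] for i in sorted(by_index)]
            for name, by_index in found.items()}


paths = collect_image_paths(SCRIPT_TEXT)
print(paths["images_thumbs"])
print(paths["images"])
```

Keying on the index rather than on line order keeps each thumbnail aligned with its full-size image even if the assignments are interleaved differently.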

Finally, all you need to do is prepend the https scheme. The paths already start with "//", so only "https:" has to be added:

["https:" + path for path in re.findall(r'images_thumbs\[\d+\]\s+=\s+"(.*?)";', script)]
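
Paths such as `//img0.leboncoin.fr/...` are protocol-relative: they already begin with two slashes and only lack a scheme. As an alternative sketch, `urljoin` from the standard library resolves them against the page URL, so the scheme is taken from the page itself. The `PAGE_URL` and `relative_paths` names are illustrative, and the paths are the ones printed earlier in this answer.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s"

# Protocol-relative paths as extracted from the script.
relative_paths = [
    "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg",
    "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg",
    "//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg",
]

# urljoin keeps the host and path of each reference and borrows
# only the scheme from the base URL.
absolute_urls = [urljoin(PAGE_URL, path) for path in relative_paths]

for url in absolute_urls:
    print(url)
```

This avoids hand-counting slashes, and it also copes with paths that are already absolute, which `urljoin` returns unchanged.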

For more on "Python,bs4 : Tags in inspection are nowhere to be found when parsing", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/39180183/
