
Python and BeautifulSoup 4/Selenium - can't get data from kicksusa.com?


I'm trying to scrape data from kicksusa.com, but I've run into some problems.

When I try a basic BS4 approach like the one below (the imports are copied/pasted from the main program, which uses all of them):

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

data1 = requests.get('https://www.kicksusa.com/')
soup1 = BeautifulSoup(data1.text, 'html.parser')

button = soup1.find('span', attrs={'class': 'shop-btn'}).text.strip()
print(button)

The result is "None", which tells me the data is being hidden by JS. So I tried Selenium, like this:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get('https://www.kicksusa.com/')
url = driver.find_element_by_xpath("//span[@class='shop-btn']").text
print(url)
driver.close()

I get "Unable to locate element".

Does anyone know how to scrape this site with BS4 or Selenium? Thanks in advance!

Best Answer

The problem is that you are being detected as a bot and get a response like this:

<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=5-36224256-0%200NNN%20RT%281552245394179%20277%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B15%2811%2c110765%2c0%29%20U2&incident_id=314001710050302156-195663432827669173&edet=15&cinfo=0b000000"
frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula
incident ID: 314001710050302156-195663432827669173
</iframe>
</body>
</html>
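
A quick way to tell from requests whether you received this block page is to look for the ROBOTS meta tag or the "Incapsula" marker in the response; a minimal sketch based on the blocked response shown above:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.kicksusa.com/')
soup = BeautifulSoup(response.text, 'html.parser')

# The block page contains a <meta name="ROBOTS"> tag and references "Incapsula";
# a real page of the site does not.
is_blocked = (soup.find('meta', attrs={'name': 'ROBOTS'}) is not None
              or 'Incapsula' in response.text)
print('blocked by bot detection:', is_blocked)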

Requests and BeautifulSoup

If you want to use requests and bs, copy the visid_incap_ and incap_ses_ cookies from a request to www.kicksusa.com in your browser's dev tools and use them in your request:

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.kicksusa.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.121 Safari/537.36',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
    'Cookie': 'visid_incap_...=put here your visid_incap_ value; incap_ses_...=put here your incap_ses_ value',
}

response = requests.get('https://www.kicksusa.com/', headers=headers)

page = BeautifulSoup(response.content, "html.parser")

shop_buttons = page.select("span.shop-btn")
for button in shop_buttons:
    print(button.text)

print("the end")

Selenium

When you run Selenium, you sometimes get the same blocked response (the Incapsula page shown above).

Reloading the page worked for me. Try the code below:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')

if len(driver.find_elements_by_css_selector("[name=ROBOTS]")) > 0:
    driver.get('https://www.kicksusa.com/')

shop_buttons = driver.find_elements_by_css_selector("span.shop-btn")
for button in shop_buttons:
    print(button.text)
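
Note that the find_elements_by_* helpers were removed in Selenium 4; a roughly equivalent sketch using the By locator class and an explicit wait, still assuming the span.shop-btn selector from above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')

# If the Incapsula block page (with its <meta name="ROBOTS"> tag) was served, reload once.
if driver.find_elements(By.CSS_SELECTOR, "[name=ROBOTS]"):
    driver.get('https://www.kicksusa.com/')

# Wait up to 10 seconds for the shop buttons to render before reading them.
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span.shop-btn")))
except Exception:
    pass  # the block page may have been served again; nothing to print then

for button in driver.find_elements(By.CSS_SELECTOR, "span.shop-btn"):
    print(button.text)

driver.quit()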

Regarding "Python and BeautifulSoup 4/Selenium - can't get data from kicksusa.com?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55089759/
