
javascript - Unable to scrape a web page using Python

Reposted. Author: 可可西里. Updated: 2023-11-01 16:36:26

I have posted a similar question before. I am trying to scrape a web page using the following approach:

import requests

url = 'https://www.zameen.com/'
res = requests.get(url)
data = res.text
print(data)

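The response says that I am a bot or that JavaScript is not enabled. I checked, and JavaScript is enabled. So I tried another approach with a fake user agent, using the following code: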

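import requests
from fake_useragent import UserAgent

url = 'https://www.zameen.com/'
ua = UserAgent()  # provides browser User-Agent strings
headers = {'User-Agent': str(ua.chrome)}  # present the request as Chrome
web_page = requests.get(url, headers=headers)
print(web_page.content)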

Response:

b'<!DOCTYPE html>\n\n\t\n\n\t\n\t\n\t\n\n\t\n\t\n\n\t\n\t\n\t\n\n<head>\n<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">\n<meta http-equiv="cache-control" content="max-age=0" />\n<meta http-equiv="cache-control" content="no-cache" />\n<meta http-equiv="expires" content="0" />\n<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />\n<meta http-equiv="pragma" content="no-cache" />\n<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/&amp;distil_RID=053235A2-0030-11E7-8429-B03805AB611E&amp;distil_TID=20170303163950" />\n<script type="text/javascript">\n\t(function(window){\n\t\ttry {\n\t\t\tif (typeof sessionStorage !== \'undefined\'){\n\t\t\t\tsessionStorage.setItem(\'distil_referrer\', document.referrer);\n\t\t\t}\n\t\t} catch (e){}\n\t})(window);\n</script>\n<script type="text/javascript" src="/ga368490.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#caexxxzxycbzutyvy{display:none!important}</style></head>\n<body>\n<div id="distil_ident_block">&nbsp;</div>\n</body>\n</html>\n'
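The response above is not the real page but a bot-check interstitial: its meta refresh points at /distil_r_captcha.html and the body holds only a distil_ident_block div. A minimal sketch of a check for that case follows. The two markers are taken only from the response shown here, and the helper name looks_like_bot_challenge is hypothetical; other sites, or later versions of this one, may use different markers.

```python
def looks_like_bot_challenge(html: str) -> bool:
    """Heuristic: return True if html resembles the block page shown above."""
    # Markers copied from the response in the question. They are an
    # assumption, not a documented interface of the site.
    markers = ("distil_r_captcha", "distil_ident_block")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


sample = '<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/" />'
print(looks_like_bot_challenge(sample))
print(looks_like_bot_challenge("<html><body><h1>Listings</h1></body></html>"))
```

Running such a check on res.text before parsing makes it obvious when a request was served the challenge page rather than the real content.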

It detected me as a bot again. So I checked whether I am allowed to fetch data from the site at all, using robotparser from urllib:

from urllib import robotparser

req = robotparser.RobotFileParser()
req.set_url(url)
req.read()
print(req.can_fetch('*','https://www.zameen.com/'))

Returns:

True  # Means I can fetch the data from the website.
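One caveat about the check above: RobotFileParser.set_url expects the URL of the robots.txt file itself, so pointing it at the home page may not evaluate the site's actual rules. robots.txt is also only advisory and says nothing about whether a bot-detection layer will serve the page. A minimal offline sketch using parse() is shown below; the rules and the example.com URLs are made up for illustration and are not zameen.com's real robots.txt.

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse() takes an iterable of lines; no network access

print(parser.can_fetch("*", "https://www.example.com/"))
print(parser.can_fetch("*", "https://www.example.com/private/page"))
```

Against a live site, the equivalent would be to call set_url with the address ending in /robots.txt and then read(), as in the code above.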

Is there any way to get data from this web page? Thanks.

Best answer

You can use BeautifulSoup together with a Selenium driver for this. I successfully retrieved the page source from the URL you provided:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox() # Could be any other browser you have the drivers for
driver.get('https://zameen.com')
html = driver.page_source
code = BeautifulSoup(html, 'html5lib')
print(code)

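Just don't forget to install bs4 and Selenium (the 'html5lib' parser used above also requires the html5lib package):

pip install bs4

pip install selenium

pip install html5lib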

Regarding "javascript - Unable to scrape a web page using Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42584357/
