
javascript - Unable to scrape some details with BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-11-28 17:47:18

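I am using BeautifulSoup to fetch data, and everything in my code works except one thing: the price. I am trying to scrape a real estate website but cannot get the price. The site is https://www.proptiger.com/all-projects

Here is my code: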

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import time
import json
import io
url = "https://www.proptiger.com/all-projects"
# for all pages https://www.proptiger.com/all-projects?page=2
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
container = soup.find_all("section", {"class":"project-card-main-wrapper"})
print(len(container))

newFile = "Prop_Data.csv"
f = open(newFile, "w", encoding = "utf-8")
Headers = "Project, Url, City, Builder, Price\n"
f.write(Headers)
#f.close()

for i in container:
    contain = i.find_all("div", {"class":"proj-name"})
    project_name = contain[0]['title']
    url2 = i.div['data-url']
    url1 = "https://www.proptiger.com"
    url = url1+url2
    get_city = i.find_all("span", {"itemprop":"address"})  # or by div, {"class":"loc"}
    city = get_city[0]["title"]  # or by get_city[0].text
    builder = i.find_all("div", {"class":"projectBuilder put-ellipsis"})
    bName = builder[0].text
    price = i.find_all("div", {"class":"project-price"})
    pricereal = price[0].text  # fails here: "list index out of range"
    print(pricereal)
    #f.write("{}".format(project_name) +",{}".format(url)+",{}".format(city)+",{}".format(bName)+"\n")
#f.close()

Now, whenever I run this code, it says the list index is out of range.

Here is the HTML for the price:

<div class="project-price" itemscope="" itemtype="https://schema.org/PriceSpecification"><span itemprop="minPrice">₹ 32.4 L</span><span itemprop="maxPrice">- ₹ 88.0 L</span>
<!-- -if(project.avgPricePerUnitArea)div.text-right.price-perunit &#8377; / sq ft-->
</div>

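I want both the minimum and the maximum price, so I used .text. I get the price for roughly the first 5-6 items and then the list index out of range error. Can someone tell me what I am doing wrong?

Best answer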

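You are not getting the price because it is rendered by JavaScript, so it is missing from the HTML that urlopen receives. For those cards, find_all returns an empty list and price[0] raises the "list index out of range" error. Don't be misled by the fact that all the other items are printed while the price is not. To solve this, you can use selenium together with BeautifulSoup.

I have used only the necessary part of the code here: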

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.proptiger.com/all-projects")
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for item in soup.find_all("section", {"class":"project-card-main-wrapper"}):
    price = item.select(".project-price")[0].text if item.select(".project-price") else ""
    print(price)

Partial results:

₹ 32.4 L- ₹ 88.0 L
₹ 33.6 L- ₹ 51.0 L
₹ 62.0 L- ₹ 1.25 Cr
₹ 49.9 L- ₹ 1.32 Cr
₹ 35.0 L- ₹ 50.0 L
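
Since the question asks for the minimum and maximum price separately, the combined strings above can be split into numbers. The following is only a minimal sketch: the function name parse_price_range is hypothetical, and it assumes every amount looks like the samples shown here, i.e. "₹", a number, and the unit "L" (lakh, 100,000) or "Cr" (crore, 10,000,000).

```python
import re

# Indian numbering units seen in the sample output:
# 1 L (lakh) = 100,000 rupees, 1 Cr (crore) = 10,000,000 rupees.
UNIT_MULTIPLIER = {"L": 100_000, "Cr": 10_000_000}

# Matches a single amount such as "₹ 32.4 L" or "₹ 1.25 Cr".
AMOUNT_RE = re.compile(r"₹\s*(\d+(?:\.\d+)?)\s*(L|Cr)\b")


def parse_price_range(text):
    """Return (min_rupees, max_rupees) as ints, or None if no amount is found.

    Hypothetical helper; assumes the "₹ <number> <L|Cr>" format shown above.
    """
    amounts = [
        round(float(number) * UNIT_MULTIPLIER[unit])
        for number, unit in AMOUNT_RE.findall(text)
    ]
    if not amounts:
        return None
    return min(amounts), max(amounts)


if __name__ == "__main__":
    for line in ["₹ 32.4 L- ₹ 88.0 L", "₹ 62.0 L- ₹ 1.25 Cr"]:
        print(line, "->", parse_price_range(line))
```

Returning None for text without a recognizable amount mirrors the empty-string fallback used in the loop above, so cards without a price do not raise an error.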

To make things clearer, see the following:

>>> import requests
>>> link = "https://www.proptiger.com/all-projects"
>>> page = requests.get(link).text
>>> 'Umang Premiere' in page
True
>>> '₹ 35.0 L' in page
False
>>>

I did this in a Python IDE. As you can see, the project name is found in the page but the price is not. That is because of JavaScript. Hope this makes sense.
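
Because the price only appears once the JavaScript has run, a fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. One possible alternative, sketched below, is to poll driver.page_source until some marker text shows up. The helper name wait_for_page_source and the marker "project-price" are assumptions for illustration; Selenium also ships its own WebDriverWait for this purpose, which this hand-rolled loop does not use.

```python
import time


def wait_for_page_source(driver, marker, timeout=15.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears, then return the HTML.

    Hypothetical helper: `driver` is any object exposing a `page_source`
    attribute, such as a Selenium WebDriver. Raises TimeoutError if the
    marker has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} s")
        time.sleep(poll_interval)


# Possible usage with the answer's code (requires selenium and a browser driver):
#
#   driver = webdriver.Chrome()
#   driver.get("https://www.proptiger.com/all-projects")
#   html = wait_for_page_source(driver, "project-price")  # marker is an assumption
#   driver.quit()
#   soup = BeautifulSoup(html, "html.parser")
```

The marker is a plain substring check, so it should be text that is absent from the initial HTML and present only after rendering.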

About "javascript - Unable to scrape some details with BeautifulSoup": a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46384839/
