
python - How to print the text written between <a>...</a> tags with BeautifulSoup and associate another attribute with that text

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:41:22

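I am trying to scrape a travel site, agoda.com, using Selenium and BeautifulSoup. I can reach the page where I need to scrape the hotel names and prices, and I have scraped them. The problem is that every value comes out with its tags attached, so I get the whole anchor element around The Taj Mahal Palace rather than just the name.

How do I get only the text between the anchor tags?

I have also scraped the prices, but they are wrapped in tags as well, and I don't know how to print the hotel name and price together, e.g. The Taj Mahal Palace, USD 219.

Please help.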

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import unittest, time, re
from bs4 import BeautifulSoup
import urllib2
import sys

# Python 2 only: make UTF-8 the default string encoding.
reload(sys)
sys.setdefaultencoding("utf8")


# CrawlSpider comes from Scrapy; the post does not show its import.
class Agoda(CrawlSpider):
    name = 'agoda'
    allowed_domains = ["agoda.com"]
    start_urls = ["http://www.agoda.com"]

driver = webdriver.Firefox()
driver.get("http://www.agoda.com")
driver.find_element_by_id("ctl00_ctl00_MainContent_area_promo_HomeSearchBox1_TextSearch1_searchText").clear()
driver.find_element_by_id("ctl00_ctl00_MainContent_area_promo_HomeSearchBox1_TextSearch1_searchText").send_keys("Mumbai")
driver.find_element_by_xpath("//select[contains(@id,'ddlCheckInDay')]")
driver.find_element_by_xpath("//option[contains(.,'Mon 09')]").click()
driver.find_element_by_id("ctl00_ctl00_MainContent_area_promo_HomeSearchBox1_SearchButton").click()
driver.find_element_by_id("ctl00_ContentMain_rptAB1936_ctl01_rptSearchResultAB1936_ctl01_lnkResult1936").click()
time.sleep(20)  # fixed wait for the results page to load
#print driver.page_source
TotalResults = driver.find_element_by_xpath("//span[@class='blue ssr_search_text']")
print TotalResults.text

html_source = driver.page_source
soup = BeautifulSoup(html_source)


names = soup("a", {"class":"hot_name"})

#comments = soup("div", {"class":"mbluebold col_hotelinfo_name"}, text = True)
#comments[0].Contents()
#print comments
#tags = soup.find_all("a")
for name in enumerate(names):
    print name

prices = soup("span", {"class":"fontxlargeb purple"})
for price in enumerate(prices):
    print price

Best answer

Try the get_text() method on the 'a' tags (or any other tags).

For instance, if html is simply '<a href="alisejflai">hello</a>', then after

soup = BeautifulSoup(html)

soup.get_text() returns 'hello'.
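
The same call works on each tag in a list of matches. Below is a minimal sketch in Python 3 (unlike the Python 2 snippets in this thread), assuming bs4 is installed. The sample markup, the href values, and the second hotel name are made up for illustration; only the hot_name class comes from the question.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; only the "hot_name" class comes from the question.
html = """
<a class="hot_name" href="#1"> The Taj Mahal Palace </a>
<a class="hot_name" href="#2">Hypothetical Hotel</a>
"""

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns only the text inside each tag, with
# leading and trailing whitespace removed.
hotel_names = [
    a.get_text(strip=True) for a in soup.find_all("a", class_="hot_name")
]

for hotel_name in hotel_names:
    print(hotel_name)
```

Passing strip=True is optional, but scraped text often carries stray whitespace or newlines around it.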

Edit:

Regarding your comment: enumerate(names) yields tuples of the following form:

(0, <a class="hot_name"> howdy pardner</a>)
(1, <a class="hot_name">againagain</a>)

Since you only want to call get_text() on the actual 'a' tag, you need to do the following:

for name in names:
    print name.get_text()  # no tuple involved

Or, if you have to use enumerate for some reason:

for name in enumerate(names):
    print name[1].get_text()  # accessing just the 'a' tag within the tuple
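
If the index is actually wanted, unpacking the tuple in the for statement reads more clearly than indexing with name[1]. This is a small Python 3 sketch; plain strings stand in for the matched tags, so no parsing is involved.

```python
# Plain strings stand in for the matched tags here.
names = ["howdy pardner", "againagain"]

# enumerate() yields (index, item) tuples, which is why printing the
# loop variable directly shows the whole tuple.
pairs = list(enumerate(names))
print(pairs)

# Unpacking in the for statement gives the index and the item separately.
# start=1 makes the numbering begin at 1 instead of 0.
numbered = []
for index, name in enumerate(names, start=1):
    numbered.append("{}. {}".format(index, name))

print("\n".join(numbered))
```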

Edit:

If you want the hotel names and prices "paired up", you can replace my first edit above with the following.

These list comprehensions are more Pythonic and, I believe, faster than for loops:

hotel_names = [name.get_text() for name in names]  # or [name[1].get_text() for name in enumerate(names)]
prices = [price.get_text() for price in prices]  # or [price[1].get_text() for price in enumerate(prices)]

name_price_list = zip(hotel_names, prices)

for name, price in name_price_list:
    print name, price

This prints:

name price
name price
name price etc.
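
One caveat with zip: it stops at the shorter input, so if the page yields more names than prices (or the reverse), the extra items are dropped without any error. The Python 3 sketch below adds a length check before pairing. The pair_up helper and the second hotel and price are hypothetical; plain strings stand in for the get_text() results.

```python
def pair_up(hotel_names, prices):
    """Pair names with prices by position, refusing mismatched lengths."""
    if len(hotel_names) != len(prices):
        raise ValueError(
            "got {} names but {} prices".format(len(hotel_names), len(prices))
        )
    return list(zip(hotel_names, prices))


# Sample data; only the first name and price appear in the question.
hotel_names = ["The Taj Mahal Palace", "Hypothetical Hotel"]
prices = ["USD 219", "USD 100"]

# Format each pair as "name, price", the layout the question asks for.
lines = [
    "{}, {}".format(name, price) for name, price in pair_up(hotel_names, prices)
]
for line in lines:
    print(line)
```

A length check only catches a missing item, not a misaligned one, so it is a safeguard rather than a guarantee that each price belongs to the hotel next to it.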

Let me know if this works for you.

Regarding python - How to print the text written between <a>...</a> tags with BeautifulSoup and associate another attribute with that text, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20444922/
