python - How to use scrapy.Request to load an element from another page into an item

Reposted. Author: 搜寻专家. Updated: 2023-10-31 22:42:12

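I have built a web scraper with Scrapy that can scrape the elements of each ticket on this website, but it cannot scrape the ticket price, because the price is not available on that page. When I request the next page to scrape the price, I get the error: exceptions.TypeError: 'XPathItemLoader' object has no attribute '__getitem__'. So far I have only managed to scrape elements with an item loader, so that is what I am using, and I am not sure of the correct way to pass an element scraped from another page into the item loader (I have seen a way to do it with the Item data type, but it does not apply here). I suspect I may have a problem extracting the elements into an item object, since I am piping the results into a database, but I am not sure. If the code below can be modified so that it correctly follows the link to the next page, scrapes the price, and adds it to the item loader, I think the problem should be solved. Any help would be appreciated. Thanks!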

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'

    def parse_price(self, response):
        # First attempt at trying to load price into item loader
        loader.add_xpath('ticketPrice' , '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price')
        print 'ticket price'

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader

            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader["ticketsLink"]
            request = scrapy.Request(ticketsURL , callback = self.parse_price)
            yield loader.load_item()

Best answer

Key issues to fix:

  • An item loader does not support subscript access. To read a value from it, use get_output_value(). Replace:

    loader["ticketsLink"]

    with:

    loader.get_output_value("ticketsLink")
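  • You need to pass the loader in the request's meta and yield/return the loaded item from the callback, not from parse()

  • When building the URL used to fetch the price, use urljoin() to join the relative part to the current URL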
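To make the last point concrete, here is a small sketch of how urljoin() resolves a link against the URL of the page it was found on. It uses the Python 3 location of the function (urllib.parse); `from urlparse import urljoin` is the Python 2 spelling. The URLs are hypothetical and only illustrate the three common cases.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed.
page_url = "https://www.example.com/search/results.html?q=band"

# A relative path replaces the last segment of the base path.
relative = urljoin(page_url, "concerts/band-tickets/band-123.html")

# A root-relative path (leading slash) replaces the whole base path.
root_relative = urljoin(page_url, "/concerts/band-tickets/band-123.html")

# An absolute URL is returned unchanged.
absolute = urljoin(page_url, "https://tickets.example.org/band-123.html")

print(relative)
print(root_relative)
print(absolute)
```

Note that the first form is resolved relative to the directory of the current page, so whether the href starts with a slash changes the result.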

Here is the fixed version:

from urlparse import urljoin
# other imports

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'

    def parse_price(self, response):
        loader = response.meta['loader']
        loader.add_xpath('ticketPrice' , '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price')
        return loader.load_item()

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):

            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader

            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price)

Regarding python - How to use scrapy.Request to load an element from another page into an item, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31215542/
