
python - Crawling into a further URL

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:34:27

So I have a crawler that extracts information about gigs nicely. However, among the information I scrape there is a URL that shows more information about the listed gig, such as the music genre. How can I crawl into that URL and then carry on scraping everything else?

Here is my code. Any help is much appreciated.

import scrapy  # Import required libraries.
from scrapy.selector import HtmlXPathSelector  # Allows for path detection in a website's code.
from scrapy.spider import BaseSpider  # Used to create a simple spider to extract data.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  # Needed for the extraction of href links in HTML to crawl further pages.
from scrapy.contrib.spiders import CrawlSpider  # Needed to make the crawl spider.
from scrapy.contrib.spiders import Rule  # Allows specified rules to control which links are followed.
from urlparse import urlparse
import soundcloud
import mysql.connector
import requests
import time
from datetime import datetime

from tutorial.items import TutorialItem

genre = ["Dance",
         "Festivals",
         "Rock/pop"
         ]

class AllGigsSpider(CrawlSpider):
    name = "allGigs"  # Name of the spider. At the command prompt, in the correct folder, enter "scrapy crawl allGigs".
    allowed_domains = ["www.allgigs.co.uk"]  # Allowed domains is a string, NOT a URL.
    start_urls = [
        #"http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        #"http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/tours-65.html"
    ]

    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'),  # Search the start URLs for pagination links.
             callback="parse_item",
             follow=True),
    ]

    def parse_start_url(self, response):  # http://stackoverflow.com/questions/15836062/scrapy-crawlspider-doesnt-crawl-the-first-landing-page
        return self.parse_item(response)

    def parse_item(self, response):
        for info in response.xpath('//div[@class="entry vevent"]'):
            item = TutorialItem()  # Extract items from the items folder.
            item['table'] = "London"
            item['url'] = info.xpath('.//a[@class="url"]/@href').extract()
            print item['url']
            item['genres'] = info.xpath('.//li[@class="style"]//text() | ./parent::a[@class="url"]/preceding-sibling::li[@class="style"]//text()').extract()
            print item['genres']
            item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()  # Extract artist information.
            item['venue'] = info.xpath('.//span[@class="vcard location"]//text()').extract()  # Extract venue information.
            item['borough'] = info.xpath('.//span[@class="adr"]//text()').extract()  # Extract borough information.
            item['date'] = info.xpath('.//span[@class="dates"]//text()').extract()  # Extract date information.
            a, b, c = item["date"][0].split()
            item['dateForm'] = (datetime.strptime("{} {} {} {}".format(a, b.rstrip("ndthstr"), c, "2015"), "%a %d %b %Y").strftime("%Y,%m,%d"))
            preview = ''.join(str(s) for s in item['artist'])
            item['genre'] = info.xpath('.//div[@class="header"]//text() | ./parent::div[@class="rows"]/preceding-sibling::div[@class="header"]//text()').extract()
            client = soundcloud.Client(client_id='401c04a7271e93baee8633483510e263', client_secret='b6a4c7ba613b157fe10e20735f5b58cc', callback='http://localhost:9000/#/callback.html')
            tracks = client.get('/tracks', q=preview, limit=1)
            for track in tracks:
                print track.id
                item['trackz'] = track.id
                yield item
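The date-normalization step in the spider is worth unpacking: `rstrip("ndthstr")` removes any trailing ordinal-suffix letters ("st", "nd", "rd", "th") from the day before `strptime` parses it. A minimal sketch, assuming scraped dates look like "Sat 25th Apr" (the helper name `parse_gig_date` and the hard-coded year are illustrative, mirroring the spider above):

```python
from datetime import datetime

def parse_gig_date(raw, year="2015"):
    # raw looks like "Sat 25th Apr": weekday, day with ordinal suffix, month.
    a, b, c = raw.split()
    # rstrip("ndthstr") strips the ordinal suffix letters from the right,
    # turning "25th" into "25", "3rd" into "3", "1st" into "1", etc.
    day = b.rstrip("ndthstr")
    return datetime.strptime("{} {} {} {}".format(a, day, c, year),
                             "%a %d %b %Y").strftime("%Y,%m,%d")
```

Note that because every stripped character is in the set {n, d, t, h, s, r}, a single `rstrip` call handles all four English ordinal suffixes.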

a[@class="url"] is the link I want to crawl into. li[@class="style"] holds the information I need inside that URL. Many thanks.

Here is the latest. The code I tried here produces an assertion error. A bit confused by that...

            item['url'] = info.xpath('.//a[@class="url"]/@href').extract()
            item['url'] = ''.join(str(t) for t in item['url'])
            yield Request(item['url'], callback='continue_item', meta={'item': item})

    def countinue_item(self, response):
        item = response.meta.get('item')
        item['genres'] = info.xpath('.//li[@class="style"]//text()').extract()
        print item['genres']
        return self.parse_parse_item(response)

I used the .join function to turn item['url'] into a string. Then in continue_item I crawl into the URL (or at least it should!) and return the results. But as mentioned, it isn't working yet. I don't think it's far off, though.

Best Answer

You need to continue crawling it in a new method, like:

from scrapy.http import Request
...

def parse_item(self, response):
    ...
    yield Request(item['url'], callback=self.continue_item, meta={'item': item})

def continue_item(self, response):
    item = response.meta.get('item')
    ...
    yield item
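The key points of this pattern are that `callback` must be a bound method (not the string `'continue_item'`, which is what triggers the assertion error in the update above), and that `meta` is how the half-built item travels from one callback to the next. The flow can be sketched without Scrapy or a live site; the `Request`/`Response` stand-in classes, `GigSpider`, and the sample data below are all hypothetical, written only to illustrate the hand-off:

```python
# Minimal stand-ins illustrating how `meta` carries a partially filled item
# from parse_item to continue_item (no network and no Scrapy required).
class Request(object):
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback   # a bound method, NOT a string
        self.meta = meta or {}

class Response(object):
    def __init__(self, request, genres):
        self.meta = request.meta   # Scrapy exposes the request's meta on the response
        self.genres = genres       # stands in for the li[@class="style"] text

class GigSpider(object):
    def parse_item(self, listing):
        item = {"artist": listing["artist"]}
        # Hand the half-built item to the next callback via meta.
        yield Request(listing["url"], callback=self.continue_item,
                      meta={"item": item})

    def continue_item(self, response):
        item = response.meta["item"]       # recover the item started earlier
        item["genres"] = response.genres   # add the detail-page data
        yield item

spider = GigSpider()
request = next(spider.parse_item({"artist": "Foals",
                                  "url": "http://example.com/gig"}))
response = Response(request, genres=["Rock/pop"])
item = next(request.callback(response))    # item now has both fields
```

In real Scrapy the framework fetches the URL and invokes the callback for you; the sketch only shows why the item survives the round trip.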

Regarding "python - Crawling into a further URL", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29779518/
