
python - Scrapy: a little "spider" inside a spider?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:55:51

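I am trying to scrape product review information from epinions.com. When the main review text is too long, the page shows a "read more" link that points to another page. Take "http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1" as an example: if you look at the first review, you will see what I mean.

What I want to know: is it possible to have a little spider in each iteration of the for loop that fetches the URL and scrapes the full review from the new link? I have the following code, but the little "spider" part does not work.

Here is my code: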

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from epinions_test.items import EpinionsTestItem
from scrapy.http import Response, HtmlResponse

class MySpider(BaseSpider):
    name = "epinions"
    allow_domains = ["epinions.com"]
    start_urls = ['http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="review_info"]')

        items = []
        for sites in sites:
            item = EpinionsTestItem()
            item["title"] = sites.select('h2/a/text()').extract()
            item["star"] = sites.select('span/a/span/@title').extract()
            item["date"] = sites.select('span/span/span/@title').extract()
            item["review"] = sites.select('p/span/text()').extract()
            # Everything works fine and i do have those four columns beautifully printed out, until....

            url2 = sites.select('p/span/a/@href').extract()
            url = str("http://www.epinions.com%s" %str(url2)[3:-2])
            # This url is a string. when i print it out, it's like "http://www.epinions.com/review/samsung-galaxy-note-16-gb-cell-phone/content_624031731332", which looks legit.

            response2 = HtmlResponse(url)
            # I tried in a scrapy shell, it shows that this is a htmlresponse...

            hxs2 = HtmlXPathSelector(response2)
            fullReview = hxs2.select('//div[@class = "user_review_full"]')
            item["url"] = fullReview.select('p/text()').extract()
            # The three lines above works in an independent spider, where start_url is changed to the url just generated and everything.
            # However, i got nothing from item["url"] in this code.

            items.append(item)
        return items

Why does item["url"] not return anything?

Thanks!

Best answer

You should instantiate a new Request in the callback and pass your item along in the meta dictionary:

# Written against the Scrapy API of the time (2013). BaseSpider and
# HtmlXPathSelector were later replaced by scrapy.Spider and response.xpath().
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class EpinionsTestItem(Item):
    title = Field()
    star = Field()
    date = Field()
    review = Field()


class MySpider(BaseSpider):
    name = "epinions"
    allowed_domains = ["epinions.com"]
    start_urls = ['http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="review_info"]')

        for site in sites:
            item = EpinionsTestItem()
            item["title"] = site.select('h2/a/text()').extract()
            item["star"] = site.select('span/a/span/@title').extract()
            item["date"] = site.select('span/span/span/@title').extract()

            # extract() returns a list; the [3:-2] slice strips the "[u'" and "']"
            # that surround the href in the list's Python 2 string form.
            url = site.select('p/span/a/@href').extract()
            url = str("http://www.epinions.com%s" % str(url)[3:-2])

            # Let Scrapy download the full-review page and carry the
            # partially filled item to the next callback via meta.
            yield Request(url=url, callback=self.parse_url2, meta={'item': item})

    def parse_url2(self, response):
        hxs = HtmlXPathSelector(response)

        # Retrieve the item started in parse() and complete it.
        item = response.meta['item']
        fullReview = hxs.select('//div[@class = "user_review_full"]')
        item["review"] = fullReview.select('p/text()').extract()
        yield item
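
The original attempt returned nothing because constructing an HtmlResponse from a URL does not download anything: the response has an empty body, so the XPath has nothing to match. A page is only fetched when a Request is handed back to the engine. The following is a toy, standard-library-only model of that flow in Python 3, not Scrapy's actual implementation. `FakeRequest`, `FakeResponse`, `run`, and the `PAGES` dictionary are hypothetical names made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical stand-in for the network: URL -> page text.
PAGES = {
    "/list": "review-1 review-2",
    "/full/review-1": "full text of review 1",
    "/full/review-2": "full text of review 2",
}


@dataclass
class FakeRequest:
    url: str
    callback: object
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    url: str
    body: str
    meta: dict


def parse(response):
    # Start one item per review found on the listing page.
    for name in response.body.split():
        item = {"title": name}
        # Hand the unfinished item to the next callback through meta.
        yield FakeRequest("/full/" + name, parse_full, meta={"item": item})


def parse_full(response):
    # Pick up the item started in parse() and complete it.
    item = response.meta["item"]
    item["review"] = response.body
    yield item


def run(start_url):
    """Toy engine: fetch queued requests, call callbacks, collect items."""
    queue = deque([FakeRequest(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)  # more pages to fetch
            else:
                items.append(result)  # a finished item
    return items


items = run("/list")
```

The point of the sketch is that the callback never fetches a page itself. It only describes what to fetch next and which function should handle the result, and the item travels with the request.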

See also the documentation.
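
A side note on the `str(url)[3:-2]` slice used in the code above: it depends on how Python 2 prints a list of unicode strings, so it is fragile. Here is a minimal sketch of a sturdier alternative in Python 3, using only the standard library: take the first extracted href and resolve it against the page URL with `urljoin`. The helper name `first_absolute_url` and the sample href list are hypothetical. Inside a Scrapy callback, `response.urljoin(href)` serves the same purpose in Scrapy 1.0 and later.

```python
from urllib.parse import urljoin


def first_absolute_url(base_url, hrefs):
    """Return the first href resolved against base_url, or None if empty."""
    if not hrefs:
        return None
    return urljoin(base_url, hrefs[0])


page = "http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1"
# Shaped like the list that extract() returns for the "read more" link.
hrefs = ["/review/samsung-galaxy-note-16-gb-cell-phone/content_624031731332"]
full_url = first_absolute_url(page, hrefs)
```

Returning None for an empty list lets the caller skip reviews that have no "read more" link. With the slice approach, an empty list would produce a request for the bare site root.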

Hope that helps.

Regarding python - Scrapy: a little "spider" inside a spider?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17283423/
