
python - Scrapy: crawls but doesn't scrape


With the suggestions provided and a lot of tracing, I was able to get scraping working for a single page. Now I'm trying to modify the code to use multiple rules, but it isn't going well. Here is a brief description of what I'm trying to do:

For start_url = ttp://sfbay.craigslist.org/ I use parse_items_1 to identify http://sfbay.craigslist.org/npo and parse that page to identify its links.

At level 2, for the links found on ttp://sfbay.craigslist.org/npo, I need to use parse_items_2 to identify links like http://sfbay.craigslist.org/npo/index100.html and parse those pages as well.

The spider is able to crawl (I can see it in the log output), but the links are not being scraped:

2013-02-13 11:23:55+0530 [craigs] DEBUG: Crawled (200) <GET http://sfbay.craigslist.org/npo/index100.html> (referer: http://sfbay.craigslist.org/npo/)
('**parse_items_2:', [u'Development Associate'], [u'http://sfbay.craigslist.org/eby/npo/3610841951.html'])
('**parse_items_2:', [u'Resource Development Assistant'], [u'http://sfbay.craigslist.org/eby/npo/3610835088.html'])

But the link and title come out empty when scraped:

2013-02-13 11:23:55+0530 [craigs] DEBUG: Scraped from <200 http://sfbay.craigslist.org/npo/index100.html>
{'link': [], 'title': []}
2013-02-13 11:23:55+0530 [craigs] DEBUG: Scraped from <200 http://sfbay.craigslist.org/npo/index100.html>
{'link': [], 'title': []}

Code details:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from myspider.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html")), callback="parse_items_2", follow=True),
        Rule(SgmlLinkExtractor(allow=(r'sfbay.craigslist.org/npo')), callback="parse_items_1", follow=True),
    )

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        self.items = []
        self.item = CraigslistSampleItem()

    def parse_items_1(self, response):
        # print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div")
        for title in titles:
            self.item["title"] = title.select("//li/a/text()").extract()
            self.item["link"] = title.select("//li/a/@href").extract()
            print ('**parse-items_1:', self.item["title"])
            self.items.append(self.item)
        return self.items

    def parse_items_2(self, response):
        # print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for title in titles:
            self.item["title"] = title.select("a/text()").extract()
            self.item["link"] = title.select("a/@href").extract()
            print ('**parse_items_2:', self.item["title"], self.item["link"])
            self.items.append(self.item)
        return self.items

Any help is much appreciated!

Thanks.

Best Answer

In the Scrapy tutorial, items are created inside the callbacks and then returned so they can be passed further down the pipeline, rather than being bound to the spider instance. So removing the __init__ part and rewriting some of the callback code accordingly seems to fix the problem.
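A likely reason the original version prints values but scrapes empty items is that every call to self.items.append(self.item) appends another reference to the same CraigslistSampleItem instance, so all collected entries end up showing whatever the last loop iteration wrote into it, which is often an empty list because many of the matched nodes contain no link. A minimal illustration of that aliasing, using a plain dict instead of a Scrapy item:

# Illustration only (not from the original post): reusing one mutable
# object means every appended entry reflects its final state.
shared = {}
results = []
for value in (["Development Associate"], []):
    shared["title"] = value   # overwrites the same dict on each pass
    results.append(shared)    # appends another reference, not a copy
print(results)                # [{'title': []}, {'title': []}]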

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html")), callback="parse_items_2", follow=True),
        Rule(SgmlLinkExtractor(allow=(r'sfbay.craigslist.org/npo')), callback="parse_items_1", follow=True),
    )

    def parse_items_1(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div")
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("//li/a/text()").extract()
            item["link"] = title.select("//li/a/@href").extract()
            print ('**parse-items_1:', item["title"])
            items.append(item)
        return items

    def parse_items_2(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            print ('**parse_items_2:', item["title"], item["link"])
            items.append(item)
        return items
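Both versions import CraigslistSampleItem from an items module that the post itself does not include. A minimal definition consistent with the two fields used in the callbacks might look like the following (an assumption, since the original items.py is not shown):

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    # Assumed field definitions; only "title" and "link" are used above.
    title = Field()
    link = Field()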

To test, I dumped the scraped items to a file (scrapy crawl craigs -t json -o items.json). I noticed frequent empty entries and a lot of "terms of use" links. Those suggest that your extraction XPaths could be tightened, but other than that it seems to be working.
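One possible tightening, as a sketch rather than part of the original answer, and using the same imports as the spider above: iterate only over p elements that actually contain a link, and skip any row whose title or href extracts to an empty list.

def parse_items_2(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # Visit only <p> elements that contain at least one <a>, and skip rows
    # whose text or href extracts empty, so no blank items are returned.
    # Filtering out navigation links such as "terms of use" would still
    # need a more specific row selector for the listing markup.
    for row in hxs.select("//p[a]"):
        title = row.select("a/text()").extract()
        link = row.select("a/@href").extract()
        if not title or not link:
            continue
        item = CraigslistSampleItem()
        item["title"] = title
        item["link"] = link
        items.append(item)
    return items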

Regarding python - Scrapy: crawls but doesn't scrape, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14847366/
