
python - Unable to scrape elements from a particular website with a scrapy spider


I want to get the web addresses of some jobs, so I wrote a scrapy spider. I want to get all the values with the xpath //article/dl/dd/h2/a[@class="job-title"]/@href, but when I execute the spider with the command:

scrapy crawl auseek -a addsthreshold=3

the variable urls that is supposed to hold the values is empty. Can someone help me figure it out?

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy.http import Request   # needed for the Request yielded in parse_start_url
from scrapy import log
from scrapy import signals
import urlparse                   # needed for urlparse.urljoin in parse_start_url

from myProj.items import ADItem
import time

class AuSeekSpider(CrawlSpider):
    name = "auseek"
    result_address = []
    addressCount = int(0)
    addressThresh = int(0)
    allowed_domains = ["seek.com.au"]
    start_urls = [
        "http://www.seek.com.au/jobs/in-australia/"
    ]

    def __init__(self, **kwargs):
        super(AuSeekSpider, self).__init__()
        self.addressThresh = int(kwargs.get('addsthreshold'))
        print 'init finished...'

    def parse_start_url(self, response):
        print 'This is start url function'
        log.msg("Pipeline.spider_opened called", level=log.INFO)
        hxs = Selector(response)
        urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()
        print 'urls is:', urls
        print 'test element:', urls[0].encode("ascii")
        for url in urls:
            # extract() already returns the href strings, so use the value directly
            postfix = url
            print 'postfix:', postfix
            url = urlparse.urljoin(response.url, postfix)
            yield Request(url, callback=self.parse_ad)


    def parse_ad(self, response):
        print 'this is parse_ad function'
        hxs = Selector(response)

        item = ADItem()
        log.msg("Pipeline.parse_ad called", level=log.INFO)
        item['name'] = str(self.name)
        item['picNum'] = str(6)
        item['link'] = response.url
        item['date'] = time.strftime('%Y%m%d', time.localtime(time.time()))

        self.addressCount = self.addressCount + 1
        if self.addressCount > self.addressThresh:
            raise CloseSpider('Get enough website address')
        return item

The problem is this line:

urls = hxs.xpath('//article/dl/dd/h2/a[@class="job-title"]/@href').extract()

urls is empty when I try to print it. I just don't understand why it doesn't work and how I can correct it. Thanks for your help.

Best Answer

Here is a working example of using selenium and the phantomjs headless webdriver in a downloader middleware.

from scrapy.http import HtmlResponse
from selenium import webdriver

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
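
Caveat: the snippet above starts a new PhantomJS process per request and never shuts it down. A minimal variant (my adjustment, not part of the original answer) that releases the driver once the page source has been captured:

from scrapy.http import HtmlResponse
from selenium import webdriver

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        try:
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
        finally:
            driver.quit()  # release the phantomjs process even if the page load fails
        return HtmlResponse(request.url, encoding='utf-8', body=body)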

I wanted to be able to tell different spiders which middleware to use, so I implemented this wrapper:

import functools
from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None

    return wrapper

settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

and to include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])
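
For illustration, a minimal spider that opts in to the JsDownload middleware above might look like this (the import path is hypothetical and depends on where the class lives in your project):

from scrapy.contrib.spiders import CrawlSpider
from MyProj.middleware.MiddleWareModule import JsDownload  # hypothetical path

class JsSpider(CrawlSpider):
    name = "js_spider"
    start_urls = ["http://www.seek.com.au/jobs/in-australia/"]
    # check_spider_middleware consults this set to decide whether
    # JsDownload.process_request runs for this spider's requests
    middleware = set([JsDownload])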

You could implement this in a request callback (in the spider), but then the http request would happen twice. It isn't a foolproof solution, but it works for content that loads on .ready(). If you spend some time reading into selenium, you can wait for a specific event to trigger before saving the page source.
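
For example, a sketch of an explicit wait, assuming the a.job-title links from the question are what the page renders via JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
driver.get('http://www.seek.com.au/jobs/in-australia/')
# block for up to 10 seconds until at least one job link is in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'a.job-title'))
)
body = driver.page_source.encode('utf-8')
driver.quit()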

Another example: https://github.com/scrapinghub/scrapyjs

More info: What's the best way of scraping data from a website?

Cheers!

Regarding "python - Unable to scrape elements from a particular website with a scrapy spider", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24423331/
