
Python Scrapy: Parsing extracted links with another function

Reposted · Author: 行者123 · Updated: 2023-12-04 23:39:12

I'm new to Scrapy and, as a learning exercise, I'm trying to scrape Yellow Pages. Everything works, but I also want the email address. To get it, I need to visit the link extracted inside parse and parse that page with a separate parse_email function, but it isn't working.

I mean, I tested the parse_email function by itself and it works, but it doesn't work when called from the main parse function. I want parse_email to fetch the page behind the link, so I call it via a callback, but for some reason it just returns the request itself, like <GET https://www.yellowpages.com/los-angeles-ca/mip/palm-tree-la-7254813?lid=7254813>, where it should return the email. The parse_email function never runs; it just returns the link without opening the page.

Here is my code, with comments:

import scrapy
import requests
from urlparse import urljoin  # Python 2; on Python 3 this is: from urllib.parse import urljoin

scrapy.optional_features.remove('boto')

class YellowSpider(scrapy.Spider):
    name = 'yellow spider'
    start_urls = ['https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Los+Angeles%2C+CA']

    def parse(self, response):
        SET_SELECTOR = '.info'
        for brickset in response.css(SET_SELECTOR):

            NAME_SELECTOR = 'h3 a ::text'
            ADDRESS_SELECTOR = '.adr ::text'
            PHONE = '.phone.primary ::text'
            WEBSITE = '.links a ::attr(href)'

            # Getting the link of the page that has the email using this selector
            EMAIL_SELECTOR = 'h3 a ::attr(href)'

            # Extracting the link
            email = brickset.css(EMAIL_SELECTOR).extract_first()

            # Joining to make the complete URL
            url = urljoin(response.url, brickset.css('h3 a ::attr(href)').extract_first())

            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
                'phone': brickset.css(PHONE).extract_first(),
                'website': brickset.css(WEBSITE).extract_first(),

                # ONLY returning the link of the page, not calling the function
                'email': scrapy.Request(url, callback=self.parse_email),
            }

        NEXT_PAGE_SELECTOR = '.pagination ul a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract()[-1]
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_email(self, response):

        # XPath for the email address on the nested page
        EMAIL_SELECTOR = '//a[@class="email-business"]/@href'

        # Returning the extracted email. THE XPATH WORKS, I CHECKED, BUT THE FUNCTION IS NOT BEING CALLED FOR SOME REASON
        yield {
            'email': response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
        }

I don't know what I'm doing wrong.

Best Answer

You are yielding a dict with a Request inside it. Scrapy will not schedule that request, because it has no way of knowing it is there (requests are not dispatched automatically just by being created). You need to yield the actual Request.

In the parse_email function, in order to "remember" which item each email belongs to, you need to pass the rest of the item's data along with the request. You can do this with the meta argument.

Example:

In parse:

yield scrapy.Request(url, callback=self.parse_email, meta={'item': {
    'name': brickset.css(NAME_SELECTOR).extract_first(),
    'address': brickset.css(ADDRESS_SELECTOR).extract_first(),
    'phone': brickset.css(PHONE).extract_first(),
    'website': brickset.css(WEBSITE).extract_first(),
}})

In parse_email:

item = response.meta['item']  # The item this email belongs to
item['email'] = response.xpath(EMAIL_SELECTOR).extract_first().replace('mailto:', '')
return item
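The scheduling behaviour described above can be illustrated without a network. The sketch below is plain Python, not Scrapy: `FakeRequest`, `FakeResponse`, `run_spider`, and the example URLs are all hypothetical stand-ins that mimic how the engine drains what a callback yields. Dicts are collected as items, while Request objects get "downloaded" and their callbacks invoked with `meta` carried onto the response.

```python
# Hypothetical stand-ins for Scrapy's Request/Response (illustration only)
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    def __init__(self, url, body, meta):
        self.url = url
        self.body = body
        self.meta = meta

def run_spider(start_results, pages):
    """Drain yielded results the way an engine would: dicts become final
    items; FakeRequests are 'downloaded' from the pages mapping and their
    callbacks are invoked with meta copied onto the response."""
    items, queue = [], list(start_results)
    while queue:
        result = queue.pop(0)
        if isinstance(result, FakeRequest):
            response = FakeResponse(result.url, pages[result.url], result.meta)
            queue.extend(result.callback(response))  # callback yields more results
        else:
            items.append(result)  # a plain dict is a finished item
    return items

# Callbacks mirroring the answer's parse / parse_email pattern
def parse_email(response):
    item = response.meta['item']              # the item this email belongs to
    item['email'] = response.body.replace('mailto:', '')
    yield item

def parse(response):
    # Yield the actual Request, carrying the partial item in meta
    yield FakeRequest('https://example.com/mip/1', parse_email,
                      meta={'item': {'name': 'Palm Tree LA'}})

pages = {
    'https://example.com/search': '',
    'https://example.com/mip/1': 'mailto:info@example.com',
}
items = run_spider(parse(FakeResponse('https://example.com/search', '', {})), pages)
print(items)  # [{'name': 'Palm Tree LA', 'email': 'info@example.com'}]
```

Yielding the dict from `parse` directly would have ended the chain immediately, which is exactly the bug in the question. As a side note, in Scrapy 1.7 and later, `cb_kwargs` is the recommended way to pass data to a callback, with `meta` reserved for middleware-level settings.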

Regarding "Python Scrapy: Parsing extracted links with another function," we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42769246/
