
python - How do I make start_urls in Scrapy use URLs generated by another Python function?


Here is my code that fetches item URLs from eBay, i.e. link3:

import urllib2
from bs4 import BeautifulSoup

def url_soup(url):
    # Fetch the search-results page and parse it
    source = urllib2.urlopen(url).read()
    soup = BeautifulSoup(source)
    # Grab every item link on the page
    link = soup.select('a.ListItemLink')
    for links in link:
        # NOTE: link3 is reassigned on each pass, so only the last URL survives
        link3 = 'http://www.ebay.com/%s' % links['href']
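(As written, url_soup() throws away all but the last link. A minimal variant that collects every item URL into a list instead — my own illustrative sketch, not part of the original question:)

def url_soup_links(url):
    # Same scraping logic, but return all item URLs
    source = urllib2.urlopen(url).read()
    soup = BeautifulSoup(source)
    return ['http://www.ebay.com/%s' % a['href']
            for a in soup.select('a.ListItemLink')]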


Dept={"All Departments":"0","Apparel":"5438","Auto":"91083","Baby":"5427","Beauty":"1085666",
"Books":"3920","Electronics":"3944","Gifts":"1094765","Grocery":"976759","Health":"976760",
"Home":"4044","Home Improvement":"1072864","Jwelery":"3891","Movies":"4096","Music":"4104",
"Party":"2637","Patio":"5428","Pets":"5440","Pharmacy":"5431","Photo Center":"5426",
"Sports":"4125","Toys":"4171","Video Games":"2636"}

def gen_url(keyword, domain):
    # Look up the department id and build the search URL
    if domain in Dept:
        main_url = ('http://www.ebay.com/search/search-ng.do?search_query=%s'
                    '&ic=16_0&Find=Find&search_constraint=%s') % (keyword, Dept[domain])
        url_soup(main_url)

gen_url('Bags','Apparel')
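With the values above, gen_url('Bags', 'Apparel') looks up Dept['Apparel'] == '5438' and fetches:

http://www.ebay.com/search/search-ng.do?search_query=Bags&ic=16_0&Find=Find&search_constraint=5438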

Now I want my spider to use link3 as its start_urls every time. P.S. I am new to Scrapy!

Best Answer

You need to define a start_requests() method to dynamically define the URLs the spider starts from.

For example, you should have something like this:

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = "my_spider"
    domains = ['Auto']
    departments = {"All Departments": "0", "Apparel": "5438", "Auto": "91083", "Baby": "5427",
                   "Beauty": "1085666", "Books": "3920", "Electronics": "3944", "Gifts": "1094765",
                   "Grocery": "976759", "Health": "976760", "Home": "4044",
                   "Home Improvement": "1072864", "Jwelery": "3891", "Movies": "4096",
                   "Music": "4104", "Party": "2637", "Patio": "5428", "Pets": "5440",
                   "Pharmacy": "5431", "Photo Center": "5426", "Sports": "4125",
                   "Toys": "4171", "Video Games": "2636"}
    keyword = 'Auto'

    allowed_domains = ['ebay.com']

    def start_requests(self):
        # Generate the starting requests here instead of hard-coding start_urls
        for domain in self.domains:
            if domain in self.departments:
                url = ('http://www.ebay.com/search/search-ng.do?search_query=%s'
                       '&ic=16_0&Find=Find&search_constraint=%s'
                       % (self.keyword, self.departments[domain]))
                print "YIELDING"
                yield Request(url)

    def parse(self, response):
        print "IN PARSE"
        sel = Selector(response)
        links = sel.xpath('//a[@class="ListItemLink"]/@href')
        for link in links:
            href = link.extract()
            yield Request('http://www.ebay.com/' + href, self.parse_data)

    def parse_data(self, response):
        # do your actual crawling here
        print "IN PARSE DATA"

Hope this helps.

Regarding "python - How do I make start_urls in Scrapy use URLs generated by another Python function?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22807236/
