
python - How to fetch data from MySQL and use a Scrapy spider to extract data from the web

Reposted. Author: 行者123. Updated: 2023-11-29 02:58:33

I have a spider and a pipeline, and I wrote code to extract data from the web and insert it into MySQL. This part is working:

import scrapy

from amazoncrawler.items import AmazoncrawlerItem  # adjust to your project's items module


class AmazonAllDepartmentSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "http://www.amazon.com/gp/site-directory/ref=nav_sad/187-3757581-3331414"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
            item = AmazoncrawlerItem()
            # pop() takes the single extracted string out of the result list
            item['title'] = sel.xpath('a/text()').extract().pop()
            item['link'] = sel.xpath('a/@href').extract().pop()
            item['desc'] = sel.xpath('text()').extract()
            yield item

import MySQLdb


class AmazoncrawlerPipeline(object):
    host = 'qwerty.com'
    user = 'qwerty'
    password = 'qwerty123'
    db = 'amazon_project'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                """INSERT INTO amazon_project.ProductDepartment (ProductTitle, ProductDepartmentLilnk)
                   VALUES (%s, %s)""",
                (item['title'], 'amazon.com' + str(item.get('link'))))
            self.connection.commit()
        except MySQLdb.Error as e:
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item
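The same insert-and-commit pattern can be exercised without a MySQL server. Below is a minimal sketch using the stdlib sqlite3 module; the table and column names mirror the ones above, but the class name and schema here are invented for illustration (note sqlite3 uses `?` placeholders where MySQLdb uses `%s`):

```python
import sqlite3


class ProductDepartmentPipeline(object):
    """Illustrative stand-in for AmazoncrawlerPipeline, backed by sqlite3."""

    def __init__(self, db_path=":memory:"):
        self.connection = sqlite3.connect(db_path)
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS ProductDepartment "
            "(ProductTitle TEXT, ProductDepartmentLilnk TEXT)")

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes values, preventing SQL injection
        self.cursor.execute(
            "INSERT INTO ProductDepartment (ProductTitle, ProductDepartmentLilnk) "
            "VALUES (?, ?)",
            (item['title'], 'amazon.com' + str(item.get('link'))))
        self.connection.commit()
        return item


pipeline = ProductDepartmentPipeline()
pipeline.process_item({'title': 'Books', 'link': '/books'}, spider=None)
row = pipeline.cursor.execute(
    "SELECT ProductTitle, ProductDepartmentLilnk FROM ProductDepartment").fetchone()
print(row)  # ('Books', 'amazon.com/books')
```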

Now I want to fetch that data (the stored URL links) and call the spider again to extract data from the web.

How can I do this? Thanks.

Best Answer

This should be solved at the spider level.

To follow a link, you can yield a Request after yielding an item instance:

    def parse(self, response):
        for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
            item = AmazoncrawlerItem()
            item['title'] = sel.xpath('a/text()').extract().pop()
            item['link'] = sel.xpath('a/@href').extract().pop()
            item['desc'] = sel.xpath('text()').extract()
            yield item
            # follow the extracted link with a separate callback
            yield Request(item['link'], callback=self.parse_link)

Alternatively, you could change strategy and switch to Link Extractors.
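For context, a link extractor essentially scans anchor tags, resolves relative hrefs against the page URL, and deduplicates the results. A rough stdlib-only sketch of that idea (this is an illustration, not Scrapy's actual LinkExtractor implementation):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkExtractor(HTMLParser):
    """Collects absolute, deduplicated URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                absolute = urljoin(self.base_url, href)  # resolve relative links
                if absolute not in self.links:
                    self.links.append(absolute)


extractor = SimpleLinkExtractor('http://www.amazon.com/')
extractor.feed('<ul class="nav_cat_links">'
               '<li><a href="/gp/books">Books</a></li>'
               '<li><a href="/gp/books">Books</a></li></ul>')
print(extractor.links)  # ['http://www.amazon.com/gp/books']
```

In real Scrapy code you would use scrapy.linkextractors.LinkExtractor together with a CrawlSpider and Rule objects instead of hand-rolling this.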


UPD (after discussion in the comments):

If the links are already in the database, you need to start another spider that reads the links from the database in start_requests() and yields Requests:

import scrapy
from scrapy.http import Request

import MySQLdb


class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        connection = MySQLdb.connect(<connection params here>)
        cursor = connection.cursor()

        cursor.execute("SELECT ProductDepartmentLilnk FROM amazon_project.ProductDepartment")
        links = cursor.fetchall()

        for link in links:
            # fetchall() returns row tuples, so unpack the URL column
            yield Request(link[0], callback=self.parse)

        cursor.close()

    ...
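One detail worth stressing: cursor.fetchall() returns a sequence of row tuples, not bare strings, so each row must be unpacked before being passed to Request. A quick demonstration with the stdlib sqlite3 module (the table name mirrors the one above; the data is invented):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE ProductDepartment (ProductDepartmentLilnk TEXT)")
cursor.execute("INSERT INTO ProductDepartment VALUES ('http://www.amazon.com/gp/books')")

cursor.execute("SELECT ProductDepartmentLilnk FROM ProductDepartment")
links = cursor.fetchall()
print(links)  # [('http://www.amazon.com/gp/books',)] -- a list of 1-tuples

urls = [row[0] for row in links]  # unpack each row before building Requests
print(urls)   # ['http://www.amazon.com/gp/books']
cursor.close()
```

Passing the raw tuple to Request would fail, since Request expects a URL string as its first argument.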

Regarding "python - How to fetch data from MySQL and use a Scrapy spider to extract data from the web", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27429617/
