python - Django relationships with Scrapy: how do I save items?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:32:27

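I just need to understand how to detect, inside the spider, whether Scrapy has already saved an item. I am fetching an item from a site and then fetching the comments on that item, so I have to save the item first and save the comments afterwards. But when I put code after the yield, I get this error: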

save() prohibited to prevent data loss due to unsaved related object ''.

Here is my code:

def parseProductComments(self, response):

    name = response.css('h1.product-name::text').extract_first()
    price = response.css('span[id=offering-price] > span::text').extract_first()
    node = response.xpath("//script[contains(text(),'var utagData = ')]/text()")
    data = node.re('= (\{.+\})')[0]  #data = xpath.re(" = (\{.+\})")
    data = json.loads(data)

    barcode = data['product_barcode']

    objectImages = []
    for imageThumDiv in response.css('div[id=productThumbnailsCarousel]'):
        images = imageThumDiv.xpath('img/@data-src').extract()
        for image in images:
            imageQuality = image.replace('/80/', '/500/')
            objectImages.append(imageQuality)
    company = Company.objects.get(pk=3)
    comments = []
    item = ProductItem(name=name, price=price, barcode=barcode, file_urls=objectImages, product_url=response.url,product_company=company, comments = comments)
    yield item
    print item["pk"]
    for commentUl in response.css('ul.chevron-list-container'):

        url = commentUl.css('span.link-more-results::attr(href)').extract_first()
        if url is not None:
            for commentLi in commentUl.css('li.review-item'):
                comment = commentLi.css('p::text').extract_first()
                commentItem = CommentItem(comment=comment, product=item.instance)

                yield commentItem
        else:

            yield scrapy.Request(response.urljoin(url), callback=self.parseCommentsPages, meta={'item': item.instance})

Here is my pipeline:

def comment_to_model(item):
    model_class = getattr(item, 'Comment')
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

def get_comment_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product=model.product, comment=model.comment)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)

def get_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product_company=model.product_company, barcode=model.barcode)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()
    return destination


class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']
            item_model = item.instance
            model, created = get_or_create(item_model)
            #update_model(model, item_model)

            if created:
                for image in item['files']:
                    imageItem = ProductImageItem(image=image['path'], product=item.instance)
                    imageItem.save()
                # for comment in item['comments']:
                #     commentItem = CommentItem(comment=comment, product= item.instance)
                #     commentItem.save()
            return item
        if isinstance(item, CommentItem):
            comment_to_model = item.instance
            model, created = get_comment_or_create(comment_to_model)
            if created:
                print model
            else:
                print created
            return item

Best answer

get_or_create

Most of your code seems to be working around an apparent weakness of get_or_create:

# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.

Fortunately, this apparent shortcoming is easy to overcome, thanks to the defaults parameter of get_or_create:

Any keyword arguments passed to get_or_create() — except an optional one called defaults — will be used in a get() call. If an object is found, get_or_create() returns a tuple of that object and False. If multiple objects are found, get_or_create raises MultipleObjectsReturned. If an object is not found, get_or_create() will instantiate and save a new object, returning a tuple of the new object and True.
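
To make the quoted behaviour concrete, here is a minimal sketch in plain Python. It is not Django's implementation: the in-memory `ToyManager` class, its list of dict rows, and the field values are all assumptions made up for illustration. Only the split between lookup kwargs and `defaults` follows the documentation quoted above.

```python
class MultipleObjectsReturned(Exception):
    """Raised when a lookup matches more than one row (toy stand-in)."""


class ToyManager:
    """Hypothetical in-memory stand-in for a model manager."""

    def __init__(self):
        self.rows = []  # each row is a plain dict of field values

    def get_or_create(self, defaults=None, **lookup):
        # Only the lookup kwargs are used to find an existing row.
        matches = [row for row in self.rows
                   if all(row.get(k) == v for k, v in lookup.items())]
        if len(matches) > 1:
            raise MultipleObjectsReturned(lookup)
        if matches:
            return matches[0], False
        # Nothing found: create a row from the lookup plus the defaults.
        row = dict(lookup)
        row.update(defaults or {})
        self.rows.append(row)
        return row, True


products = ToyManager()
first, created_first = products.get_or_create(
    company=3, barcode="123", defaults={"price": "10.00"})
# Same lookup keys, different price: the existing row is returned unchanged.
second, created_second = products.get_or_create(
    company=3, barcode="123", defaults={"price": "12.50"})
```

Because the price sits in `defaults` rather than in the lookup, a changed price does not create a second row, which is the concern raised in the code comment from the question.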

update_or_create

Still not convinced that get_or_create is the right tool for the job? Neither am I. There is something even better: update_or_create!

A convenience method for updating an object with the given kwargs, creating a new one if necessary. The defaults is a dictionary of (field, value) pairs used to update the object.
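
The difference from get_or_create can be sketched in plain Python as well. This is a toy under stated assumptions, not Django's code: the `update_or_create` function below works on a hypothetical list of dicts, and unlike the real method it simply updates the first match instead of raising when several rows match.

```python
def update_or_create(rows, defaults=None, **lookup):
    """Toy version: find a row by the lookup kwargs, then apply defaults."""
    defaults = defaults or {}
    for row in rows:
        if all(row.get(k) == v for k, v in lookup.items()):
            row.update(defaults)  # existing row: overwrite with the new values
            return row, False
    row = dict(lookup, **defaults)  # no match: create a new row
    rows.append(row)
    return row, True


rows = []
_, created_first = update_or_create(
    rows, company=3, barcode="123", defaults={"price": "10.00"})
# Second call with the same lookup keys updates the stored price in place.
row, created_second = update_or_create(
    rows, company=3, barcode="123", defaults={"price": "12.50"})
```

Here the second call leaves a single row whose price is the newer value, whereas the get_or_create behaviour would have kept the original price.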

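But I won't dwell on how to use update_or_create, because the line in your code that tries to update the model is commented out, and you haven't said clearly what you want to update.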

The new pipeline

With the standard API methods, the module containing your pipeline reduces to just the ProductItemPipeline class, which can be modified as follows:

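class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']

            # get_or_create is a manager method, so it is called on the Django
            # model behind the DjangoItem (django_model), not on the item class.
            # 'Other_field1', value1, etc. are placeholders for your own fields.
            model, created = ProductItem.django_model.objects.get_or_create(
                product_company=item['product_company'], barcode=item['barcode'],
                defaults={'Other_field1': value1, 'Other_field2': value2})

            if created:
                for image in item['files']:
                    # Relate each image to the saved model returned above.
                    imageItem = ProductImageItem(image=image['path'], product=model)
                    imageItem.save()
            return item

        if isinstance(item, CommentItem):

            # field1/value1 are placeholders; the remaining fields go in defaults.
            model, created = CommentItem.django_model.objects.get_or_create(
                field1=value1, defaults={})

            if created:
                print(model)
            else:
                print(created)
            return item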

The bug in the original code

I believe this is where the error comes from:

  obj = model_class.objects.get(product=model.product, comment=model.comment)

We no longer use that call, so the error should go away. If you still have problems, please paste the full traceback.
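
For context, the message in the question is what Django reports when an object is saved while one of its related objects has no primary key yet. The following is a minimal sketch of that rule in plain Python, assuming hypothetical `ToyProduct` and `ToyComment` classes that only imitate the check; they are not Django models and the error text is abbreviated.

```python
import itertools

_pk_counter = itertools.count(1)  # toy primary-key sequence


class ToyProduct:
    def __init__(self, name):
        self.name = name
        self.pk = None  # no primary key until save() is called

    def save(self):
        if self.pk is None:
            self.pk = next(_pk_counter)


class ToyComment:
    def __init__(self, text, product):
        self.text = text
        self.product = product
        self.pk = None

    def save(self):
        # Imitates the guard: refuse to save while the parent is unsaved.
        if self.product.pk is None:
            raise ValueError(
                "save() prohibited: related object 'product' is unsaved")
        if self.pk is None:
            self.pk = next(_pk_counter)


product = ToyProduct("example")
comment = ToyComment("nice", product)

try:
    comment.save()  # the parent has not been saved yet
    failed = False
except ValueError:
    failed = True

product.save()  # save the parent first ...
comment.save()  # ... then the child can be saved
```

In a pipeline this means making sure the product row exists before any comment that points to it is saved.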

Regarding "python - Django relationships with Scrapy: how do I save items?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41448443/
