
Python spider returns an empty JSON file


I created a Scrapy spider in Python and want to store the scraped data in a JSON file, but the JSON file comes out empty even though the spider extracts all the data. When I run the crawl command, the spider prints all the data in the terminal, but none of it ends up in the JSON file. I can't find a solution, so I'm sharing the spider and items.py below.

I run the spider with this command: scrapy crawl scraper -o products.json
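(As an aside, the same export can also be configured in the project's settings.py instead of passing -o on the command line. This is a minimal sketch assuming Scrapy 2.1 or later, where the FEEDS setting was introduced; the output file name is just an example.)

# settings.py
FEEDS = {
    'products.json': {'format': 'json'},
}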

Spider.py

import scrapy
from bs4 import BeautifulSoup as Soup
from ..items import ScrapyArbiItem
import requests
from idna import unicode


class Scraper(scrapy.Spider):
    name = "scraper"

    start_urls = [
        'https://www.fenom.com/en/263-men',
        # 'https://www.fenom.com/en/263-men#/page-2',
        # 'https://www.fenom.com/en/263-men#/page-3',
        # 'https://www.fenom.com/en/263-men#/page-4',
        # 'https://www.fenom.com/en/263-men#/page-5',
        # 'https://www.fenom.com/en/263-men#/page-6',
        # 'https://www.fenom.com/en/263-men#/page-7',
    ]

    def parse(self, response):
        items = ScrapyArbiItem()

        page_soup = Soup(response.text, 'html.parser')
        uls = page_soup.find_all("ul", class_="product_list grid row")[0]
        for li in uls.find_all("li", class_="ajax_block_product block_home col-xs-6 col-sm-4 col-md-3"):
            data_to_write = []
            try:
                div = li.find('div', class_='product-container')
                left_block = div.find('div', class_="left-block")
                image_container = left_block.find('div', class_="product-image-container")
                image = image_container.find('a')
                image_url_a = image_container.find('a', class_="product_img_link")
                image_url = image_url_a.find('img', class_='replace-2x img-responsive')
                image_url = image_url.get('src')  # image URL
                url = image.get('href')  # URL of product
                right_block = div.find('div', class_="right-block")
                right_a = right_block.find('a')
                product = right_a.find('span', class_="product-name")
                product_name = product.text  # product name
                pp = right_a.find('span', class_="content_price")
                product_p = pp.find('span', class_="product-price")
                product_price = product_p.text  # product price

                items['product_name'] = product_name
                items['product_price'] = product_price
                items['url'] = url

                print(items)  # items are printed, never yielded
                next_page = url
                # if url:
                #     yield scrapy.Request(url, callback=self.parsetwo, dont_filter=True)
            except:
                pass

items.py

This file acts as a temporary container that holds all the extracted data.

import scrapy


class ScrapyArbiItem(scrapy.Item):
    # define the fields for your item here:
    product_name = scrapy.Field()
    product_price = scrapy.Field()
    url = scrapy.Field()

Best Answer

I used yield items instead of print(items), and that solved the problem: Scrapy's feed exports (the -o option) only serialize items that are yielded from a spider callback, while print only writes them to the terminal.

import scrapy
from bs4 import BeautifulSoup as Soup
from ..items import ScrapyArbiItem
import requests
from idna import unicode


class Scraper(scrapy.Spider):
    name = "scraper"

    page_number = 2  # for pagination

    start_urls = [
        'https://www.fenom.com/en/263-men#/page-1',  # first page
    ]

    def parse(self, response):
        items = ScrapyArbiItem()  # item container storing the extracted data

        page_soup = Soup(response.text, 'html.parser')
        uls = page_soup.find_all("ul", class_="product_list grid row")[0]

        for li in uls.find_all("li", class_="ajax_block_product block_home col-xs-6 col-sm-4 col-md-3"):
            try:
                div = li.find('div', class_='product-container')
                left_block = div.find('div', class_="left-block")
                image_container = left_block.find('div', class_="product-image-container")
                image = image_container.find('a')
                image_url_a = image_container.find('a', class_="product_img_link")
                image_url = image_url_a.find('img', class_='replace-2x img-responsive')
                image_url = image_url.get('src')  # image URL
                url = image.get('href')  # URL of product
                right_block = div.find('div', class_="right-block")
                right_a = right_block.find('a')
                product = right_a.find('span', class_="product-name")
                product_name = product.text  # product name
                pp = right_a.find('span', class_="content_price")
                product_p = pp.find('span', class_="product-price")
                product_price = product_p.text  # product price

                items['product_name'] = product_name
                items['product_price'] = product_price
                items['url'] = url

                yield items  # yielded items reach the feed export
            except:
                pass
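One caveat: page_number is declared above but never used. A minimal pagination sketch (my own addition, not part of the accepted answer) could look like the following; the class name, last_page limit, and URL pattern are assumptions. Note that dont_filter=True is needed because Scrapy's duplicate filter ignores URL fragments, and if the server also ignores the fragment (common with client-side pagination), every request will return the same HTML, in which case a real server-side pagination URL should be used instead.

import scrapy


class PaginatedScraper(scrapy.Spider):
    # hypothetical spider sketching fragment-based pagination
    name = "paginated_scraper"
    page_number = 2
    last_page = 7  # assumed number of listing pages
    start_urls = ['https://www.fenom.com/en/263-men#/page-1']

    def parse(self, response):
        # ... extract and yield items here, as in the accepted answer ...
        if self.page_number <= self.last_page:
            next_page = f'https://www.fenom.com/en/263-men#/page-{self.page_number}'
            self.page_number += 1
            # dont_filter=True because the request fingerprint strips the
            # #/page-N fragment, so these URLs look like duplicates to Scrapy
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)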

Regarding "Python spider returns an empty JSON file", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/60305178/
