
python - Store response as file with Scrapy Splash

Reposted. Author: 行者123. Last updated: 2023-12-03 19:02:40

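I'm creating my first scrapy project with Splash, working with the test data from http://quotes.toscrape.com/js/
I want to store the quotes of each page as a separate file on disk (in the code below I first try to store the entire page). The code below worked when I was not using SplashRequest, but with the new code nothing is stored on disk when I "Run and Debug" it in Visual Studio Code.
Also, self.log does not write to my Visual Studio Code Terminal window. I'm new to Splash, so I'm sure I'm missing something, but what?
I have already checked here and here.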

import scrapy
from scrapy_splash import SplashRequest


class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

        #cycle through all available pages
        for a in response.css('ul.pager a'):
            yield SplashRequest(url=a,callback=self.parse,endpoint='render.html',args={ 'wait': 0.5 })

        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
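
One detail in the pager loop: response.css('ul.pager a') yields selector objects rather than URL strings, so each link's href has to be extracted and resolved against the page URL before it can be requested. A minimal sketch of that resolution using only the standard library follows. pager_urls is a hypothetical helper, and the sample href /js/page/2/ is an assumed value, not captured from the site. Inside a spider, response.css('ul.pager a::attr(href)').getall() together with response.urljoin(href) should play the same role.

```python
from urllib.parse import urljoin


def pager_urls(page_url, hrefs):
    """Resolve pager hrefs (possibly relative) against the current page URL."""
    return [urljoin(page_url, href) for href in hrefs]


# Assumed sample href, for illustration only.
urls = pager_urls("http://quotes.toscrape.com/js/", ["/js/page/2/"])
print(urls)
```
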
Update 1
How I debug it:
  • In Visual Studio Code, press F5
  • Select "Python File"

  • The Output tab is empty
    The Terminal tab contains:
    PS C:\scrapy\tutorial>  cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
    PS C:\scrapy\tutorial>
Also, nothing is logged in my Docker container instance, which I believe Splash needs in order to work in the first place.
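
Launching the spider module with "Python File" only defines the classes and then exits, which would explain the empty terminal above. A crawl starts only when Scrapy itself is invoked. As a sketch, a small launcher script could run the same command as scrapy crawl jsscraper. The helper names build_crawl_command and run_crawl are hypothetical, and the project directory is assumed to be whichever folder contains scrapy.cfg.

```python
import subprocess
import sys
from pathlib import Path


def build_crawl_command(spider_name):
    """Return the argv list equivalent to `scrapy crawl <spider_name>`."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_crawl(spider_name, project_dir):
    """Run the crawl from the project root and return the exit code.

    project_dir is assumed to be the folder that holds scrapy.cfg.
    """
    completed = subprocess.run(
        build_crawl_command(spider_name),
        cwd=Path(project_dir),
        check=False,
    )
    return completed.returncode


# Example call (not executed here): run_crawl("jsscraper", "path/to/project")
```
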
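Update 2
I ran scrapy crawl jsscraper and a file "quotes-js.html" was stored on disk. However, it contains the page source HTML without any of the JavaScript having been executed. I want the JS code on "http://quotes.toscrape.com/js/" to be executed and only the quote content to be stored. How can I do that?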
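
If the saved HTML contains no rendered quotes, one thing worth checking is whether the project's settings.py routes requests through Splash at all. The scrapy-splash README describes settings along the following lines. The SPLASH_URL value assumes a Splash container listening on localhost port 8050 and may differ in your setup.

```python
# settings.py (sketch, based on the scrapy-splash README)

# Assumption: Splash runs locally in Docker and is published on port 8050.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicate Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```
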

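Best answer

Write the output to a JSON file:
I tried to solve your problem. Here is a working version of your code. I hope this is what you are trying to achieve.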

import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/page/"+str(i+1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5}
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        page = response.url[response.url.index("page/")+5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))
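
The slice on response.url works for the start URLs used in this spider, which have no trailing slash. As a hedged alternative, the page number can be read from the URL path, so that a trailing slash or a missing page segment does not end up in the file name. page_from_url is a hypothetical helper, and falling back to "1" when no page segment is present is an assumption.

```python
from urllib.parse import urlparse


def page_from_url(url, default="1"):
    """Return the path segment that follows 'page', or default if absent."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "page" in parts:
        idx = parts.index("page")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return default


filename = "quotes-%s.json" % page_from_url("http://quotes.toscrape.com/js/page/3/")
print(filename)
```
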
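Update:
The code above has been updated to scrape all pages and save the results in separate json files, one for each of pages 1 to 10.
This writes the list of quotes from each page to its own json file, like this: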
{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
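
json.dumps escapes non-ASCII characters by default, which is why the curly quotation marks appear as \u201c and \u201d in the output above. If readable characters are preferred in the files, passing ensure_ascii=False and opening the file as UTF-8 is one option. The sketch below uses a one-quote sample taken from that output; dump_quotes is a hypothetical helper name.

```python
import json


def dump_quotes(quotes, path):
    """Write quotes as UTF-8 JSON, keeping non-ASCII characters unescaped."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(quotes, outfile, indent=4, ensure_ascii=False)


sample = {
    "quotes": [
        {
            "author": "Steve Martin",
            "quote": "\u201cA day without sunshine is like, you know, night.\u201d",
        }
    ]
}

# With ensure_ascii=False the curly quotes are emitted as real characters.
text = json.dumps(sample, indent=4, ensure_ascii=False)
print(text)
```
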

Regarding python - Store response as file with Scrapy Splash, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/64350943/
