python - Creating web scraping rules

Reposted by: 行者123 Updated: 2023-11-29 00:16:31

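I am on this page: http://www.metacritic.com/browse/games/title/ps4/a?view=condensed

I want to follow the link for each item and get its developer and genre, but my code doesn't seem to work.

For example, I want to visit this page: http://www.metacritic.com/game/playstation-4/angry-birds-star-wars

then leave it, carry on with the rest of the processing, and add the results to the database. What can I change in my code to make this work? Right now the developer and genre columns in the database are empty, but the rest of the data is there, so it looks as if parseGame is never entered.

Also, I added print statements to parseGame, but none of them print anything.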

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem
import MySQLdb
import re
from string import lowercase

class MetacriticSpider(BaseSpider):
    def start_requests(self):
        #iterate through ps4 pages
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback = self.parseps4)

    #gets the developer and genre of a game
    def parseGame(self, response):

        print("Here")

        item = response.meta['item']

        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        item['dev'] = site.xpath('.//span[contains(@class, "summary_detail developer")]/span[1]/text()').extract()
        item['genre'] = site.xpath('.//span[contains(@class, "summary_detail product_genre")]/span[1]/text()').extract()

        cursor.execute("INSERT INTO ps4 (dev, genre) VALUES (%s,%s)",[item['dev'][0],item['genre'][0]])
        items.append(item)

        print item['dev']
        print item['genre']

    def parseps4(self, response):
        #some local variables
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        #iterates through each site
        for site in sites:
            with db1:
                item = MetacriticItem()

                #sets the item
                item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
                item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
                item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()
                item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()

                #some processing to check if there is a score attached, if there is, it adds it to the database
                if ("tbd" in item['cscore'][0] and "tbd" not in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" not in item['uscore'][0]):
                    cursor.execute("INSERT INTO ps4 (title, criticalscore, userscore, releasedate) VALUES (%s,%s,%s, %s)",[(' '.join(item['title'][0].split())).replace("(PS4)","",1),item['cscore'][0],item['uscore'][0],item['release'][0]])
                    items.append(item)

                itemLink = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href' ).extract()

                req = Request('http://www.metacritic.com' +  itemLink[0], callback = self.parseGame)
                req.meta['item'] = item


Best answer

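There are several problems in the code:

The meta argument should be a dictionary, {'item': item}, and the Request has to be yielded from the callback. The question's code builds the Request but never yields it, so Scrapy never schedules it and parseGame is never called.
HtmlXPathSelector is deprecated; use Selector instead.
I don't think you should run MySQL inserts inside the spider. Use a database pipeline instead:
- Writing items to a MySQL database in Scrapy
You need to take the first element returned by each extract() call and call strip() on it. That way each field holds a string rather than a list, with no leading or trailing spaces and newlines.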
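
The last point can be sketched as a small helper. The name first_stripped is hypothetical and not part of Scrapy. It assumes its input is the list of strings that extract() returns, and it also covers the empty-list case, where extract()[0] would raise an IndexError.

```python
def first_stripped(values, default=None):
    """Return the first string in `values` without surrounding whitespace.

    `values` is assumed to be a list of strings, as extract() returns.
    `default` is returned when the list is empty.
    """
    if not values:
        return default
    return values[0].strip()


# A list shaped like the one extract() returns for a title node.
raw_title = ['\n    Angry Birds Star Wars\n    ']
print(first_stripped(raw_title))
```

Returning a default for an empty list is a design choice made for this sketch; it keeps one missing field from aborting the whole callback.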

Here is the code without the MySQL-related calls:

from string import lowercase

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import Selector

from metacritic.items import MetacriticItem


class MetacriticSpider(BaseSpider):
    name = 'metacritic'
    allowed_domains = ['metacritic.com']

    max_id = 1 # your max_id value goes here!!!

    def start_requests(self):
        # one listing request per letter and page index
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback=self.parseps4)

    def parseGame(self, response):
        # the partially filled item arrives through the request's meta dict
        item = response.meta['item']
        sel = Selector(response)
        site = sel.xpath('//div[@class="product_wrap"]')

        # get additional data!!!

        yield item

    def parseps4(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]')
        for site in sites:
            item = MetacriticItem()
            # take the first extracted string of each field and strip whitespace
            item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()[0].strip()
            item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()[0].strip()
            item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()[0].strip()
            item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()[0].strip()

            link = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href').extract()[0]
            # yield the request so Scrapy schedules it and later calls parseGame
            yield Request('http://www.metacritic.com/' + link, meta={'item': item}, callback=self.parseGame)
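
The detail-page URL in the code is built by string concatenation, which can produce a doubled slash when the href already starts with one. A minimal sketch of an alternative uses urljoin from the standard library. It assumes the extracted href is a site-relative path such as /game/playstation-4/angry-birds-star-wars; game_url and BASE_URL are hypothetical names. The sketch is written for Python 3, while the spider in this post is Python 2 era code.

```python
from urllib.parse import urljoin  # in Python 2 this function lives in the urlparse module

BASE_URL = 'http://www.metacritic.com/'  # site root, as used in the spider


def game_url(href, base=BASE_URL):
    """Resolve an href taken from a listing page against the site root."""
    return urljoin(base, href)


print(game_url('/game/playstation-4/angry-birds-star-wars'))
# http://www.metacritic.com/game/playstation-4/angry-birds-star-wars
```

Inside a callback, passing response.url as the base would resolve the link against the page it was found on.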

This works for me: I can see the items yielded by parseGame() in the console.

First make sure it yields items, then look at the !!! comments and fill in those lines accordingly.

After that, once you see the items in the console, try creating a database pipeline to write the items to MySQL.
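
A rough sketch of what such a pipeline could look like follows. To keep it runnable without a database server it uses sqlite3 from the standard library as a stand-in for MySQL, so it is not the MySQL pipeline itself. The class name SQLitePipeline and the path argument are assumptions made for illustration. The table and column names are taken from the question's INSERT statements, and open_spider, close_spider and process_item are the hook methods Scrapy calls on an item pipeline.

```python
import sqlite3


class SQLitePipeline(object):
    """Item pipeline sketch that writes one row per scraped item.

    sqlite3 stands in for MySQL here; with a MySQL driver the SQL
    placeholders would be %s instead of ?.
    """

    def __init__(self, path=':memory:'):
        # Hypothetical default; a real project would usually read this from settings.
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: connect and make sure the table exists.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS ps4 (title TEXT, criticalscore TEXT, '
            'userscore TEXT, releasedate TEXT, dev TEXT, genre TEXT)'
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; dev and genre may be unset.
        self.conn.execute(
            'INSERT INTO ps4 (title, criticalscore, userscore, releasedate, dev, genre) '
            'VALUES (?, ?, ?, ?, ?, ?)',
            (item['title'], item['cscore'], item['uscore'], item['release'],
             item.get('dev'), item.get('genre')),
        )
        return item
```

A pipeline like this is switched on through the ITEM_PIPELINES setting of the project; the module path to use there depends on where the class is placed.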

This post, python - Creating web scraping rules, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/22792251/
