python - Web scraping rule creation

Reposted. Author: 行者123. Updated: 2023-11-29 00:16:31

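I am on this page: http://www.metacritic.com/browse/games/title/ps4/a?view=condensed

I want to go into each item and get its developer and genre, but my code doesn't seem to work.

For example, I want to go into this page: http://www.metacritic.com/game/playstation-4/angry-birds-star-wars

then leave it, carry on with the remaining items, and add everything to the database. What can I change in my code to make this work? Right now the developer and genre columns in the database are empty, but the rest of the data is there, so it looks as if parseGame is never entered.

Also, I added print statements to parseGame, but none of them print.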

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem
import MySQLdb
import re
from string import lowercase

class MetacriticSpider(BaseSpider):
    def start_requests(self):
        #iterate through ps4 pages
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback = self.parseps4)

    #gets the developer and genre of a game
    def parseGame(self, response):

        print("Here")

        item = response.meta['item']

        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        item['dev'] = site.xpath('.//span[contains(@class, "summary_detail developer")]/span[1]/text()').extract()
        item['genre'] = site.xpath('.//span[contains(@class, "summary_detail product_genre")]/span[1]/text()').extract()

        cursor.execute("INSERT INTO ps4 (dev, genre) VALUES (%s,%s)",[item['dev'][0],item['genre'][0]])
        items.append(item)

        print item['dev']
        print item['genre']

    def parseps4(self, response):
        #some local variables
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        #iterates through each site
        for site in sites:
            with db1:
                item = MetacriticItem()

                #sets the item
                item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
                item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
                item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()
                item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()

                #some processing to check if there is a score attached, if there is, it adds it to the database
                if ("tbd" in item['cscore'][0] and "tbd" not in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" not in item['uscore'][0]):
                    cursor.execute("INSERT INTO ps4 (title, criticalscore, userscore, releasedate) VALUES (%s,%s,%s, %s)",[(' '.join(item['title'][0].split())).replace("(PS4)","",1),item['cscore'][0],item['uscore'][0],item['release'][0]])
                    items.append(item)

                itemLink = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href' ).extract()

                req = Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame)
                req.meta['item'] = item

Best answer

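There are several problems in the code:

  • The meta argument should contain a dictionary, {'item': item}, and the request has to be yielded from the callback. As posted, parseps4 builds the request but never yields or returns it, so Scrapy never schedules it and parseGame is never called.
  • HtmlXPathSelector is deprecated; use Selector instead.
  • I don't think you should run the MySQL inserts inside the spider; use a database pipeline instead.
  • Take the first element of each extract() result and call strip() on it, so that each field holds a string rather than a list, with no leading or trailing spaces or newlines.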
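
The extract()-then-strip() step can be wrapped in a small helper. This is a minimal sketch in plain Python: `first_stripped` is a hypothetical name, not part of Scrapy, and returning a default for an empty result is an assumption about the desired behaviour.

```python
def first_stripped(values, default=None):
    """Return the first extracted string with surrounding whitespace removed.

    `values` is the list that extract() returns. An empty list (the XPath
    matched nothing) yields `default` instead of raising IndexError.
    """
    if not values:
        return default
    return values[0].strip()


# Sample lists standing in for extract() results.
print(first_stripped(['\n   Angry Birds Star Wars\n  ']))
print(first_stripped([]))
```

Unlike a bare extract()[0].strip(), this does not raise when an XPath matches nothing.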

Here is the code without the MySQL-related calls:

from string import lowercase

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import Selector

from metacritic.items import MetacriticItem


class MetacriticSpider(BaseSpider):
    name = 'metacritic'
    allowed_domains = ['metacritic.com']

    max_id = 1  # your max_id value goes here!!!

    def start_requests(self):
        # one listing request per letter and page index
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback=self.parseps4)

    def parseGame(self, response):
        # the item started in parseps4 arrives through the request meta
        item = response.meta['item']
        sel = Selector(response)
        site = sel.xpath('//div[@class="product_wrap"]')

        # get additional data!!!

        yield item

    def parseps4(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]')
        for site in sites:
            item = MetacriticItem()
            # take the first match of each field and strip surrounding whitespace
            item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()[0].strip()
            item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()[0].strip()
            item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()[0].strip()
            item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()[0].strip()

            # the href is assumed to start with '/', as in the question's code
            link = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href').extract()[0]
            # yield the request so Scrapy schedules it; the item travels in meta
            yield Request('http://www.metacritic.com' + link, meta={'item': item}, callback=self.parseGame)

It works for me: I see the items yielded by parseGame() on the console.
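
One detail in the spider code is that the detail-page URL is built by concatenating the host and the extracted href. The sketch below resolves the href against the page it was found on, so a leading slash does not matter. It assumes Python 3 and the standard-library `urllib.parse.urljoin`, which is not what the answer's code uses; `detail_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# URL of a listing page, taken from the question.
listing_url = 'http://www.metacritic.com/browse/games/title/ps4/a?view=condensed'


def detail_url(page_url, href):
    """Resolve an extracted href against the page it was found on."""
    return urljoin(page_url, href)


# The href here is an assumed example of what the @href XPath returns.
print(detail_url(listing_url, '/game/playstation-4/angry-birds-star-wars'))
```

In Python 2, which the spider code targets, the same function lives in the urlparse module.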

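Make sure it yields items first, then look at the !!! comments and fill in those lines accordingly.

After that, once you see the items on the console, try creating a database pipeline to write the items to MySQL.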
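
A rough sketch of such a pipeline is shown here. Scrapy calls `open_spider`, `process_item` and `close_spider` on a pipeline object when they are defined. Everything else is an assumption: the class name, the table layout (column names taken from the question's INSERT statements), and the use of the standard-library `sqlite3` module as a stand-in so the sketch runs without a MySQL server. With MySQLdb, the connection call and the placeholder style (`%s` instead of `?`) would differ.

```python
import sqlite3


class GameStoragePipeline(object):
    """Hypothetical item pipeline that stores each scraped game in a database."""

    def __init__(self, db_path=':memory:'):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open one shared connection.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS ps4 ('
            'title TEXT, criticalscore TEXT, userscore TEXT, '
            'releasedate TEXT, dev TEXT, genre TEXT)'
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            'INSERT INTO ps4 '
            '(title, criticalscore, userscore, releasedate, dev, genre) '
            'VALUES (?, ?, ?, ?, ?, ?)',
            (item.get('title'), item.get('cscore'), item.get('uscore'),
             item.get('release'), item.get('dev'), item.get('genre')),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Stand-alone demonstration: a plain dict with placeholder values
# stands in for a MetacriticItem.
pipeline = GameStoragePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'Example Game', 'cscore': '80',
                       'uscore': 'tbd', 'release': 'Jan 1, 2014',
                       'dev': 'Example Studio', 'genre': 'Action'},
                      spider=None)
rows = pipeline.conn.execute('SELECT title, dev, genre FROM ps4').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a Scrapy project the class would be enabled through the ITEM_PIPELINES setting in settings.py, and the spider itself would then contain no database code.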

Regarding "python - Web scraping rule creation", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22792251/
