
python - Regex that works with BaseSpider causes an error with CrawlSpider


I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I have the following code, which contains a regex targeting a JavaScript item called DataStore.prime that I know definitely exists on the static page I had been parsing with BaseSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=('/Teams',)), follow=True, callback='parse_item')]

    def parse_item(self, response):

        playerdata = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
            + '(\[.*\])' + re.escape(");"), response.body).group(1)

        for player in json.loads(playerdata):
            print player['FirstName'], player['LastName'], player['TeamName'], player['PositionText'], player['PositionLong'] \
                , player['Age'], player['Height'], player['Weight'], player['GameStarted'], player['SubOn'], player['SubOff'] \
                , player['Goals'], player['OwnGoals'], player['Assists'], player['Yellow'], player['SecondYellow'], player['Red'] \
                , player['TotalShots'], player['ShotsOnTarget'], player['ShotsBlocked'], player['TotalPasses'], player['AccuratePasses'] \
                , player['KeyPasses'], player['TotalLongBalls'], player['AccurateLongBalls'], player['TotalThroughBalls'] \
                , player['AccurateThroughBalls'], player['AerialWon'], player['AerialLost'], player['TotalTackles'] \
                , player['Interceptions'], player['Fouls'], player['Offsides'], player['OffsidesWon'], player['TotalClearances'] \
                , player['WasDribbled'], player['Dribbles'], player['WasFouled'], player['Dispossesed'], player['Turnovers'] \
                , player['TotalCrosses'], player['AccurateCrosses']


execute(['scrapy','crawl','goal4'])

When this regex is used as part of a CrawlSpider (as in the example above), the code throws the following error:

Traceback (most recent call last):
  File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
    self._startRunCallbacks(result)
  File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Python27\missing\missing\spiders\mrcrawl2.py", line 26, in parse
    + '(\[.*\])' + re.escape(");"), response.body).group(1)
exceptions.AttributeError: 'NoneType' object has no attribute 'group'

A static page on which I know this example works can be found here:

http://www.whoscored.com/Teams/705/Archive/Israel-Maccabi-Haifa

I am assuming that the error above occurs when Scrapy tries to parse a page on which it does not encounter an instance of DataStore.prime. Could someone tell me:

1) whether this assumption is correct, and 2) how I can work around it. I have tried using try:/except: blocks, but I am not sure how to write the "if there is an error, crawl the next page" logic.

Thanks

Best Answer

The problem comes from chaining the search and group method calls together. If search returns None, then None.group raises an AttributeError.
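For illustration, here is a minimal standalone snippet (unrelated to Scrapy; the pattern and input string are just placeholders) showing why the chained call blows up:

import re

m = re.search(r'\d+', 'no digits here')  # no match, so m is None
print m                                  # prints: None
print m.group(0)                         # AttributeError: 'NoneType' object has no attribute 'group'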

Instead, separate the two method calls and check if match is not None. For example:

def parse_item(self, response):

    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
        + '(\[.*\])' + re.escape(");"), response.body)
    if match is not None:
        playerdata = match.group(1)

        for player in json.loads(playerdata):
            ...
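Putting it together, a sketch of a full callback that simply skips pages where the marker is absent; the self.log call and its message are only an illustration of "move on to the next page", not part of the original answer:

def parse_item(self, response):
    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
        + '(\[.*\])' + re.escape(");"), response.body)
    if match is None:
        # This page has no DataStore.prime block -- note it and let the crawl continue
        self.log("No player data found on %s" % response.url)
        return
    playerdata = match.group(1)
    for player in json.loads(playerdata):
        print player['FirstName'], player['LastName']

Because the callback returns early, the CrawlSpider keeps following links from its rules and only extracts data on pages where the regex actually matches.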

Regarding python - Regex that works with BaseSpider causes an error with CrawlSpider, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25087072/
