
python - Scraping Google Analytics with Scrapy


I have been trying to pull some data from Google Analytics with Scrapy, and even though I'm a complete Python novice I've made some progress. I can now log into Google Analytics through Scrapy, but I need to make an AJAX request to get the data I want. I tried to replicate the browser's HTTP request headers with the code below, but it doesn't seem to work, and my error log shows

too many values to unpack

Can anyone help? I've been at this for two days and feel close, but I'm also quite lost.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json

class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'Email'},
                                          callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'Password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            print("Login Successful!!")
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                           method='POST',
                           headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                                     'Galaxy-Ajax': 'true',
                                     'Origin': 'https://analytics.google.com',
                                     'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                                     'User-Agent': 'My-user-agent',
                                     'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                           callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)

        inspect_response(response, self)
        yield item

Here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login
callback=self.parse_tastypage, dont_filter=True)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__
self.headers = Headers(headers or {}, encoding=encoding)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__
super(Headers, self).__init__(seq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__
self.update(seq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update
super(CaselessDict, self).update(iseq)
File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>
iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 75986,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
'log_count/DEBUG': 6,

Best answer

Your error happens because headers needs to be a dictionary, not a list wrapped around a dictionary:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
         'Galaxy-Ajax': 'true',
         'Origin': 'https://analytics.google.com',
         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
         },
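
For context, a minimal sketch of how the whole call in after_login might look once the headers are passed as a plain dict; the URL, 'My-user-agent' and 'Mytoken' are just the placeholder values from the question:

    # Sketch only: the same request as in the question, with headers as a dict.
    return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                   method='POST',
                   headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                            'Galaxy-Ajax': 'true',
                            'Origin': 'https://analytics.google.com',
                            'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                            'User-Agent': 'My-user-agent',
                            'X-GAFE4-XSRF-TOKEN': 'Mytoken'},
                   callback=self.parse_tastypage, dont_filter=True)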

That fixes your current problem, but you will then get a 411, because you also need to specify a Content-Length. If you add what it is you want to extract, I can show you how to do that. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1)
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed
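
As a rough illustration (not part of the original answer): one way to address the 411 "Length Required" is to send the same form-encoded body the browser sends with this POST, so that a Content-Length is included for it. The payload below is a hypothetical placeholder; the real one would be copied from the browser's network tab:

    # Sketch only: 'key=value' is a stand-in for the real form payload captured
    # from the browser; a non-empty body gets a Content-Length sent with the POST.
    return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                   method='POST',
                   body='key=value',
                   headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'},
                   callback=self.parse_tastypage, dont_filter=True)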

About python - Scraping Google Analytics with Scrapy: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36273347/
