gpt4 book ai didi

python - 将 pycurl 与 gzip 流一起使用时出现错误 "Extra data: line 2 column 1"

转载 作者:太空狗 更新时间:2023-10-29 21:44:07 26 4
gpt4 key购买 nike

感谢阅读。

背景:我正在尝试读取以 JSON 格式返回数据的流式 API 提要,然后将此数据存储到 pymongo 集合。流式 API 需要一个 "Accept-Encoding": "Gzip" header 。

发生了什么: json.loads 上的代码失败并输出 - Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597)(请参阅下面的错误日志)

这不会在解析每个 JSON 对象时发生——它是随机发生的。

我的猜测是我在每个“x”个正确的 JSON 对象之后遇到了一些奇怪的 JSON 对象。

我确实引用了 how to use pycurl if requested data is sometimes gzipped, sometimes not?Encoding error while deserializing a json object from Google但到目前为止未能成功解决此错误。

有人可以帮我解决这个问题吗?

错误日志:注意:下面 JSON 对象的原始转储基本上使用 repr() 方法打印字符串的原始表示,而不解析 CRLF/LF(s)。


'{"id":"tag:search.twitter.com,2005:207958320747782146","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:493653150","link":"http://www.twitter.com/Deathnews_7_24","displayName":"Death News 7/24","postedTime":"2012-02-16T01:30:12.000Z","image":"http://a0.twimg.com/profile_images/1834408513/deathnewstwittersquare_normal.jpg","summary":"Crashes, Murders, Suicides, Accidents, Crime and Naturals Death News From All Around World","links":[{"href":"http://www.facebook.com/DeathNews724","rel":"me"}],"friendsCount":56,"followersCount":14,"listedCount":1,"statusesCount":1029,"twitterTimeZone":null,"utcOffset":null,"preferredUsername":"Deathnews_7_24","languages":["tr"]},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"web","link":"http://twitter.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","body":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958320747782146","summary":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nytimes.com/2012/05/30/boo\xe2\x80\xa6","indices":[52,72],"expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html","url":"http://t.co/WBsNlNtA"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":11,"urls":[{"url":"http://t.co/WBsNlNtA","expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html?_r=1"}]}}\r\n{"id":"tag:search.twitter.com,2005:207958321003638785","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:178760897","link":"http://www.twitter.com/Mobanu","displayName":"Donald Ochs","postedTime":"2010-08-15T16:33:56.000Z","image":"http://a0.twimg.com/profile_images/1493224811/small_mobany_Logo_normal.jpg","summary":"","links":[{"href":"http://www.mobanuweightloss.com","rel":"me"}],"friendsCount":10272,"followersCount":9698,"listedCount":30,"statusesCount":725,"twitterTimeZone":"Mountain Time (US & Canada)","utcOffset":"-25200","preferredUsername":"Mobanu","languages":["en"],"location":{"objectType":"place","displayName":"Crested Butte, Colorado"}},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"twitterfeed","link":"http://twitterfeed.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Mobanu/statuses/207958321003638785","body":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958321003638785","summary":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","link":"http://twitter.com/Mobanu/statuses/207958321003638785","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nyti.ms/KUmmMa","indices":[116,136],"expanded_url":"http://nyti.ms/KUmmMa","url":"http://t.co/mTsQlNQO"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":12,"urls":[{"url":"http://t.co/mTsQlNQO","expanded_url":"http://well.blogs.nytimes.com/2012/05/30/can-exercise-be-bad-for-you/?utm_medium=twitter&utm_source=twitterfeed"}]}}\r\n'
json exception: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597)

header 输出:


HTTP/1.1 200 OK

Content-Type: application/json; charset=UTF-8

Vary: Accept-Encoding

Date: Wed, 30 May 2012 22:14:48 UTC

Connection: close

Transfer-Encoding: chunked

Content-Encoding: gzip

get_stream.py:


#!/usr/bin/env python
import sys
import pycurl
import json
import pymongo

STREAM_URL = "https://stream.test.com:443/accounts/publishers/twitter/streams/track/Dev.json"
AUTH = "userid:passwd"

DB_HOST = "127.0.0.1"
DB_NAME = "stream_test"

class StreamReader:
def __init__(self):
try:
self.count = 0
self.buff = ""
self.mongo = pymongo.Connection(DB_HOST)
self.db = self.mongo[DB_NAME]
self.raw_tweets = self.db["raw_tweets_gnip"]
self.conn = pycurl.Curl()
self.conn.setopt(pycurl.ENCODING, 'gzip')
self.conn.setopt(pycurl.URL, STREAM_URL)
self.conn.setopt(pycurl.USERPWD, AUTH)
self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
self.conn.setopt(pycurl.HEADERFUNCTION, self.header_rcvd)
while True:
self.conn.perform()
except Exception as ex:
print "error ocurred : %s" % str(ex)

def header_rcvd(self, header_data):
print header_data

def on_receive(self, data):
temp_data = data
self.buff += data
if data.endswith("\r\n") and self.buff.strip():
try:
tweet = json.loads(self.buff, encoding = 'UTF-8')
self.buff = ""
if tweet:
try:
self.raw_tweets.insert(tweet)
except Exception as insert_ex:
print "Error inserting tweet: %s" % str(insert_ex)
self.count += 1

if self.count % 10 == 0:
print "inserted "+str(self.count)+" tweets"
except Exception as json_ex:
print "json exception: %s" % str(json_ex)
print repr(temp_data)



stream = StreamReader()

固定代码:


def on_receive(self, data):
self.buff += data
if data.endswith("\r\n") and self.buff.strip():
# NEW: Split the buff at \r\n to get a list of JSON objects and iterate over them
json_obj = self.buff.split("\r\n")
for obj in json_obj:
if len(obj.strip()) > 0:
try:
tweet = json.loads(obj, encoding = 'UTF-8')
except Exception as json_ex:
print "JSON Exception occurred: %s" % str(json_ex)
continue

最佳答案

尝试将转储的字符串粘贴到 jsbeatuifier 中。

你会看到它实际上是两个 json 对象,而不是一个,json.loads 无法处理。

它们之间用\r\n分隔,所以应该很容易拆分。

问题是传递给 on_receivedata 参数如果包含换行符则不一定以 \r\n 结尾。如图所示,它也可以位于字符串中间的某个位置,因此仅查看数据 block 的末尾是不够的。

关于python - 将 pycurl 与 gzip 流一起使用时出现错误 "Extra data: line 2 column 1",我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/10825189/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com