Python and JSON problem

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:55:59
I'm having trouble decoding JSON.

Here is my data.

20110312010116730|{"place":{"country_code":"US","url":"http:\/\/api.twitter.com\/1\/geo\/id\/9fbe124c83c364fe.json","bounding_box":{"type":"Polygon","coordinates":[[[-78.894441,35.03811699],[-78.85501596,35.03811699],[-78.85501596,35.08142904],[-78.894441,35.08142904]]]},"place_type":"neighborhood","name":"Downtown Fayetteville","country":"United States","attributes":{},"id":"9fbe124c83c364fe","full_name":"Downtown Fayetteville, Fayetteville"},"user":{"is_translator":false,"listed_count":9,"statuses_count":3695,"profile_link_color":"9ede14","url":"http:\/\/www.facebook.com\/nicholasd.whitehead","following":null,"verified":false,"profile_sidebar_border_color":"a7ed11","contributors_enabled":false,"profile_use_background_image":true,"friends_count":354,"profile_background_color":"131516","description":" #TEAMDROID #TAURUS #TEAMRATCHET #TEAMFITTEDS !!!! \u2752Single \u2752Taken \u2714SLiCK","profile_background_image_url":"http:\/\/a2.twimg.com\/profile_background_images\/213719493\/lime_green_logo.jpg","created_at":"Thu Jun 18 21:07:16 +0000 2009","protected":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1263451862\/my_shirt_off_normal.jpg","follow_request_sent":null,"time_zone":"Eastern Time (US & Canada)","favourites_count":3,"profile_text_color":"b6e82c","location":"from the 252 to the 910","name":"\u015bl\u00ef\u00e7k \u0148\u00ef\u00e7k","show_all_inline_media":false,"geo_enabled":true,"notifications":null,"profile_sidebar_fill_color":"080808","screen_name":"infamous_SLiCK","id":48490066,"id_str":"48490066","lang":"en","profile_background_tile":true,"utc_offset":-18000,"followers_count":224},"coordinates":{"type":"Point","coordinates":[-78.883968,35.052185]},"text":"i dont even know who Sam & Ronnie is !!","in_reply_to_status_id":null,"truncated":false,"source":"\u003Ca href=\"http:\/\/twidroyd.com\" rel=\"nofollow\"\u003Etwidroyd\u003C\/a\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 
12 06:01:16 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[35.052185,-78.883968]},"contributors":null,"retweeted":false,"id":46450665555378176,"in_reply_to_user_id_str":null,"id_str":"46450665555378176","entities":{"urls":[],"user_mentions":[],"hashtags":[]},"retweet_count":0}

I have more than 200GB of text like this.

Here is my code.

tweets_data = []
tweets_file = open(tweets_data_path, "r").readlines()
for i,line in enumerate(tweets_file):
    if i%2 is 0:
        temp = line.split('|')
        tweet = json.loads(temp[1])
        #tweets_data.append(tweet)

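Here is my problem. I tried to decode the lines, but it failed. At first I thought the number at the start of each line was causing the error, so I tried to separate the number from the JSON data. It still doesn't work, because something unexpected shows up in my list. Like this: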

['20110312015935803', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/a1f2dacd80a51287.json","bounding_box":{"type":"Polygon","coordinates":[[[-73.002796,42.990631],[-72.866051,42.990631],[-72.866051,43.119106],[-73.002796,43.119106]]]},"place_type":"city","name":"Stratton","country":"United States","attributes":{},"id":"a1f2dacd80a51287","full_name":"Stratton, VT"},"user":{"follow_request_sent":null,"show_all_inline_media":false,"geo_enabled":true,"profile_link_color":"546080","url":"http:\\/\\/www.facebook.com\\/br.vivizanatta","following":null,"verified":false,"profile_sidebar_border_color":"bcc7e3","is_translator":false,"listed_count":0,"statuses_count":330,"profile_use_background_image":true,"profile_background_color":"2d313f","description":"Stay up to date with news, photos, videos, blog, bio and more from the brazilian journalist and photographer Vivian Zanatta.","contributors_enabled":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/211883639\\/aspen1_829_49514.jpg","created_at":"Sat Jul 10 15:05:48 +0000 2010","friends_count":79,"protected":false,"profile_image_url":"http:\\/\\/a2.twimg.com\\/profile_images\\/1259071695\\/VIVI_DDH7912_normal.jpg","time_zone":"Eastern Time (US & Canada)","favourites_count":0,"profile_text_color":"537de6","location":"Washington, DC, USA","name":"Vivi Zanatta \\u2714","notifications":null,"profile_sidebar_fill_color":"191e2a","screen_name":"vivizanatta_","id":165082798,"id_str":"165082798","lang":"en","profile_background_tile":false,"utc_offset":-18000,"followers_count":83},"coordinates":{"type":"Point","coordinates":[-72.9053683,43.1134486]},"text":"I\'m at Stratton Mountain Ski Resort (5 Village Lodge Rd, Stratton Mountain) http:\\/\\/4sq.com\\/i3ULvp","in_reply_to_status_id":null,"truncated":false,"source":"\\u003Ca href=\\"http:\\/\\/foursquare.com\\" 
rel=\\"nofollow\\"\\u003Efoursquare\\u003C\\/a\\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[43.1134486,-72.9053683]},"contributors":null,"retweeted":false,"id":46465342800797698,"in_reply_to_user_id_str":null,"id_str":"46465342800797698","entities":{"hashtags":[],"urls":[{"indices":[76,97],"url":"http:\\/\\/4sq.com\\/i3ULvp","expanded_url":null}],"user_mentions":[]},"retweet_count":0}\n']
['\n']

Suddenly ['\n'] appears. I guess that is because the records are separated by two '\n' characters. Anyway, when I use partition, I get this:

('20110312015935977', '|', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/b8b87894eb3d7849.json","bounding_box":{"type":"Polygon","coordinates":[[[-95.542521,29.670631],[-95.492419,29.670631],[-95.492419,29.694855],[-95.542521,29.694855]]]},"place_type":"neighborhood","name":"Braeburn","country":"United States","attributes":{},"id":"b8b87894eb3d7849","full_name":"Braeburn, Houston"},"user":{"profile_link_color":"ed0909","url":null,"following":null,"verified":false,"profile_sidebar_border_color":"f00505","follow_request_sent":null,"show_all_inline_media":true,"geo_enabled":true,"profile_use_background_image":true,"profile_background_color":"61b8c2","description":"#TeamPlaystation #TeamLRG #TeamAquarius and #PvNation .It bring me great pleasure to welcome the real and banish the Fake...","is_translator":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/179334599\\/screwston7jsredc.jpg","listed_count":0,"statuses_count":163,"created_at":"Wed Dec 08 04:04:16 +0000 2010","protected":false,"profile_image_url":"http:\\/\\/a0.twimg.com\\/profile_images\\/1256895503\\/image_normal.jpg","time_zone":"Central America","favourites_count":2,"profile_text_color":"fa0505","location":"Houston, Tx","name":"Craig Irving","contributors_enabled":false,"notifications":null,"profile_sidebar_fill_color":"020303","screen_name":"xxMinion","id":224098461,"id_str":"224098461","lang":"en","profile_background_tile":true,"utc_offset":-21600,"friends_count":36,"followers_count":35},"coordinates":null,"text":"If your White or Mexican #WhoSaidItWasOk to say \\"whats up my nigga\\" and then call your homeboys the word Nigga lol","in_reply_to_status_id":null,"truncated":false,"source":"web","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 
2011","in_reply_to_status_id_str":null,"geo":null,"contributors":null,"retweeted":false,"id":46465343463505920,"in_reply_to_user_id_str":null,"id_str":"46465343463505920","entities":{"urls":[],"user_mentions":[],"hashtags":[{"indices":[25,40],"text":"WhoSaidItWasOk"}]},"retweet_count":0}\n')
('\n', '', '')

So the empty line shows up there too.

Oh, and my data is in gz format. How can I read it in Python without decompressing it first?

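Best answer

If your data contains | characters, split cuts the line too many times and the JSON string gets truncated.

You can use the maxsplit parameter: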

temp = line.split('|',1)

or partition:

temp = line.partition('|')

(use temp[2] in this case, because the separator is returned as well)
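To make the difference concrete, here is a minimal sketch. The sample line is made up for illustration (it is not taken from the question's data); it simply puts a | inside the JSON text so that the three calls behave differently.

```python
import json

# Hypothetical sample line: a timestamp, a '|' separator, then JSON whose
# "text" field itself contains a '|' character.
line = '20110312010116730|{"text": "a|b", "retweet_count": 0}\n'

# Plain split cuts at every '|', so the JSON ends up in several pieces.
parts_all = line.split('|')

# maxsplit=1 cuts only at the first '|', leaving the JSON intact.
parts_once = line.split('|', 1)

# partition also cuts at the first '|' only, but returns the separator too.
head, sep, tail = line.partition('|')

# json.loads tolerates the trailing newline.
tweet = json.loads(parts_once[1])

print(len(parts_all))          # 3
print(parts_once[1] == tail)   # True
print(tweet["text"])           # a|b
```

With the plain split, temp[1] would be only the first fragment of the JSON, which is why json.loads fails on such lines.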

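If you still have problems, consider wrapping the parsing of each line in a try/except block, so that you can narrow the problem down.

Edit: also added protection against empty lines, as a follow-up to your edit.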

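import json

# Iterate over the file object directly so lines are read one at a time
with open(tweets_data_path, "r") as tweets_file:
    for i, line in enumerate(tweets_file):
        if i % 2 == 0:
            # Everything after the first '|' is the JSON payload
            data = line.partition('|')[2]
            try:
                if data:  # skip empty lines and lines without '|'
                    tweet = json.loads(data)
            except ValueError as e:
                print("Cannot parse '{}'".format(data))
                print("Error line {}: {}".format(i + 1, str(e)))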
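The question also asks how to read the gz file without unpacking it, which the answer above does not cover. One possible approach, sketched here under the assumption that the file is gzip-compressed UTF-8 text with one '<timestamp>|<json>' record per non-empty line, is the standard library's gzip.open in text mode, which decompresses while streaming. The helper name iter_tweets and the demo file are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def iter_tweets(path):
    """Yield (line_number, tweet) pairs from a gzip-compressed file.

    Hypothetical helper: assumes UTF-8 text where each non-empty line
    has the form '<timestamp>|<json>'.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            # Keep everything after the first '|' and drop the newline
            data = line.partition('|')[2].strip()
            if not data:
                continue  # blank line, or line without a '|' separator
            try:
                yield lineno, json.loads(data)
            except ValueError as e:
                print("Error line {}: {}".format(lineno, e))


# Small demo on a made-up file with blank lines between records
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('20110312010116730|{"text": "a|b"}\n\n')
    f.write('20110312015935803|{"text": "second"}\n\n')

tweets = [tweet for _, tweet in iter_tweets(demo_path)]
print(len(tweets))
```

Because the file object is iterated line by line rather than loaded with readlines(), only one line is held in memory at a time, which matters for an input of around 200GB.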

Regarding this Python and JSON problem, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42557435/
