I have a file containing several hundred lines of JSON-encoded tweets pulled from python-tweetstreamer. The lines look like this:
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "kasian pak weking :| RT @veNikenD: Kasian kenapa???RT @SaputraJordhy: kasian \u256e(\u256f_\u2570)\u256d RT @veNikenD: Tak ingin lg kudengar kata2 yg tak ......", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [{"indices": [24, 33], "screen_name": "veNikenD", "id": 64910664, "name": "Ve Damayanti", "id_str": "64910664"}, {"indices": [54, 68], "screen_name": "SaputraJordhy", "id": 414675856, "name": "jordhy_ynwa", "id_str": "414675856"}, {"indices": [88, 97], "screen_name": "veNikenD", "id": 64910664, "name": "Ve Damayanti", "id_str": "64910664"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": null, "id_str": "187179645026836481", "in_reply_to_screen_name": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/418679759/Young_Minato_and_Kushina_by_HaNa7.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1953774147/Untitled1_normal.png", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 414675856, "profile_text_color": "1c181c", "followers_count": 46, "protected": false, "location": "", "default_profile_image": false, "listed_count": 0, "utc_offset": 25200, "statuses_count": 409, "description": "never walk alone", "friends_count": 76, "profile_link_color": "0084B4", "profile_image_url": "http://a0.twimg.com/profile_images/1953774147/Untitled1_normal.png", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "414675856", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/418679759/Young_Minato_and_Kushina_by_HaNa7.jpg", "screen_name": "SaputraJordhy", 
"lang": "id", "profile_background_tile": true, "favourites_count": 0, "name": "jordhy_ynwa", "url": null, "created_at": "Thu Nov 17 10:41:05 +0000 2011", "contributors_enabled": false, "time_zone": "Jakarta", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645026836481, "source": "<a href=\"https://embr.in\" rel=\"nofollow\">embr</a>"}
{"favorited": false, "in_reply_to_user_id": 441527150, "contributors": null, "truncated": false, "text": "@akoriko1046 \u5bdd\u308b\u306e\uff1f\u3000\u5f85\u3063\u3066\u50d5\u3082\u884c\u304f\u3088\u2026\u5e03\u56e3\u307e\u3067\u304a\u59eb\u69d8\u62b1\u3063\u3053\u3057\u3066\u3044\u3063\u3066\u3042\u3052\u308b", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179532103598080, "coordinates": null, "in_reply_to_user_id_str": "441527150", "entities": {"user_mentions": [{"indices": [0, 12], "screen_name": "akoriko1046", "id": 441527150, "name": "\u30a2\u30b3\u30ea\u30b3", "id_str": "441527150"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179532103598080", "id_str": "187179645014253568", "in_reply_to_screen_name": "akoriko1046", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/440543906/122004876_org.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1870697453/__________bg_normal.jpg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 513679998, "profile_text_color": "333333", "followers_count": 169, "protected": false, "location": "\u3042\u306a\u305f\u306e\u96a3", "default_profile_image": false, "listed_count": 2, "utc_offset": 32400, "statuses_count": 6024, "description": "\u8584\u685c\u9b3c\u6c96\u7530\u7dcf\u53f8\u306e\u975e\u516c\u5f0fbot\u3067\u3059\u7518\u7518/\u30a8\u30ed\u8a2d\u5b9a\u3000\uff8c\uff6b\uff9b\uff70\u306e\u518d\u306f\u5fc5\u305a\u8aac\u660e\u66f8\u3092\u4e00\u8aad\u4e0b\u3055\u3044http://www.pixiv.net/novel/show.php?id=934499 \u624b\u52d5\u3067\u30d5\u30a9\u30ed\u8fd4\u3057\u3092\u884c\u3063\u3066\u307e\u3059\u3000\u7a00\u306b\u4e2d\u306b\u7ba1\u7406\u4eba\u304c\u3044\u307e\u3059\u3000\u7ba1\u7406\u4eba@akanemam1 18\u7981\u7dcf\u53f8\u2192 @sou_oki_18bot", "friends_count": 166, "profile_link_color": "0084B4", 
"profile_image_url": "http://a0.twimg.com/profile_images/1870697453/__________bg_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "513679998", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/440543906/122004876_org.jpg", "screen_name": "sou_oki_bot", "lang": "ja", "profile_background_tile": false, "favourites_count": 1, "name": "\u7dcf\u53f8(bot)", "url": null, "created_at": "Sat Mar 03 22:36:15 +0000 2012", "contributors_enabled": false, "time_zone": "Irkutsk", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645014253568, "source": "<a href=\"http://twittbot.net/\" rel=\"nofollow\">twittbot.net</a>"}
{"favorited": false, "in_reply_to_user_id": 141448885, "contributors": null, "truncated": false, "text": "@nobuttu3 \u6642\u9593\u304c\u904e\u304e\u308b\u306e\u304c\u7269\u51c4\u304f\u65e9\u3044\u3067\u3059\u3088\u306d\u2026", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179547098234880, "coordinates": null, "in_reply_to_user_id_str": "141448885", "entities": {"user_mentions": [{"indices": [0, 9], "screen_name": "nobuttu3", "id": 141448885, "name": "\u306e\u4ecf \uf8ff \u30bf\u30ab\u30cf\u30b7", "id_str": "141448885"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179547098234880", "id_str": "187179645047799808", "in_reply_to_screen_name": "nobuttu3", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/424878981/tw_hvt_nz.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1223811261/twitter_icon_normal.png", "profile_sidebar_fill_color": "daecf4", "is_translator": false, "id": 97481308, "profile_text_color": "663B12", "followers_count": 436, "protected": false, "location": "\u6771\u4eac\u90fd\u53f0\u6771\u533a", "default_profile_image": false, "listed_count": 20, "utc_offset": 32400, "statuses_count": 63704, "description": "\u591a\u5206PG\u3001\u6642\u3005SE\u307d\u3044\u4ed5\u4e8b\u3092\u3057\u3066\u3044\u307e\u3059\u3002\u30e9\u30ce\u30d9\u597d\u304d\u3001\u97f3\u697d\u597d\u304d(\u7279\u5b9a\u306e\u5206\u91ce\u3067\u3059\u304c)\u3002\u30bd\u30b3\u30bd\u30b3\u306e\u983b\u5ea6\u3067\u79cb\u8449\u539f\u306b\u3044\u305f\u308a\u3082\u3057\u307e\u3059\u3002 ", "friends_count": 896, "profile_link_color": "1F98C7", "profile_image_url": "http://a0.twimg.com/profile_images/1223811261/twitter_icon_normal.png", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "ffffff", "id_str": "97481308", 
"profile_background_image_url": "http://a0.twimg.com/profile_background_images/424878981/tw_hvt_nz.jpg", "screen_name": "xi6", "lang": "ja", "profile_background_tile": false, "favourites_count": 4473, "name": "\u3055\u304f", "url": null, "created_at": "Thu Dec 17 16:55:25 +0000 2009", "contributors_enabled": false, "time_zone": "Tokyo", "profile_sidebar_border_color": "C6E2EE", "default_profile": false, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645047799808, "source": "<a href=\"http://tapbots.com/tweetbot\" rel=\"nofollow\">Tweetbot for iOS</a>"}
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "#ImSingleBecause lolz I'm not. Happily taken by @GarrettBettler <33 I love him, forever :)", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [{"indices": [48, 63], "screen_name": "GarrettBettler", "id": 460816116, "name": "Garrett Bettler", "id_str": "460816116"}], "hashtags": [{"indices": [0, 16], "text": "ImSingleBecause"}], "urls": []}, "in_reply_to_status_id_str": null, "id_str": "187179645039427584", "in_reply_to_screen_name": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/452188318/tanja_beach_2007_001.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1971847266/image_normal.jpg", "profile_sidebar_fill_color": "f6ffd1", "is_translator": false, "id": 461432420, "profile_text_color": "333333", "followers_count": 222, "protected": false, "location": "", "default_profile_image": false, "listed_count": 0, "utc_offset": null, "statuses_count": 2334, "description": "", "friends_count": 192, "profile_link_color": "0099CC", "profile_image_url": "http://a0.twimg.com/profile_images/1971847266/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "FFF04D", "id_str": "461432420", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/452188318/tanja_beach_2007_001.JPG", "screen_name": "LeahOswalt", "lang": "en", "profile_background_tile": false, "favourites_count": 86, "name": "Leah Oswalt", "url": null, "created_at": "Wed Jan 11 20:07:24 +0000 2012", "contributors_enabled": false, "time_zone": null, "profile_sidebar_border_color": "fff8ad", "default_profile": false, "following": null}, 
"place": null, "retweet_count": 0, "geo": null, "id": 187179645039427584, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"}
{"favorited": false, "in_reply_to_user_id": 434884235, "contributors": null, "truncated": false, "text": "@nomimushi_ttk \u3068\u30fc\u3084\u3082\u7d20\u6575\u3060\u3051\u3069\u306e\u307f\u3080\u3057\u306e\u30a2\u30a4\u30b3\u30f3\u5929\u4f7f\u3059\u304e\u3066", "created_at": "Tue Apr 03 14:07:59 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187179241664815105, "coordinates": null, "in_reply_to_user_id_str": "434884235", "entities": {"user_mentions": [{"indices": [0, 14], "screen_name": "nomimushi_ttk", "id": 434884235, "name": "\u306e\u307f\u3080\u3057", "id_str": "434884235"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187179241664815105", "id_str": "187179645026836480", "in_reply_to_screen_name": "nomimushi_ttk", "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/images/themes/theme1/bg.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/2023688835/6222d8ae387e6fb5b220895d2fd2d41a_normal.gif", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 365471550, "profile_text_color": "333333", "followers_count": 308, "protected": false, "location": "\u5b66\u5712\u30a2\u30ea\u30b9\u306b\u518d\u71b1\u306a\u3046", "default_profile_image": false, "listed_count": 17, "utc_offset": 32400, "statuses_count": 25562, "description": "\uff8c\uff9e\uff9a10/\u3046\u305f\u30d7\u30ea/HTF/\u3044\u306c\u307c\u304f\u306a\u3069\u306b\u304a\u71b1/\u5d50\u306e\u5927\u91ce\u304f\u3093\u3059\u304d\uff01\u64ec\u4eba\u5316\u3082\u3050\u3082\u3050/\u30a4\u30ca\u30a4\u30ec/RKRN/pkmn/\uff83\uff86\uff8c\uff9f\uff98/\u4e59\u5973\uff79\uff9e\uff70\u5168\u822c\u3082 [\u30bf\u30ab\u4e38\u3055\u3093\u30e2\u30b0\u30e2\u30b0\u30da\u30c3\u3063\u3066\u3057\u968a\u54e1No.2\uff3c\u526f\u968a\u9577\uff0f]\u3000\u898f\u5236\u57a2\u3010@ao_sanagi_2\u3011\u30a2\u30a4\u30b3\u30f3\u306f\u3068\u30fc\u3084\u304b\u3089\uff01", "friends_count": 284, 
"profile_link_color": "0084B4", "profile_image_url": "http://a0.twimg.com/profile_images/2023688835/6222d8ae387e6fb5b220895d2fd2d41a_normal.gif", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "C0DEED", "id_str": "365471550", "profile_background_image_url": "http://a0.twimg.com/images/themes/theme1/bg.png", "screen_name": "ao_sanagi", "lang": "ja", "profile_background_tile": false, "favourites_count": 1071, "name": "\u8475@\u6284\u82b1\u306e\u5ac1", "url": null, "created_at": "Wed Aug 31 14:07:23 +0000 2011", "contributors_enabled": false, "time_zone": "Tokyo", "profile_sidebar_border_color": "C0DEED", "default_profile": true, "following": null}, "place": null, "retweet_count": 0, "geo": null, "id": 187179645026836480, "source": "<a href=\"http://www.movatwi.jp\" rel=\"nofollow\">\u30e2\u30d0\u30c4\u30a4 / www.movatwi.jp .</a>"}
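My end goal is to count how many times specific words appear in the "text" field across all the tweets. I have tried several different approaches with varying degrees of success, but this is where I am now: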
import fileinput
import json
import sys
import os

line = []

inputfilename = sys.argv[1]

for line in fileinput.input([inputfilename]):
    tweettext = json.loads(line).get('text').split()
    print tweettext
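This loops over the file and splits the text in each line's "text" field into individual words, but it does not build a single list of words. It also has the added problem of failing when it hits an empty line: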
[u'RT', u'@keenakan:', u'kamu', u'tidak', u'perlu', u'memperjuangkan', u'aku.', u'Yang', u'perlu', u'ialah', u'aku', u'dan', u'kamu', u'yang', u'memperjuangkan', u'kita.', u'-@commaditya']
[u'RT', u'@TheRealToxicBoi:', u'#LiesBeforeSex', u"I'll", u'be', u'Gentle!']
[u'@coliriostar', u'Quer', u'GANHAR', u'R$', u'300,00', u'em', u'vale', u'compra?', u'SIGA', u'@eucompronanet', u'e', u'saiba', u'como', u'participar,', u'\xe9', u'simples', u'e', u'r\xe1pido!', u'at\xe9', u'+', u'ci']
Traceback (most recent call last):
  File "newexample.py", line 11, in <module>
    tweettext = json.loads(line).get('text').split()
AttributeError: 'NoneType' object has no attribute 'split'
Can anyone suggest a solution?
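A note on the traceback above: the AttributeError comes from a line that parses as JSON but has no "text" key, so .get('text') returns None. A genuinely blank line would instead make json.loads raise a ValueError. The following is a minimal Python 3 sketch of the two failure modes. The sample lines are made up, and the "delete" record is only an assumed example of a non-tweet message, not something taken from the file:

```python
import json

# Hypothetical sample lines; the "delete" record is an assumed non-tweet message.
samples = [
    '{"text": "hello world"}',
    '{"delete": {"status": {"id": 1}}}',
    '',
]

def classify(line):
    """Return a short label describing what happens when the line is processed."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "invalid JSON"
    text = record.get("text")
    if text is None:
        return "no text field"
    return "ok: %d words" % len(text.split())

results = [classify(line) for line in samples]
print(results)
```

Both cases need to be skipped before calling split().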
Edit:
Following the first comment, I edited the code as follows, as far as I understood it:
import fileinput
import json
import sys
import os

line = []
tw = 0
inputfilename = sys.argv[1]

for line in fileinput.input([inputfilename]):
    line = line.strip()
    if not line: continue
    tweettext = json.loads(line).get('text')
    if not json.loads(line).get('text'):
        continue
    words = tweettext.split()
    print words
    tw += len(words)  # running total across all lines

print "total number of words", tw
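My output looks better, and at least I no longer get the "AttributeError: 'NoneType'". The output now seems to consist of separate dictionaries rather than one big dictionary. My goal, again, is to count how many times each word occurs, and I do not see how to do that unless they are all in one dictionary. Here is a sample of the output at this point: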
[u'L', u'Lawliet', u'(Sweets', u'Addict)', u'+', u'Kenshin', u'Himura', u'(Samurai)', u'+', u'Kyon', u'(Lazy', u'and', u'Carefree', u'Bum)', u'=', u'Sakata', u'Gintoki', u'xD', u'May...', u'http://t.co/LD4E1j1v']
[u'Yay', u'~', u'I', u'have', u'ice~I', u'can', u'reach', u'the', u'ice', u'maker!', u'ch', u'sees', u'gaps', u'in', u'the', u'freezer', u'as', u'a', u'challenge', u'and', u"it's", u'usually', u'full', u'to', u'busting.', u'But', u'not', u'now', u'Haha!']
[u'Hoi']
[u'everyones', u'on', u'twitter.']
total number of words 429023
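I guess I could somehow set up a counter for each word inside the for loop? As you can see, the total word count works fine, because it adds up the number of words on each line, but I do not quite understand how to determine the unique words, for example with: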
len(set(words))
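One way to get the number of distinct words across the whole file, rather than per tweet, is to keep a single set outside the loop and update it with each tweet's words. A minimal Python 3 sketch, assuming the lines have already been read into a list; unique_words and the sample lines are hypothetical:

```python
import json

def unique_words(lines):
    """Collect the distinct whitespace-separated words from every tweet's "text" field."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line).get("text")
        if not text:
            continue  # skip records without usable text
        seen.update(text.split())
    return seen

# Hypothetical input: two tweets, one blank line, one record without "text".
sample = [
    '{"text": "the ice maker"}',
    '',
    '{"text": "reach the ice"}',
    '{"id": 1}',
]
words = unique_words(sample)
print(len(words))
```

Calling len(set(words)) inside the loop would only describe the current tweet, which is why the set has to live outside it.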
Edit:
Here is my final solution:
import fileinput
import json
import sys
import os
from collections import defaultdict

line = []
tw = 0
inputfilename = sys.argv[1]
word_count = defaultdict(int)

for line in fileinput.input([inputfilename]):
    line = line.strip()
    if not line: continue
    tweettext = json.loads(line).get('text')
    if not json.loads(line).get('text'):
        continue
    words = tweettext.split()
    tw += len(words)
    for word in words:
        word_count[word] += 1

print word_count
print "total number of words", tw
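Since the stated goal is the count for specific words, collections.Counter from the standard library is a possible alternative to defaultdict(int): it can be indexed by word and gives 0 for words it has never seen. The sketch below is Python 3 and makes two assumptions that are not in the code above: words are lowercased, and ASCII punctuation is trimmed from both ends, so that "Kasian" and "kasian," count as the same word. count_words and the sample lines are hypothetical:

```python
import json
import string
from collections import Counter

def count_words(lines):
    """Count normalised words across the "text" fields of JSON-encoded lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line).get("text")
        if not text:
            continue  # skip records without usable text
        for word in text.split():
            # Assumption: fold case and trim ASCII punctuation from both ends.
            word = word.lower().strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts

# Hypothetical sample input.
sample = [
    '{"text": "Kasian kenapa??? kasian, kasian"}',
    '{"text": "kenapa"}',
]
counts = count_words(sample)
print(counts["kasian"])       # look up one specific word
print(counts.most_common(1))  # most frequent word with its count
```

Note that trimming punctuation also removes the leading "@" and "#" from mentions and hashtags; if those should stay distinct, the normalisation step would need to be adjusted.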
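Best answer
You seem to be on the right track; you just need to add some error checking, for example:
Check whether the line is empty before loading it as JSON, and strip it first to be sure, e.g.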
line = line.strip()
if not line: continue
Check whether the JSON data actually contains any text:
if not json.loads(line).get('text'):
    continue
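The two checks can also be folded into one helper that parses each line only once, since the snippet above calls json.loads just for the check. A minimal Python 3 sketch; extract_text is a hypothetical name, and treating malformed JSON or non-object values as "no text" is an added assumption rather than part of this answer:

```python
import json

def extract_text(line):
    """Return the tweet text for one input line, or None if there is nothing usable."""
    line = line.strip()
    if not line:
        return None  # blank line
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None  # assumption: silently skip malformed JSON
    if not isinstance(record, dict):
        return None  # assumption: only JSON objects can carry a "text" field
    return record.get("text") or None  # missing, null and empty text all become None

# Hypothetical sample lines.
for raw in ['{"text": "a b"}', '   ', '{"id": 1}']:
    print(repr(extract_text(raw)))
```

The caller can then write text = extract_text(line), skip the line when the result is None, and split the text otherwise.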
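After that, you should loop over the words and probably build a dictionary, e.g.
from collections import defaultdict

word_count = defaultdict(int)
for line in file:
    # get words for this line (see the checks above) and add them to the dict
    for word in words:
        word_count[word] += 1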
Regarding "python - How to read a multi-line JSON file in Python and count the words in a specific field", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10232170/