Here is my code.
The shape of the data:
data_dict.items()
Out[57]:
[('Sympathetic', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('protest', defaultdict(<type 'int'>, {'2011-10-06': 16})),
('occupycanada', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('hating', defaultdict(<type 'int'>, {'2011-10-06': 1})),
('AND', defaultdict(<type 'int'>, {'2011-10-06': 4})),
('c', defaultdict(<type 'int'>, {'2011-10-06': 2})),
...]
data_dict is defined as
data_dict = defaultdict(lambda: defaultdict(int))
I want to build a DataFrame that looks like this:
columns = ['word','date',"number"]
word date number
"Sympathetic" '2011-10-06' 1
"protest" '2011-10-06' 16
'occupycanada' '2011-10-06' 1
'hating' '2011-10-06' 1
'AND' '2011-10-06' 4
'comunity' '2011-10-06' 2
...
I tried to do this with pandas:
import pandas as pd
for d in data_dict:
    for date in data_dict[d]:
        data=[d,date,data_dict[d][date]]
        dat = pd.DataFrame(data, columns = ['word','date',"number"])
        print dat
But when I run this code, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-56-80b3affa34fe> in <module>()
3 for date in data_dict[d]:
4 data=[d,date,data_dict[d][date]]
----> 5 dat = pd.DataFrame(data, columns = ['word','date',"number"])
6 print dat
....
ValueError: Shape of passed values is (1, 3), indices imply (3, 3)
How can I fix this?
Additional code showing how data_dict is built:
from collections import defaultdict
import csv
import re
import sys
def flushPrint(s):
    sys.stdout.write('\r')
    sys.stdout.write('%s' % s)
    sys.stdout.flush()
data_dict = defaultdict(lambda: defaultdict(int))
error_num = 0
line_num = 0
total_num = 0
bigfile = open('D:/Data/ows/ows_sample.txt', 'rb')
chunkSize = 10000000
chunk = bigfile.readlines(chunkSize)
while chunk:
    total_num += len(chunk)
    lines = csv.reader((line.replace('\x00','') for line in chunk), delimiter=',', quotechar='"')
    for i in lines:
        line_num+=1
        if line_num%1000000==0:
            flushPrint(line_num)
        try:
            i[1]= re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+|(?:@[\w_]+)', "", i[1])
            tweets=re.split(r"\W+",i[1])
            date=i[3]
            for word in tweets: # error
                if len(date)==10:
                    data_dict[word][date] += 1
        except Exception, e:
            print e
            error_num+=1
            pass
    chunk = bigfile.readlines(chunkSize)
print line_num, total_num,error_num
Sample data
['"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n',
'121813144174727168,"RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!",,2011-10-06,5,4,"2011-10-06 05:04:51",N;,Anonops_Cop,401240477,en,,0,"<a href=">web</a>"\n',
'121813146137657344,"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester 2011-10-06,5,4,"2011-10-06 05:04:51",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,"<a href=">web</a>"\n',
'121813150000619521,"@Seductivpancake Right! Those guys have a victory condition: regime change. #ows doesn\'t seem to have a goal I can figure out.",2011-10-06,5,4,"2011-10-06 05:04:52",N;,nerdsherpa,95067344,en,Seductivpancake,19695580,"<a href="nofollow">Echofon</a>"\n',
'121813150701072385,"RT @bembel "Occupy Wall Street" als linke Antwort auf die Tea Party? #OccupyWallStreet #OWS",2011-10-06,5,4,"2011-10-06 05:04:52",N;,hamudistan,35862923,en,,0,"<a href="rel="nofollow">Plume\xc2\xa0\xc2\xa0</a>"\n',
'121813163778899968,"#ows White shirt= Brown shirt.",2011-10-06,5,4,"2011-10-06 05:04:56",N;,kl_knox,419580636,en,,0,"<a href=">web</a>"\n',
'121813169999065088,"RT @TheNewDeal: The #NYPD are Out of Control. Is This a Free Country or a Middle-East Dictatorship? #OccupyWallStreet #OWS #p2",2011-10-06,5,4,"2011-10-06 05:04:57",N;,vickycrampton,32151083,en,,0,"<a href=">web</a>"\n',
Best answer
I would do it this way:
# -*- coding: utf-8 -*-
from collections import defaultdict, Counter
import string
import pandas as pd
# prepare translate table, which will remove all punctuations and digits
chars2remove = list(string.punctuation + string.digits)
transl_tab = str.maketrans(dict(zip(chars2remove, list(' ' * len(chars2remove)))))
# replace 'carriage return' and 'new line' characters with spaces
transl_tab[10] = ' '
transl_tab[13] = ' '
def tokenize(s):
    return s.translate(transl_tab).lower().split()
chunksize = 100
fn = r'D:\temp\.data\ows-sample.txt'
#
# read `Day` and `Text` columns from the source CSV file
#
# not-chunked version
#df = pd.read_csv(fn, usecols=['Text','Day'])
# "chunked" version - will prepare a list of "reduced" DFs,
# containing word counts in the form: "{'we': 1, 'stand': 1, 'and': 1}"
dfs = []
for df in pd.read_csv(fn, usecols=['Text','Day'], chunksize=chunksize):
    # group DF by date and count words for each unique day, summing up counters
    dfs.append(df.assign(count=df['Text']
                         .apply(lambda x: Counter(tokenize(x))))
                 .groupby('Day', as_index=False)['count'].sum()
    )
# convert sets of {'word1': count, 'word2': count} into columns
tmp = (pd.concat(dfs, ignore_index=True)
         .set_index('Day')['count']
         .apply(pd.Series)
         .reset_index()
)
tmp['Day'] = pd.to_datetime(tmp['Day'])
# free up memory
del dfs
# transform (melt) columns into desired columns: [Day, word, number]
rslt = (pd.melt(tmp, id_vars='Day', var_name='word', value_name='number')
          .fillna(0)
)
# delete temporary DF from memory
del tmp
# save results as HDF5 file
rslt.to_hdf('d:/temp/.data/twit_words.h5', 'twit_words', mode='a',
            format='t', complib='zlib', complevel=4)
# save results as CSV file
rslt.to_csv('d:/temp/.data/twit_words.csv.gz', index=False,
            encoding='utf_8', compression='gzip')
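As an aside to the script above: its intermediate frame has one column per distinct word, which can get wide on a large corpus. The following is only a sketch of a variant that stays in long format, not the answer's own method. It swaps the Counter and melt steps for `Series.explode` plus a `groupby(...).size()`. The `chunk` frame is a hypothetical in-memory stand-in for one chunk from `pd.read_csv`, and the translate table is rebuilt here so the block runs on its own.

```python
import string

import pandas as pd

# Same idea as the script's translate table: punctuation and digits become spaces.
transl_tab = str.maketrans({ch: ' ' for ch in string.punctuation + string.digits})


def tokenize(s):
    # Lower-case the text and split on whitespace after stripping punctuation/digits.
    return s.translate(transl_tab).lower().split()


# Hypothetical stand-in for one chunk of the CSV (only the two columns used above).
chunk = pd.DataFrame({
    'Day': ['2011-10-06', '2011-10-06', '2011-10-07'],
    'Text': ['We stand and we wait', 'Stand up #OWS', 'we wait'],
})

# One row per (Day, word) occurrence, then count the occurrences per pair.
tokens = chunk.assign(word=chunk['Text'].map(tokenize)).explode('word')
counts = (tokens.groupby(['Day', 'word'])
                .size()
                .reset_index(name='number'))
print(counts)
```

With several chunks, the per-chunk `counts` frames would still need to be concatenated and summed again by `['Day', 'word']`, since the same word can occur on the same day in different chunks. Unlike the melted result, this form only contains pairs that actually occur, so there are no zero-count rows.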
Output of the melt step when tested against this sample data:
In [254]: pd.melt(new, id_vars='Day', var_name='word', value_name='number').fillna(0)
Out[254]:
Day word number
0 2011-11-13 a 4.0
1 2011-11-14 a 9.0
2 2011-11-15 a 92.0
3 2011-11-16 a 111.0
4 2011-11-17 a 93.0
5 2011-11-18 a 141.0
6 2011-11-19 a 77.0
7 2011-11-20 a 58.0
8 2011-11-21 a 29.0
9 2011-11-22 a 70.0
10 2011-11-23 a 55.0
11 2011-11-24 a 49.0
12 2011-11-25 a 41.0
13 2011-11-26 a 67.0
14 2011-11-27 a 27.0
15 2011-11-28 a 34.0
16 2011-11-29 a 23.0
17 2011-11-30 a 33.0
18 2011-12-01 a 26.0
19 2011-12-02 a 32.0
20 2011-12-03 a 46.0
21 2011-12-04 a 29.0
22 2011-12-05 a 22.0
23 2011-12-06 a 60.0
24 2011-12-07 a 32.0
25 2011-12-08 a 33.0
26 2011-12-09 a 16.0
27 2011-11-13 aa 0.0
28 2011-11-14 aa 0.0
29 2011-11-15 aa 0.0
... ... ... ...
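Coming back to the error in the question: the loop passes a flat list of three scalars to `pd.DataFrame`, and pandas reads a flat list as three rows of a single column, which cannot be matched to three column names. The loop also rebuilds the frame on every iteration, so even without the error only the last entry would survive. A minimal sketch of a direct fix, assuming the nested `data_dict` is already in memory (the three entries below are copied from the question as toy data), is to collect one tuple per row and build the frame once:

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for the question's data_dict (values copied from the question).
data_dict = defaultdict(lambda: defaultdict(int))
data_dict['Sympathetic']['2011-10-06'] = 1
data_dict['protest']['2011-10-06'] = 16
data_dict['AND']['2011-10-06'] = 4

# One (word, date, number) tuple per nested entry.
rows = [(word, date, number)
        for word, dates in data_dict.items()
        for date, number in dates.items()]

# Build the frame once, after the loops, from the full list of rows.
dat = pd.DataFrame(rows, columns=['word', 'date', 'number'])
print(dat)
```

This keeps everything in memory, so it fits the case where `data_dict` has already been built; the chunked script above is the better fit when the source file is too large for that.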
This answer comes from a similar question on Stack Overflow, "python - pandas DataFrame ValueError: Shape of passed values is (1, 3), indices imply (3, 3)": https://stackoverflow.com/questions/37152828/