
python - Error when concatenating dask Series into a DataFrame


I have several dask Series (dask.dataframe.core.Series) that I want to combine into a single DataFrame and then write out to a CSV file. How can I do this? I get the error below when I try; please advise...

Data

1,2014-04-07T10:51:09.277Z,214536502,0
1,2014-04-07T10:54:09.868Z,214536500,0
1,2014-04-07T10:54:46.998Z,214536506,0
1,2014-04-07T10:57:00.306Z,214577561,0
2,2014-04-07T13:56:37.614Z,214662742,0
2,2014-04-07T13:57:19.373Z,214662742,0
2,2014-04-07T13:58:37.446Z,214825110,0
2,2014-04-07T13:59:50.710Z,214757390,0
2,2014-04-07T14:00:38.247Z,214757407,0
2,2014-04-07T14:02:36.889Z,214551617,0

Code

import dask
import datetime as dt
clicksdat = dd.read_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks100.dat',
                        names=['Sid','Timestamp','itemid','itemcategory'],
                        dtype={'sid':np.int64,'timestamp':np.object,'itemid':np.object,'itemcategory':np.object})
clicksdat['Timestamp']=clicksdat.Timestamp.apply(pd.to_datetime)
segment = ['EM']*24
segment[7:10] = ['M']*3
segment[10:13] = ['A']*3
segment[13:18] = ['E']*5
segment[18:23] = ['N']*5
segment[23] = 'MN'

maxtemp=clicksdat.groupby('Sid')['Timestamp'].max()
mintemp=clicksdat.groupby('Sid')['Timestamp'].min()
duration=(maxtemp.sub(mintemp).apply(lambda x: x.total_seconds() ))
day=maxtemp.apply(lambda x: x.day )
month=maxtemp.apply(lambda x: x.month)
noofnavigations=[clicksdat.groupby('Sid').count().Timestamp][0]
totalitems=clicksdat.groupby('Sid')['itemid'].nunique()
totalcats=clicksdat.groupby('Sid')['itemcategory'].nunique()
timesegment= maxtemp.apply(lambda x: segment[x.hour])
segmentchange=((maxtemp.apply(lambda x: segment[x.hour])!=mintemp.apply(lambda x: segment[x.hour])))
purchased=(clicksdat['Sid'].unique()).apply(lambda x: x in buyersession)

print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased))
#percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange },index=noofnavigations.index)
percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1)
percentile_list.to_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks1001-727.csv')

Error

(<class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-121-ad7fc3cf8839> in <module>()
25 print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased))
26 #percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange },index=noofnavigations.index)
---> 27 percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1)
28
29 percentile_list.to_csv('C:\Users\TG\Downloads\yoochoose-dataFull\yoochoose-clicks1001-727.csv')

C:\Users\TG\Anaconda3\envs\dato-env\lib\site-packages\dask\dataframe\multi.pyc in concat(dfs, axis, join, interleave_partitions)
576 else:
577 if axis == 1:
--> 578 raise ValueError('Unable to concatenate DataFrame with unknown '
579 'division specifying axis=1')
580 else:

ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1

Best answer

First of all, your code does not run as posted, because it contains some undefined references (dd, np), so I cannot reproduce your problem without investing unnecessary time.
But since I had a similar problem, here is an idea: try setting an index on your dataframe. (In my case, everything worked fine as long as there was a valid index; however, using .drop_duplicates() somehow broke the index or the partitioning, and I got the same error as you.)
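
Below is a minimal sketch of one way around the "unknown division" error, assuming the groupby results from the question (duration, day, month, noofnavigations, totalitems, totalcats, timesegment, segmentchange; purchased is left out because buyersession is not defined in the snippet). Since each of these series has one row per Sid, the aggregated data is usually small enough to compute into pandas and concatenate there, which avoids dask's restriction on axis=1 concatenation entirely. The output file name is just a placeholder.

import pandas as pd

# Each dask Series below comes from a groupby('Sid') aggregation, so
# .compute() returns a pandas Series indexed by Sid; pd.concat then
# aligns them on that shared index and uses the dict keys as column names.
pieces = {
    'duration': duration,
    'day': day,
    'month': month,
    'noofnavigations': noofnavigations,
    'totalitems': totalitems,
    'totalcats': totalcats,
    'timesegment': timesegment,
    'segmentchange': segmentchange,
}
features = pd.concat({name: s.compute() for name, s in pieces.items()}, axis=1)
features.to_csv('session_features.csv')  # placeholder output path

Alternatively, to stay in dask, each series could be turned into a single-column frame with .to_frame() and joined on the shared Sid index, or, following the answer's suggestion, the source dataframe could be given a proper index with set_index('Sid') before the groupbys, which may give the intermediate results known divisions so that dd.concat(..., axis=1) succeeds.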

Regarding python - Error when concatenating dask Series into a DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38627111/
