
python - How do I make a list of lists from a Pandas DataFrame?

Reposted. Author: 行者123. Updated: 2023-11-28 19:33:54

I have a Pandas DataFrame with words and tags:

  words   tags
0 I WW
1 am XX
2 newbie YY
3 . ZZ
4 You WW
5 are XX
6 cool YY
7 . ZZ

Is there a way to build a list of lists from the DataFrame, like this:

[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.','ZZ')], 
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.','ZZ')]]

It is a list of lists of tuples. Each inner list ends at ('.','ZZ'), which marks the end of a sentence.

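I could iterate over every row of the DataFrame, build up a list, and append it whenever the condition is true, but is there a more "pandas" way to solve this?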
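
For reference, the row-by-row approach described above might look like the following. This is only a minimal sketch: the DataFrame is rebuilt from the sample data, the names `current` and `sentences` are illustrative, and keeping trailing words that have no closing separator is an assumption about the desired behavior.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags': ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

sep = ('.', 'ZZ')
sentences = []
current = []

# name=None makes itertuples yield plain tuples instead of namedtuples
for pair in df.itertuples(index=False, name=None):
    current.append(pair)
    if pair == sep:
        # the separator closes the current sentence
        sentences.append(current)
        current = []

# assumption: keep any trailing words that were not closed by a separator
if current:
    sentences.append(current)

print(sentences)
```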

Best answer

If performance matters, you can first build tuples from all the values and then split them into sublists:

from itertools import groupby

L = list(zip(df['words'], df['tags']))
print (L)
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'),
('.', 'ZZ'), ('You', 'WW'), ('are', 'XX'),
('cool', 'YY'), ('.', 'ZZ')]

sep = ('.','ZZ')
new_L = [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
print (new_L)

[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
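
The snippet above assumes `df` already exists. A self-contained version, with the DataFrame rebuilt from the question's sample data, could look like this. The helper name `split_sentences` is hypothetical and only wraps the same `groupby` expression so that its edge-case behavior can be checked.

```python
from itertools import groupby

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags': ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

sep = ('.', 'ZZ')


def split_sentences(pairs, sep):
    """Hypothetical helper wrapping the groupby expression from the answer."""
    # groupby yields runs of consecutive items that share the same key:
    # the key is True for runs of separators and False for the words between them
    return [list(g) + [sep]
            for is_sep, g in groupby(pairs, lambda x: x == sep)
            if not is_sep]


pairs = list(zip(df['words'], df['tags']))
sentences = split_sentences(pairs, sep)
print(sentences)

# Edge case: if the last separator is missing, one is appended anyway
without_last_sep = split_sentences(pairs[:-1], sep)
print(without_last_sep == sentences)
```

Because the separator is re-appended to every non-separator run, this approach gives the same result whether or not the data ends with ('.', 'ZZ'). It also drops sentences that consist only of a separator.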

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

def zero(df):
    dft = df.apply(tuple, 1)
    return ([x.values.tolist() for _, x in dft.groupby((dft == ('.', 'ZZ')).shift().cumsum().bfill())])

In [55]: %timeit ([list(g) + [('.','ZZ')] for k, g in groupby(list(zip(df['words'], df['tags'])), lambda x: x==('.','ZZ')) if not k] )
100 loops, best of 3: 4.14 ms per loop

def pir(df):
    v = df.values
    return ([list(map(tuple, x)) for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)])

In [68]: %timeit (pir(df))
10 loops, best of 3: 21.9 ms per loop


In [56]: %timeit (zero(df))
1 loop, best of 3: 328 ms per loop

In [57]: %timeit (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(lambda group: list(zip(group['words'], group['tags']))).values.tolist())
1 loop, best of 3: 286 ms per loop

In [58]: %timeit (list(filter(None,[i.apply(tuple,1).values.tolist() for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)])))
1 loop, best of 3: 1.31 s per loop
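
The slower pandas-based variants above share one idea: mark the separator rows, shift the mark down by one row, and take a running sum so that every sentence gets its own group id. The sketch below spells that out step by step on the sample data. The names `is_sep` and `sentence_id` are illustrative, and the cast to `int` before shifting is a choice made here to keep the dtype unambiguous, not something taken from the timed one-liners.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags': ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

# 1 on rows that are the sentence separator, 0 elsewhere
is_sep = ((df['words'] == '.') & (df['tags'] == 'ZZ')).astype(int)

# Shift by one row so each separator stays with the sentence it ends,
# then a running sum gives every sentence its own integer id
sentence_id = is_sep.shift(fill_value=0).cumsum()
print(sentence_id.tolist())

result = [
    list(zip(group['words'], group['tags']))
    for _, group in df.groupby(sentence_id)
]
print(result)
```

This is closer to the "pandas" style the question asks about, but as the timings show, building sub-DataFrames per group costs far more than working on a plain list of tuples.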

For the related question I posted about splitting a list into sublists, you can check the solutions here:

def jez_coldspeed(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    for i in L[::-1]:
        if i == ('.','ZZ'):
            L2.append([])

        L2[-1].append(i)

    return [x[::-1] for x in L2[::-1]]

def jez_coldspeed1(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    sep = ('.','ZZ')
    for i in reversed(L):
        if i == sep:
            L2.append([])

        L2[-1].append(i)

    return [x[::-1] for x in reversed(L2)]


In [74]: %timeit (jez_coldspeed(df))
100 loops, best of 3: 2.96 ms per loop

In [75]: %timeit (jez_coldspeed1(df))
100 loops, best of 3: 2.95 ms per loop

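def jez_theBuzzyCoder(df):
    L = list(zip(df['words'], df['tags']))
    a = list()
    start = 0
    sep = ('.', 'ZZ')

    # Note: list.index raises ValueError when sep is missing (it never returns -1),
    # so the != -1 check is always true and the loop relies on the data ending with sep
    while start < len(L) and (L.index(sep, start) != -1):
        end = L.index(sep, start) + 1
        a.append(L[start:end])
        start = end
    return a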


print (jez_theBuzzyCoder(df))

In [81]: %timeit (jez_theBuzzyCoder(df))
100 loops, best of 3: 3.16 ms per loop
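
The `%timeit` lines above only work inside IPython. As a rough sketch, the same kind of comparison can be run as a plain script with the standard-library `timeit` module. The function names `with_groupby` and `with_reverse_loop` are illustrative stand-ins for the `groupby` solution and `jez_coldspeed1`, and the repetition count of 20 is arbitrary. Absolute numbers will differ by machine and library version.

```python
import timeit
from itertools import groupby

import pandas as pd

SEP = ('.', 'ZZ')


def with_groupby(df):
    pairs = list(zip(df['words'], df['tags']))
    return [list(g) + [SEP]
            for is_sep, g in groupby(pairs, lambda x: x == SEP)
            if not is_sep]


def with_reverse_loop(df):
    # Same idea as jez_coldspeed1: walk backwards and open a new
    # sentence at every separator (assumes the data ends with SEP)
    out = []
    for pair in reversed(list(zip(df['words'], df['tags']))):
        if pair == SEP:
            out.append([])
        out[-1].append(pair)
    return [sentence[::-1] for sentence in reversed(out)]


small = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags': ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})
# Same enlargement as in the timings above: 8 rows repeated 1000 times
big = pd.concat([small] * 1000).reset_index(drop=True)

# Check that both variants agree before comparing their speed
assert with_groupby(big) == with_reverse_loop(big)

for fn in (with_groupby, with_reverse_loop):
    seconds = timeit.timeit(lambda: fn(big), number=20) / 20
    print(f'{fn.__name__}: {seconds * 1000:.2f} ms per call')
```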

This post on python - How do I make a list of lists from a Pandas DataFrame? is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/46499582/
