
python - Reading a tab-delimited file with the first column as key and the remaining columns as values

Reposted  Author: 太空狗  Updated: 2023-10-29 20:56:02

I have a tab-delimited file with 1 billion lines (imagine it has 200 columns rather than 3):

abc -0.123  0.6524  0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232

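I want to create a dictionary where the string in the first column is the key and the remaining columns are the values. I have been doing it like this, but it is computationally expensive: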

import io

dictionary = {}

with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))

How else can I get the desired dictionary? Actually, a numpy array would be more appropriate than a list of floats for the values.

Best answer

You can use pandas to load the data into a df, construct a new df in the shape you need, and then call to_dict:

In [99]:

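import io
import pandas as pd

t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
# column 0 supplies the new column names; the remaining columns are the data
df = pd.DataFrame(columns = df[0], data = df.iloc[:,1:].values)
df.to_dict()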
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
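
Note that in the output above each key is paired with a column of the value matrix ('abc' receives -0.123, -0.9808, 0.23123, i.e. the first value of every row), which is easy to miss on a square 3x3 sample. The following is a minimal sketch of a row-wise alternative that also yields numpy arrays as values, as the question asks. It assumes the data fits in memory; `values` and `row_dict` are illustrative names, not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, whitespace-separated.
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""

# index_col=0 turns the first column into the row index (the keys).
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, index_col=0)

# Pair each key with its own row of values as a float64 numpy array.
values = df.to_numpy(dtype=np.float64)
row_dict = {key: row for key, row in zip(df.index, values)}
```

Here `row_dict['abc']` holds the three values from the 'abc' line, and the approach does not depend on the number of rows matching the number of value columns.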

Edit

A more dynamic approach that avoids building a temporary df:

In [121]:

t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', nrows=1).shape[1]
col_len
# read the first col, we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols = list(range(1, col_len)), names = cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

Further update

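Actually you don't need the first read: the number of value columns can be derived implicitly from the number of entries in the first column (this works here because the sample has as many value columns as rows):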

In [128]:

t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols = list(range(1, len(cols)+1)), names = cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
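
For a file on the scale described in the question, reading everything in one call may not be practical. The following is a hedged sketch, not part of the original answer, that reads a tab-delimited source in chunks through the `chunksize` parameter of `pd.read_csv` and fills a dictionary of numpy arrays keyed by the first column. The function name `load_rows`, the default chunk size, and the small tab-separated `sample` are assumptions made for illustration; whether a dictionary with a billion entries fits in memory is a separate matter that this sketch does not address.

```python
import io

import numpy as np
import pandas as pd


def load_rows(source, chunksize=100_000):
    """Build {first column: numpy array of the remaining columns}, chunk by chunk."""
    result = {}
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, sep='\t', header=None, index_col=0,
                         chunksize=chunksize)
    for chunk in reader:
        block = chunk.to_numpy(dtype=np.float64)
        # Each row of the block belongs to the key at the same position.
        for key, row in zip(chunk.index, block):
            result[key] = row
    return result


# Hypothetical tab-separated stand-in for the real file.
sample = ("abc\t-0.123\t0.6524\t0.325\n"
          "foo\t-0.9808\t0.874\t-0.2341\n"
          "bar\t0.23123\t-0.123124\t-0.1232\n")
rows = load_rows(io.StringIO(sample), chunksize=2)
```

In real use, `source` would be the path to the file (for example 'bigfile' from the question) and the chunk size would be tuned to the available memory.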

Regarding "python - Reading a tab-delimited file with the first column as key and the remaining columns as values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29920440/
