
python - How to find the number of columns in a tab-separated file

Reposted. Author: 行者123. Updated: 2023-11-28 17:36:30

I have a tab-separated file with 1 billion lines (imagine 200+ columns instead of 3):

abc -0.123  0.6524  0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232

How can I find the number of columns in a tab-separated file when the number of columns is unknown?

I have tried this:

import io
with io.open('bigfile', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))

And (from @EdChum, Read a tab separated file with first column as key and the rest as values):

import pandas as pd
num_columns = pd.read_csv('bigfile', sep=r'\s+', nrows=1).shape[1]

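How else can I get the number of columns, and which method is the most efficient? (Imagine I suddenly receive a file with an unknown number of columns, say more than 1 million.)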

Best answer

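Some timings on a file with 100000 columns. Counting tabs seems fastest, but the result is off by one (a line with N columns contains N - 1 tabs):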

In [14]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
   ....:
10 loops, best of 3: 88.7 ms per loop

In [15]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
   ....:
100 loops, best of 3: 11.9 ms per loop

%%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ....:
10 loops, best of 3: 133 ms per loop
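The off-by-one in the `count` approach comes from the fact that N fields are separated by N - 1 tabs. A minimal sketch of a corrected helper follows, assuming the first line is representative of the whole file and that no field contains a quoted tab. The name `count_columns` and the demonstration file are made up for illustration and are not part of the original answer.

```python
import os
import tempfile


def count_columns(path, delimiter="\t"):
    """Return the number of fields on the first line of a delimited file."""
    with open(path) as f:
        first_line = next(f, "")  # "" if the file is empty
    if not first_line.strip("\r\n"):
        return 0  # assumption: treat an empty file or empty first line as 0 columns
    # N fields are separated by N - 1 delimiters, hence the + 1.
    return first_line.count(delimiter) + 1


# Small demonstration file, made up for this illustration.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("abc\t-0.123\t0.6524\t0.325\n")
    tmp.write("foo\t-0.9808\t0.874\t-0.2341\n")
    demo_path = tmp.name

n_cols = count_columns(demo_path)
os.remove(demo_path)
print(n_cols)
```

Only the first line is read, so the cost does not depend on how many rows the file has, only on how wide the first row is.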

Actually, using str.translate seems to be the fastest, although you still need to add 1:

In [5]: %%timeit
with open("test.csv") as f:
    n = next(f)
    # Python 2 form: str.translate(None, deletechars) deletes the tabs
    (len(n) - len(n.translate(None, "\t")))
   ...:
100 loops, best of 3: 9.9 ms per loop
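The two-argument `translate(None, "\t")` call above only works on Python 2 `str`. As a hedged sketch of the same length-difference idea on Python 3: `str.translate` takes a single mapping table, while `bytes.translate` still accepts a delete argument. The helper names below are hypothetical, and no claim is made here about which variant is faster.

```python
def tabs_via_str_translate(line):
    """Count tabs in a str by deleting them and comparing lengths."""
    # Mapping a code point to None removes that character.
    return len(line) - len(line.translate({ord("\t"): None}))


def tabs_via_bytes_translate(raw):
    """Same idea on bytes, which keeps the two-argument translate form."""
    return len(raw) - len(raw.translate(None, b"\t"))


line = "abc\t-0.123\t0.6524\t0.325\n"
# Add 1 because N columns are separated by N - 1 tabs.
num_columns_str = tabs_via_str_translate(line) + 1
num_columns_bytes = tabs_via_bytes_translate(line.encode("utf-8")) + 1
print(num_columns_str, num_columns_bytes)
```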

The pandas solution gives me an error:

in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7977)()

StopIteration:

Using readline adds more overhead:

In [19]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
   ....:
10 loops, best of 3: 28.9 ms per loop

In [30]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
   ....:
10 loops, best of 3: 136 ms per loop

Different results using Python 3.4:

In [7]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ...:
10 loops, best of 3: 102 ms per loop

In [8]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
   ...:
100 loops, best of 3: 12.7 ms per loop

In [9]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
   ...:
100 loops, best of 3: 11.5 ms per loop

In [10]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ....:
10 loops, best of 3: 89.9 ms per loop

In [11]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
   ....:
10 loops, best of 3: 92.4 ms per loop

In [13]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
   ....:
10 loops, best of 3: 176 ms per loop
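Since the numbers differ between Python versions, it may be worth re-running the comparison locally. Below is a rough sketch using the standard `timeit` module outside IPython. The synthetic file, its contents, and the function names `via_count`, `via_split` and `via_csv` are assumptions made for illustration; only the 100000-column width is taken from the answer.

```python
import csv
import os
import tempfile
import timeit

N_COLUMNS = 100000  # same width as the test file described in the answer

# Write a single-line synthetic file; the field value is arbitrary.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="") as tmp:
    tmp.write("\t".join(["0.123"] * N_COLUMNS) + "\n")
    path = tmp.name


def via_count():
    with open(path) as f:
        return next(f).count("\t") + 1  # tabs + 1 = columns


def via_split():
    with open(path) as f:
        return len(next(f).split("\t"))


def via_csv():
    with open(path, newline="") as f:
        return len(next(csv.reader(f, delimiter="\t")))


results = {}
for func in (via_count, via_split, via_csv):
    results[func.__name__] = func()
    seconds = timeit.timeit(func, number=10) / 10  # mean over 10 calls
    print("%-10s %d columns, %.2f ms per call"
          % (func.__name__, results[func.__name__], seconds * 1000))

os.remove(path)
```

All three functions should agree on the column count; the timings printed will depend on the machine, the Python version, and the file encoding.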

For more on python - How to find the number of columns in a tab-separated file, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/29922108/
