
python - usecols in pandas read_table results in "list index out of range"

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:50:01

I want to select only 2 columns when parsing some data with pandas.

The help for pd.read_table mentions a usecols option, which seems to be exactly what I want:

usecols : array-like, default None
Return a subset of the columns. All elements in this array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.

Once read, my data appears to have columns numbered 0 to 6:

In [338]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None)[:3]
Out[338]:
                0      1      2  4  5    6
3
WBGene00022277  I   4118  10230  -  .   83
WBGene00022276  I  10412  16842  +  .  230
WBGene00022278  I  17482  26781  -  .  303

But when I try to keep only the index (column 3) and the last column (column 6), I get the following error:

In [339]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-339-279bef505f16> in <module>()
----> 1 pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
644 delim_whitespace=delim_whitespace,
645 as_recarray=as_recarray,
--> 646 warn_bad_lines=warn_bad_lines,
647 error_bad_lines=error_bad_lines,
648 low_memory=low_memory,

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
387 kwds['encoding'] = encoding
388
--> 389 compression = kwds.get('compression')
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
728
729 if dialect_val != provided:
--> 730 conflict_msgs.append((
731 "Conflicting values for '{param}': '{val}' was "
732 "provided, but the dialect specifies '{diaval}'. "

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
921 for arg in _deprecated_args:
922 parser_default = _c_parser_defaults[arg]
--> 923 msg = ("The '{arg}' argument has been deprecated "
924 "and will be removed in a future version."
925 .format(arg=arg))

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1445 cast_type = dtypes
1446
-> 1447 if self.na_filter:
1448 col_na_values, col_na_fvalues = _get_na_values(
1449 c, na_values, na_fvalues)

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
2812 msg = ('Expected %d fields in line %d, saw %d' %
2813 (col_len, row_num + 1, actual_len))
-> 2814 if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
2815 # see gh-13374
2816 reason = ('Error could possibly be due to quotes being '

IndexError: list index out of range

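I have used the usecols option successfully in another case, but there I kept some of the headers from the original file.

What is causing the problem here?

Edit: header=None is apparently not the problem

I can parse a file in a different format without keeping the header, and the usecols option works: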

In [361]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/feature_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", skiprows=2, index_col=0, header=None, usecols=[0, 6])[:3]
Out[361]:
                  6
0
WBGene00022277   72
WBGene00022276  222
WBGene00022278  302

Best answer

It looks like this is related to index_col.

Try setting the index after reading the file:

import pandas as pd

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
# Read only columns 3 and 6, then make column 3 the index by label.
df = pd.read_table(path, header=None, usecols=(3, 6)).set_index(3)[:3]
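For readers without the original file, the same read-then-set-index approach can be tried on in-memory data. This is only a minimal sketch: the SAMPLE rows are a hypothetical stand-in rebuilt from the output shown in the question (gene ID in column 3, count in column 6), and the name counts is purely illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: three tab-separated rows rebuilt
# from the output shown in the question (gene ID in column 3, count in column 6).
SAMPLE = (
    "I\t4118\t10230\tWBGene00022277\t-\t.\t83\n"
    "I\t10412\t16842\tWBGene00022276\t+\t.\t230\n"
    "I\t17482\t26781\tWBGene00022278\t-\t.\t303\n"
)

# Keep only columns 3 and 6; with header=None they keep their original labels.
df = pd.read_table(io.StringIO(SAMPLE), header=None, usecols=[3, 6])

# Set the index afterwards, by label, so no positional index_col is involved.
counts = df.set_index(3)

print(counts)
```

Because set_index(3) refers to the column label rather than a position, it does not depend on how index_col is interpreted together with usecols.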

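Apparently index_col is applied after the columns have been reduced. You select two columns and then ask for the column at position 3 as the index, a position that no longer exists. Giving index_col as a position within the reduced set works: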

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
# index_col=0 here means the first of the two columns kept by usecols.
df = pd.read_table(path, header=None, usecols=(3, 6), index_col=0)[:3]
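The explanation above can be pictured with plain Python lists. This is a simplified model of the suspected behavior, not pandas' actual parser code; the helper pick_index and the variable names are made up for illustration.

```python
# Simplified model only: this is NOT pandas' real code, just the same list logic.
all_columns = list(range(7))  # labels 0..6, as produced with header=None
usecols = (3, 6)

# Step 1: the column list is reduced to the requested subset.
kept = [c for c in all_columns if c in usecols]  # two labels remain: 3 and 6


def pick_index(columns, index_col):
    """Return the label found at position index_col in the reduced list."""
    return columns[index_col]


# Step 2: looking up position 3 in a two-element list fails.
try:
    pick_index(kept, 3)
    error = None
except IndexError as exc:
    error = str(exc)

print(error)                # list index out of range
print(pick_index(kept, 0))  # 3, the label of the gene ID column
```

Under this model, position 0 is the only way to reach label 3 once the list has been reduced, which matches the working call with index_col=0.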

Regarding python - usecols in pandas read_table results in "list index out of range", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45943371/
