
python - Pandas: get string labels back from a factorized DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:14:02

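I factorized a column of my pandas DataFrame, but in doing so I overwrote the original column values.

Is there any way to get the original mapped values back for reference?

Example: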

import pandas as pd

df_test = pd.DataFrame({'col1': pd.Series(['cat','dog','cat','mouse'])})
df_test['col1'] = pd.factorize(df_test['col1'])[0]
df_test


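However, I would like to be able to call the code below again to check what each integer maps to. Is there a way to check the mapping without re-initializing the DataFrame?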

pd.factorize(df_test)[1]

Best answer

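I would suggest a slightly different approach: use the categorical dtype, which stores integer codes internally while keeping the string labels attached to the column: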

In [40]: df_test['col1'] = df_test['col1'].astype('category')

In [41]: df_test
Out[41]:
    col1
0    cat
1    dog
2    cat
3  mouse

In [42]: df_test.dtypes
Out[42]:
col1    category
dtype: object
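To answer the original question directly, the code-to-label mapping can be read off the categorical column at any time. A minimal sketch, assuming the column still holds the original strings when it is converted (the name `code_to_label` is only illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse']})
df_test['col1'] = df_test['col1'].astype('category')

# The labels live on the column itself; the position of a label in
# `categories` is the integer code used for it.
categories = df_test['col1'].cat.categories
code_to_label = dict(enumerate(categories))

print(code_to_label)  # {0: 'cat', 1: 'dog', 2: 'mouse'}
```

Note that `astype('category')` sorts the categories, whereas `pd.factorize` numbers labels in order of first appearance, so the two can assign different codes on other data.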

If you need the numeric codes:

In [44]: df_test['col1'].cat.codes
Out[44]:
0    0
1    1
2    0
3    2
dtype: int8
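If the integer codes really have to replace the column, as in the question, the mapping survives only if the second return value of `pd.factorize` is kept. A rough sketch, assuming the column has no missing values (`factorize` encodes those as -1, which would not index correctly here):

```python
import pandas as pd

df_test = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse']})

# Keep both return values: the integer codes and the Index of unique labels.
codes, uniques = pd.factorize(df_test['col1'])
df_test['col1'] = codes

# uniques[i] is the label for code i, so indexing with the stored codes
# restores the original strings.
labels = uniques.take(df_test['col1'].to_numpy())

print(list(labels))  # ['cat', 'dog', 'cat', 'mouse']
```

Once `uniques` has been discarded, the labels cannot be recovered from the integer column alone.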

Memory usage for a DataFrame with 400K rows:

In [74]: df_test = pd.DataFrame({'col1': pd.Series(['cat','dog','cat','mouse'])})

In [75]: df_test = pd.concat([df_test] * 10**5, ignore_index=True)

In [76]: df_test.shape
Out[76]: (400000, 1)

In [77]: d1 = df_test.copy()

In [78]: d2 = df_test.copy()

In [79]: d1.col1 = pd.factorize(d1.col1)[0]

In [80]: d2.col1 = d2.col1.astype('category')

In [81]: df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 1 columns):
col1 400000 non-null object
dtypes: object(1)
memory usage: 3.1+ MB

In [82]: d1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 1 columns):
col1 400000 non-null int64
dtypes: int64(1)
memory usage: 3.1 MB

In [83]: d2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 1 columns):
col1 400000 non-null category
dtypes: category(1)
memory usage: 390.7 KB # categorical column takes almost 8x less memory
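The `+` in `3.1+ MB` for the object column means `info()` is reporting a lower bound, because the string payloads themselves are not counted. For a fuller comparison, `memory_usage(deep=True)` can be used instead. This is only a sketch; the variable names are illustrative and the exact byte counts depend on the pandas version and platform:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse'] * 10**5})

# deep=True also counts the memory held by the Python string objects.
as_strings = df['col1'].memory_usage(deep=True)
as_int_codes = pd.Series(pd.factorize(df['col1'])[0]).memory_usage(deep=True)
as_category = df['col1'].astype('category').memory_usage(deep=True)

print(as_strings, as_int_codes, as_category)
```

The categorical column should come out smallest, since with only three categories each row needs a single `int8` code.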

This post is based on a similar question found on Stack Overflow, "python - Pandas: get string labels back from a factorized DataFrame": https://stackoverflow.com/questions/42211304/
