Splitting a Dataframe at NaN row

There is already an answer that deals with a relatively simple dataframe, given here.

However, the dataframe I have at hand has multiple columns and a large number of rows. One DataFrame contains three dataframes attached along axis=0 (the bottom end of one is attached to the top of the next). They are separated by a row of NaN values.

How can I create three dataframes out of this one by splitting it along the NaN rows?

This is the DataFrame. I intend to split it into three along the NaN rows.
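
(The screenshot of the DataFrame is not shown here. As a rough, assumed illustration of the layout described above, a comparable frame could be built like this; the column names and values are made up.)

import numpy as np
import pandas as pd

# Illustrative only: three blocks stacked vertically, each pair separated by an all-NaN row.
block = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
nan_row = pd.DataFrame([[np.nan] * len(block.columns)], columns=block.columns)
df = pd.concat([block, nan_row, block, nan_row, block], ignore_index=True)
print(df)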


Top answers

Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.

To do so, you have to test whether all the values of a row are NaN. I don't know if there is such a test built into pandas, but pandas has a test to check whether a Series is full of NaNs. So what you want to do is perform that test on the transpose of your dataframe, so that your "Series" is actually your row:

df["group_no"] = df.isnull().all(axis=1).cumsum()

At that point you can use the same technique from that answer to split the dataframes.

You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.

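Putting that together, a minimal sketch of the full split might look like this (the list name parts is just for illustration; it assumes the group_no column from above):

# Minimal sketch: split on the group_no column created above, then drop the
# all-NaN separator row that starts each group after the first.
parts = [
    g.drop(columns="group_no").dropna(how="all")
    for _, g in df.groupby("group_no")
]

Each element of parts then corresponds to one of the three original blocks.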



Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that this relies on pip install python-rle for run-length encoding:

import rle
import numpy as np
import pandas as pd


def nanchucks(df):
    # It chucks NaNs outta dataframes

    # True if the row contains any NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)

    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1

    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]

    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))

    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]

Examples:


sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})

sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})

print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]

print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]


Improving on the other answer, with support for multiple rows of NaNs:


from IPython.display import display
import numpy as np
import pandas as pd


def split_df_if_row_full_nans(df, reset_header=False):
    # grouping
    df = (df
          .assign(_nan_all_cols=df.isnull().all(axis=1))
          .assign(_group_no=lambda df_: df_._nan_all_cols.cumsum())
          .query('_nan_all_cols == False')      # Drop rows where _nan_all_cols is True
          .drop(columns=['_nan_all_cols'])      # Drop the _nan_all_cols column
          .reset_index(drop=True)
          )

    # splitting
    dfs = {df.iloc[rows[0], 0]: (df
                                 .iloc[rows]
                                 .drop(columns=['_group_no'])
                                 )
           for _, rows in df.groupby('_group_no').groups.items()}

    if reset_header:
        # rename columns and set index
        for k, v in dfs.items():
            dfs[k] = (v
                      .rename(columns=v.iloc[0])
                      .drop(index=v.index[0])
                      )
            # TODO: this part seems to only work if the length of the df is > 1
            # dfs[k].set_index(dfs[k].columns[0], drop=True, inplace=True)

    # # display
    # for df in dfs.values():
    #     display(df)

    return dfs

sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})

sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})

for df in split_df_if_row_full_nans(sample_df1, reset_header=True).values():
    display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 3.0 3.0 3.0
# 3 4 4 4

for df in split_df_if_row_full_nans(sample_df2, reset_header=True).values():
    display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 2 NaN 3 NaN
# 3 3 NaN 3
# 4 4 4 4

NOTE: This approach uses .isnull().all(axis=1), i.e. it only splits on rows where all values are NaN.
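
If you instead want to split on rows that contain any NaN (the behaviour shown by the rle-based answer above), a small variation is to swap the test. This is just a sketch under that assumption, not part of the original answer:

import numpy as np
import pandas as pd

# Sketch: group on rows containing *any* NaN rather than rows that are entirely NaN.
df = pd.DataFrame({"a": [1, 2, np.nan, 3], "b": [1, 2, 3, 4]})
groups = df.isnull().any(axis=1).cumsum()

parts = []
for _, g in df.groupby(groups):
    g = g.dropna()          # drop the separator row(s) inside each group
    if not g.empty:
        parts.append(g)
print(parts)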


More replies

No need to transpose, just do df.isnull().all(axis=1).cumsum().

Ah, silly me! I was looking if isnull takes an axis parameter and didn't check all :-)
