Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.
就像在你链接的答案中一样,你想创建一个列来标识组号。然后,您可以使用相同的解决方案。
To do so, you have to make a test for all the values of a row to be NaN
. I don't know if there is such a test builtin in pandas
, but pandas
has a test to check if a Series is full of NaN
. So what you want to do is to perform that on the transpose of your dataframe, so that your "Series
" is actually your row:
要做到这一点,您必须测试一行的所有值是否为NaN。我不知道熊猫体内是否有这样的测试成分,但熊猫有一个测试来检查系列中是否充满了NaN。因此,您要做的是在数据帧的转置上执行该操作,这样您的“系列”实际上就是您的行:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
在这一点上,您可以使用与答案相同的技术来拆分数据帧。
You might want to do a .dropna()
at the end, because you will still have the NaN
rows in your result.
您可能希望在最后执行.dropna(),因为您的结果中仍然有NaN行。
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs, caveat is this relies on pip install python-rle
for run-length encoding:
在2022年也遇到了同样的问题。以下是我使用NAN在各行上拆分数据帧的方法,需要注意的是,这依赖于PIP安装的python-rle来进行运行长度编码:
import rle
def nanchucks(df):
# It chucks NaNs outta dataframes
# True if whole row is NaN
df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
values, counts = rle.encode(df_nans)
df_nans = pd.DataFrame({"values": values, "counts": counts})
df_nans["cum_counts"] = df_nans["counts"].cumsum()
df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
df_nans.loc[0, "start_idx"] = 0
df_nans["start_idx"] = df_nans["start_idx"].astype(int) # np.nan makes it a float column
df_nans["end_idx"] = df_nans["cum_counts"] - 1
# Only keep the chunks of data w/o NaNs
df_nans = df_nans[df_nans["values"] == False]
indices = []
for idx, row in df_nans.iterrows():
indices.append((row["start_idx"], row["end_idx"]))
return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
例如:
sample_df1 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, np.nan, 3, 4],
"c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, 3, np.nan, 4],
"c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
Improving other answer that support multiple rows with NaNs
:
改进支持使用NAS的多行的其他答案:
from IPython.display import display
import pandas as pd
def split_df_if_row_full_nans(df, reset_header=False):
# grouping
df = (df
.assign(_nan_all_cols=df.isnull().all(axis=1))
.assign(_group_no=lambda df_: df_._nan_all_cols.cumsum())
.query('_nan_all_cols == False') # Drop rows where _nan_all_cols is True
.drop(columns=['_nan_all_cols']) # Drop the _nan_all_cols column
.reset_index(drop=True)
)
# splitting
dfs = {df.iloc[rows[0],0]: (df
.iloc[rows]
.drop(columns=['_group_no'])
)
for _, rows in df.groupby('_group_no').groups.items()}
if reset_header:
# rename column and set index
for k, v in dfs.items():
dfs[k] = (v
.rename(columns=v.iloc[0])
.drop(index=v.index[0])
)
# TODO: this part seems to only works if length of the df is > 1
# dfs[k].set_index(dfs[k].columns[0], drop=True, inplace=True)
# # display
# for df in dfs.values():
# display(df)
return dfs
sample_df1 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, np.nan, 3, 4],
"c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, 3, np.nan, 4],
"c": [1, 2, np.nan, 3, 4],
})
for df in split_df_if_row_full_nans(sample_df1).values():
display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 3.0 3.0 3.0
# 3 4 4 4
for df in split_df_if_row_full_nans(sample_df2).values():
display(df)
# 1.0 1.0 1.0
# 1 2 2 2
# 2 NaN 3 NaN
# 3 3 NaN 3
# 4 4 4 4
NOTE: This approach use .isnull().all(axis=1)
, that is only split if all value is NaN
.
注意:此方法使用.isull().all(轴=1),即仅当所有值为NaN时才拆分。
更多回答
No need to transpose, just do df.isnull().all(axis=1).cumsum()
.
不需要转置,只需执行df.isnull().all(axis=1).cumsum()。
Ah, silly me! I was looking if isnull
takes an axis
parameter and didn't check all
:-)
啊,我太傻了!我正在查看isull是否接受了轴参数,但没有全部选中:-)
我是一名优秀的程序员,十分优秀!