
python - Add multiple bool columns to a DataFrame based on parsed text

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:16:44

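I am trying to expand a selected column by parsing it on a "splitter": each substring becomes a new column header, and each row is marked True in a new column if that substring appears in the row's original text.

My problem is that the code takes too long to run, and I am hoping for some more efficient options.

The DataFrame I am working with has roughly 12,700 rows and roughly 3,500 columns.

The code is as follows: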

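import pandas as pd

def expand_df_col(df, col_name, splitter):
    # Collect every distinct substring found in the column.
    series = set(df[col_name].dropna())
    new_columns = set()
    for values in series:
        new_columns = new_columns.union(set(values.split(splitter)))

    # Add one empty column per substring.
    # list() keeps this working on pandas versions that reject a set here.
    df = pd.concat([df, pd.DataFrame(columns=list(new_columns))], axis=1)

    # Mark each row True for every substring it contains.
    # This row-by-row .loc assignment is the slow part.
    for row in range(len(df)):
        for text in str(df.loc[row, col_name]).split(splitter):
            if text != "Not applicable":
                df.loc[row, text] = True

    return df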

For example:

                      Test 1              Test 2
0             Will this work  Is this even legit
1         Maybe it will work                nope
2  It probably will not work                nope

should become:

                      Test 1              Test 2   not    It    it  will  \
0             Will this work  Is this even legit   NaN   NaN   NaN   NaN
1         Maybe it will work                nope   NaN   NaN  True  True
2  It probably will not work                nope  True  True   NaN  True

   Maybe  Will  this  work  probably
0    NaN  True  True  True       NaN
1   True   NaN   NaN  True       NaN
2    NaN   NaN   NaN  True      True

The reply from @Ted Petrou got me almost there, but not quite:

def expand_df_col_test(df, col_name, splitter):
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=col_name, var_name='count')

    df_temp = pd.pivot_table(df_melt, index=col_name, columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

    df_temp = df_temp.reindex(df.index)

    return df_temp

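For the test df this returns:

value                         It  Maybe   Will     it    not  probably   this  \
Test 1
Will this work             False  False   True  False  False     False   True
Maybe it will work         False   True  False   True  False     False  False
It probably will not work   True  False  False  False   True      True  False

value                       will  work
Test 1
Will this work             False  True
Maybe it will work          True  True
It probably will not work   True  True

As a follow-up, I made an edit. The function works for the simple example, but returns the original column that needed parsing and expanding (if the code after pd.pivot_table() is present), or returns an empty DataFrame if only the pd.pivot_table() part is run.

I can't for the life of me figure it out (I spent a whole day tinkering and reading up on the various functions involved).

Again, I have around 12K rows and 1-3K columns; I am not sure if or how this affects the output.

Current function: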

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((df, df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt,
                              index=list(df.columns),
                              columns=df_melt['value'],
                              values=df_melt['count'],
                              aggfunc=lambda x: True,
                              fill_value=np.nan).reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    return df_pivot

I thought I had found the solution, but it did not reindex correctly.

This function now works on a subset, but I keep getting ValueError: cannot reindex from a duplicate axis

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    sub_df = pd.concat([df[col_name], df[reindex_col]], axis=1)

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt,
                              index=list(sub_df.columns),
                              columns='value',
                              values='count',
                              aggfunc=lambda x: True,
                              fill_value=np.nan)

    print("pivot")
    print(df_pivot)
    print("NEXT RESET INDEX WITH REINDEX COL")
    print(df_pivot.reset_index(reindex_col))
    print("NEXT REINDEX")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]))
    print("NEXT RESET INDEX()")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index())

    df_pivot = df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    df_final = pd.concat([df, df_pivot.drop([col_name, reindex_col], axis=1)], axis=1)

    return df_final

Best answer

Updated answer #2

import numpy as np
import pandas as pd

# Expand every column of df in turn and collect the results side by side.
df_list = [df]
for col_name in df.columns:
    splitter = ' '
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
    df_melt = pd.melt(df_split, id_vars=[col_name], var_name='count')
    df_list.append(pd.pivot_table(df_melt,
                                  index=[col_name],
                                  columns='value',
                                  values='count',
                                  aggfunc=lambda x: True,
                                  fill_value=np.nan).reindex(df[col_name]).reset_index(drop=True))
df_final = pd.concat(df_list, axis=1)

                      Test 1              Test 2    It  Maybe  Will    it  \
0             Will this work  Is this even legit   NaN    NaN  True   NaN
1         Maybe it will work                nope   NaN   True   NaN  True
2  It probably will not work                nope  True    NaN   NaN   NaN

    not  probably  this  will  work    Is  even  legit  nope  this
0   NaN       NaN  True   NaN  True  True  True   True   NaN  True
1   NaN       NaN   NaN  True  True   NaN   NaN    NaN  True   NaN
2  True      True   NaN  True  True   NaN   NaN    NaN  True   NaN
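
Note that the output above ends up with two columns labelled this, one from each source column. Below is a minimal sketch of one way to keep them apart, assuming that a prefix built from the source column name is acceptable. The frames flags_1 and flags_2 are hypothetical stand-ins for the per-column pivot results and are not part of the original answer.

```python
import pandas as pd

# Hypothetical per-column flag tables (stand-ins for the pivot results).
flags_1 = pd.DataFrame({"this": [True, False], "work": [True, True]})
flags_2 = pd.DataFrame({"this": [True, False], "nope": [False, True]})

# Plain concatenation keeps both 'this' columns under the same label.
combined = pd.concat([flags_1, flags_2], axis=1)
has_duplicates = combined.columns.duplicated().any()

# Prefixing each table with its source column name makes the labels unique.
prefixed = pd.concat(
    [flags_1.add_prefix("Test 1: "), flags_2.add_prefix("Test 2: ")],
    axis=1,
)
```

With unique labels, selecting prefixed["Test 1: this"] returns a single Series, whereas combined["this"] returns a two-column DataFrame.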

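Updated answer

It appears the only difference between this answer and the previous one is that you want to keep an extra column, Test 2. The following will accomplish this: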

splitter = ' '
df_split = pd.concat((df, df['Test 1'].str.split(splitter, expand=True)), axis=1)
df_melt = pd.melt(df_split, id_vars=['Test 1', 'Test 2'], var_name='count')
df_pivot = pd.pivot_table(df_melt,
                          index=['Test 1', 'Test 2'],
                          columns='value',
                          values='count',
                          aggfunc=lambda x: True,
                          fill_value=np.nan)\
             .reset_index('Test 2')\
             .reindex(df['Test 1'])\
             .reset_index()

df_pivot.columns.name = ''

                      Test 1              Test 2    It  Maybe  Will    it  \
0             Will this work  Is this even legit   NaN    NaN  True   NaN
1         Maybe it will work                nope   NaN   True   NaN  True
2  It probably will not work                nope  True    NaN   NaN   NaN

    not  probably  this  will  work
0   NaN       NaN  True   NaN  True
1   NaN       NaN   NaN  True  True
2  True      True   NaN  True  True
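
The .reindex(df['Test 1']) step only works while the text values left in the index are unique. If the same text appears in more than one row with a different Test 2 value, the index contains repeats after reset_index('Test 2'), which is a likely cause of the ValueError reported in the question. The sketch below reproduces this on a small hypothetical lookup table (the names lookup and wanted are illustrative only) and shows one possible workaround: drop the repeated index entries before reindexing. This is safe only for the token flag columns, since they depend on the text alone; other columns such as Test 2 should be taken from the original df.

```python
import pandas as pd

# Hypothetical flag table indexed by text; 'a b' appears twice.
lookup = pd.DataFrame(
    {"a": [True, True, False], "b": [True, True, False], "c": [False, False, True]},
    index=pd.Index(["a b", "a b", "c"], name="Test 1"),
)
# The order of text values we want the flags aligned to.
wanted = pd.Series(["c", "a b", "a b", "c"], name="Test 1")

# Reindexing from an index with repeated labels raises ValueError.
try:
    lookup.reindex(wanted)
    raised = False
except ValueError:
    raised = True

# Keep the first entry per text, then align to the wanted order.
deduped = lookup[~lookup.index.duplicated(keep="first")]
aligned = deduped.reindex(wanted).reset_index(drop=True)
```

A quick check such as df['Test 1'].duplicated().any() on the real data would show whether this situation applies.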

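Old answer

You need to provide a sample DataFrame with sample results to get a better and faster answer. Here is a shot in the dark. I will first provide a sample DataFrame with some fake data and then attempt to provide a solution.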

# create fake data
df = pd.DataFrame({'col1':['here is some text', 'some more text', 'finally some different text']})

Output of df

                          col1
0            here is some text
1               some more text
2  finally some different text

Split each value in col1 by the splitter (here it will be a space)

col_name = 'col1'
splitter = ' '
df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

Output of df_split

                          col1        0     1          2     3
0            here is some text     here    is       some  text
1               some more text     some  more       text  None
2  finally some different text  finally  some  different  text

Put all the splits into one column

df_melt = pd.melt(df_split, id_vars='col1', var_name='count')

Output of df_melt

                           col1  count      value
0             here is some text      0       here
1                some more text      0       some
2   finally some different text      0    finally
3             here is some text      1         is
4                some more text      1       more
5   finally some different text      1       some
6             here is some text      2       some
7                some more text      2       text
8   finally some different text      2  different
9             here is some text      3       text
10               some more text      3       None
11  finally some different text      3       text

Finally, pivot the above DataFrame so that the columns are the split words

pd.pivot_table(df_melt, index='col1', columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

Output

value                        different  finally   here     is   more  some  text
col1
finally some different text       True     True  False  False  False  True  True
here is some text                False    False   True   True  False  True  True
some more text                   False    False  False  False   True  True  True
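
The split, melt and pivot steps above can also be approximated in a single call with pandas' Series.str.get_dummies, which builds one indicator column per token and keeps the original row order, so no reindexing is needed. This is a different technique from the one in the answer, shown here only as a sketch; the function name expand_col_flags is hypothetical, and it assumes that no token collides with an existing column name.

```python
import pandas as pd

def expand_col_flags(df, col_name, splitter):
    """Append one boolean column per distinct token found in df[col_name]."""
    # str.get_dummies splits each value on the separator and returns a
    # 0/1 indicator column per token, aligned with the original rows.
    flags = df[col_name].str.get_dummies(sep=splitter).astype(bool)
    return pd.concat([df, flags], axis=1)

# Same fake data as in the answer above.
df = pd.DataFrame({'col1': ['here is some text',
                            'some more text',
                            'finally some different text']})
result = expand_col_flags(df, 'col1', ' ')
```

Because the work is done column-wise rather than one cell at a time, this is likely to scale better than the row-by-row loop in the question, though that has not been measured here.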

Regarding python - Add multiple bool columns to a DataFrame based on parsed text, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41533822/
