python - Add multiple bool columns to a DataFrame based on parsed text

Reposted by: 太空宇宙 | Updated: 2023-11-04 05:16:44

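I am trying to parse a selected column on a "splitter", add each resulting substring as a column header, and then mark each row as True in a new column whenever that substring occurs in the row's original text.

My problem is that the code takes too long to run, and I am hoping for some more efficient options.

The DataFrame I am working with has about 12,700 rows and about 3,500 columns.

The code is as follows: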

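import pandas as pd

def expand_df_col(df, col_name, splitter):

    # Collect every distinct (non-null) value of the column
    series = set(df[col_name].dropna())

    new_columns = set()

    # Gather every substring produced by the splitter
    for values in series:
        new_columns = new_columns.union(set(values.split(splitter)))

    # Append one empty column per substring (sorted for a stable column order)
    df = pd.concat([df, pd.DataFrame(columns=sorted(new_columns))], axis=1)

    # Slow part: one .loc assignment per row and per substring
    for row in range(len(df)):
        for text in str(df.loc[row, col_name]).split(splitter):
            if text != "Not applicable":
                df.loc[row, text] = True

    return df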

For example:

                      Test 1              Test 2  
0             Will this work  Is this even legit  
1         Maybe it will work                nope  
2  It probably will not work                nope

should become:

                      Test 1              Test 2   not    It    it  will  \
0             Will this work  Is this even legit   NaN   NaN   NaN   NaN   
1         Maybe it will work                nope   NaN   NaN  True  True   
2  It probably will not work                nope  True  True   NaN  True   

    Maybe  Will  this  work probably  
0   NaN  True  True  True      NaN  
1  True   NaN   NaN  True      NaN  
2   NaN   NaN   NaN  True     True

The reply from @Ted Petrou almost got me there, but not quite:

def expand_df_col_test(df, col_name, splitter):
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=col_name, var_name='count')

    df_temp = pd.pivot_table(df_melt, index=col_name, columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

    df_temp = df_temp.reindex(df.index)

    return df_temp

For the test df, this returns:

value                         It  Maybe   Will     it    not probably   this  \
Test 1                                                                         
Will this work             False  False   True  False  False    False   True   
Maybe it will work         False   True  False   True  False    False  False   
It probably will not work   True  False  False  False   True     True  False   

value                       will  work  
Test 1                                  
Will this work             False  True  
Maybe it will work          True  True  
It probably will not work   True  True

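As a follow-up, I have made an edit. The function works on the simple example, but otherwise it returns only the original column that was supposed to be parsed and expanded (when the code after pd.pivot_table() is included), or an empty DataFrame (when only the pd.pivot_table() part is run).

I cannot for the life of me figure it out (I have spent a whole day tinkering and reading about the various functions involved).

Again, I have about 12K rows and 1-3K columns; I am not sure whether or how this affects the output.

Current function: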

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((df, df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt, 
                 index=list(df.columns), 
                 columns=df_melt['value'], 
                 values=df_melt['count'], 
                 aggfunc=lambda x: True, 
                 fill_value= np.nan).reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    return df_pivot

I thought I had found the solution, but the reindexing was not done correctly.

The function now works on a subset, but I keep getting ValueError: cannot reindex from a duplicate axis

def expand_df_col_test(df, col_name, splitter, reindex_col):

    import numpy as np

    sub_df = pd.concat([df[col_name],df[reindex_col]], axis=1)

    replacements = list(pd.Series(df.columns).astype(str) + "_" + col_name)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).replace(list(df.columns), replacements, regex=True).str.split(splitter, expand=True)), axis=1)

    df_split = pd.concat((sub_df, sub_df[col_name].astype(str).str.split(splitter, expand=True)), axis=1)

    df_melt = pd.melt(df_split, id_vars=list(sub_df.columns), var_name='count')

    df_pivot = pd.pivot_table(df_melt,
                     index=list(sub_df.columns),
                     columns='value',
                     values='count',
                     aggfunc=lambda x: True,
                     fill_value= np.nan)

    print("pivot")
    print(df_pivot)
    print("NEXT RESET INDEX WITH REINDEX COL")
    print(df_pivot.reset_index(reindex_col))
    print("NEXT REINDEX")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]))
    print("NEXT RESET INDEX()")
    print(df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index())


    df_pivot = df_pivot.reset_index(reindex_col).reindex(df[col_name]).reset_index()

    df_pivot.columns.name = ''

    df_final = pd.concat([df,df_pivot.drop([col_name, reindex_col], axis=1)], axis = 1)

    return df_final

Best answer

Updated answer #2

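import numpy as np
import pandas as pd

# df is the example DataFrame from the question (columns 'Test 1' and 'Test 2')
df_list = [df]
for col_name in df.columns:
    splitter = ' '
    # One column per split position, next to the original text
    df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)
    # Stack all split positions into a single 'value' column
    df_melt = pd.melt(df_split, id_vars=[col_name], var_name='count')
    # One True/NaN column per word, realigned to the original row order
    df_list.append(pd.pivot_table(df_melt,
                         index=[col_name],
                         columns='value',
                         values='count',
                         aggfunc=lambda x: True,
                         fill_value=np.nan).reindex(df[col_name]).reset_index(drop=True))
df_final = pd.concat(df_list, axis=1)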

                      Test 1              Test 2    It Maybe  Will    it  \
0             Will this work  Is this even legit   NaN   NaN  True   NaN   
1         Maybe it will work                nope   NaN  True   NaN  True   
2  It probably will not work                nope  True   NaN   NaN   NaN   

    not probably  this  will  work    Is  even legit  nope  this  
0   NaN      NaN  True   NaN  True  True  True  True   NaN  True  
1   NaN      NaN   NaN  True  True   NaN   NaN   NaN  True   NaN  
2  True     True   NaN  True  True   NaN   NaN   NaN  True   NaN
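
Note that the output above contains two columns named this, one from each source column. The following is only an illustrative sketch, not part of the original answer: the two small indicator frames are made-up stand-ins for the per-column pivot results, and the "_" + column name suffix is an assumption borrowed from the replacements idea in the question. It shows how duplicate names can be detected and how `add_suffix` keeps them apart.

```python
import pandas as pd

# Made-up stand-ins for the per-column indicator frames built in the loop
flags_test1 = pd.DataFrame({"this": [True, False], "work": [True, True]})
flags_test2 = pd.DataFrame({"this": [True, False], "nope": [False, True]})

# Plain concatenation keeps both 'this' columns under the same name
combined = pd.concat([flags_test1, flags_test2], axis=1)
has_duplicates = bool(combined.columns.duplicated().any())
print(has_duplicates)            # selecting combined["this"] now yields two columns

# Suffixing each frame with its source column name keeps the labels unique
labelled = pd.concat(
    [flags_test1.add_suffix("_Test 1"), flags_test2.add_suffix("_Test 2")],
    axis=1,
)
print(list(labelled.columns))
```

With unique labels, `labelled["this_Test 1"]` selects a single Series, whereas `combined["this"]` returns a two-column DataFrame.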

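Updated answer

It looks like the only difference between this answer and the previous one is that you want to keep an extra column, Test 2. The following will accomplish that: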

splitter = ' '
df_split = pd.concat((df, df['Test 1'].str.split(splitter, expand=True)), axis=1)
df_melt = pd.melt(df_split, id_vars=['Test 1', 'Test 2'], var_name='count')
df_pivot = pd.pivot_table(df_melt, 
                     index=['Test 1', 'Test 2'], 
                     columns='value', 
                     values='count', 
                     aggfunc=lambda x: True, 
                     fill_value=np.nan)\
             .reset_index('Test 2')\
             .reindex(df['Test 1'])\
             .reset_index()

df_pivot.columns.name = ''

                      Test 1              Test 2    It Maybe  Will    it  \
0             Will this work  Is this even legit   NaN   NaN  True   NaN   
1         Maybe it will work                nope   NaN  True   NaN  True   
2  It probably will not work                nope  True   NaN   NaN   NaN   

    not probably  this  will  work  
0   NaN      NaN  True   NaN  True  
1   NaN      NaN   NaN  True  True  
2  True     True   NaN  True  True
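
One caveat with this chain, sketched below on made-up data rather than the asker's real DataFrame: after `.reset_index('Test 2')` the index holds only the Test 1 text, so if the same text occurs with two different Test 2 values the index contains duplicate labels and `.reindex(...)` raises a ValueError. That is probably the source of the "cannot reindex from a duplicate axis" error in the question, although this has not been checked against the real data, and the exact wording of the message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical frame shaped like the pivot after .reset_index('Test 2'):
# the label "same text" occurs twice in the index
pivot_like = pd.DataFrame(
    {"Test 2": ["nope", "maybe", "nope"], "work": [True, True, True]},
    index=pd.Index(["same text", "same text", "other text"], name="Test 1"),
)

# Cheap check before reindexing: which index labels are repeated?
duplicated_labels = pivot_like.index[pivot_like.index.duplicated()].unique().tolist()
print(pivot_like.index.is_unique, duplicated_labels)

try:
    pivot_like.reindex(["other text", "same text"])
    raised = False
except ValueError as exc:
    raised = True
    print("reindex failed:", exc)
```

Checking `index.is_unique` on the intermediate frame before calling `reindex` is a quick way to confirm whether repeated text values are the cause.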

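Old answer

You need to provide a sample DataFrame together with the expected result to get better and faster answers. This is a shot in the dark. I will first create a sample DataFrame with some fake data and then try to provide a solution.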

import pandas as pd

# create fake data
df = pd.DataFrame({'col1':['here is some text', 'some more text', 'finally some different text']})

Output of df

                          col1
0            here is some text
1               some more text
2  finally some different text

Split each value in col1 on the splitter (a space in this case)

col_name = 'col1'
splitter = ' '
df_split = pd.concat((df[col_name], df[col_name].str.split(splitter, expand=True)), axis=1)

Output of df_split

                          col1        0     1          2     3
0            here is some text     here    is       some  text
1               some more text     some  more       text  None
2  finally some different text  finally  some  different  text

Put all of the splits into a single column

df_melt = pd.melt(df_split, id_vars='col1', var_name='count')

Output of df_melt

                           col1 count      value
0             here is some text     0       here
1                some more text     0       some
2   finally some different text     0    finally
3             here is some text     1         is
4                some more text     1       more
5   finally some different text     1       some
6             here is some text     2       some
7                some more text     2       text
8   finally some different text     2  different
9             here is some text     3       text
10               some more text     3       None
11  finally some different text     3       text

Finally, pivot the DataFrame above so that the columns are the split words

pd.pivot_table(df_melt, index='col1', columns='value', values='count', aggfunc=lambda x: True, fill_value=False)

Output

value                       different finally   here     is   more  some  text
col1                                                                          
finally some different text      True    True  False  False  False  True  True
here is some text               False   False   True   True  False  True  True
some more text                  False   False  False  False   True  True  True
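
As a possible alternative to the split/melt/pivot steps above (not part of the original answer), pandas also provides `Series.str.get_dummies`, which builds the indicator columns in one call and keeps them aligned with the original rows, so no reindexing is needed. The helper name `expand_col_with_dummies` is hypothetical, the sketch returns True/False flags (like the pivot above) rather than True/NaN, and it has not been benchmarked on a frame of the size described in the question.

```python
import pandas as pd

def expand_col_with_dummies(df, col_name, splitter):
    """Hypothetical helper: append one boolean column per distinct substring."""
    # str.get_dummies splits each string on `splitter` and returns a 0/1
    # indicator frame that shares df's index, one column per distinct token
    flags = df[col_name].str.get_dummies(sep=splitter).astype(bool)
    return pd.concat([df, flags], axis=1)

# Example data taken from the question
df = pd.DataFrame({
    "Test 1": ["Will this work", "Maybe it will work", "It probably will not work"],
    "Test 2": ["Is this even legit", "nope", "nope"],
})

result = expand_col_with_dummies(df, "Test 1", " ")
print(result)
```

If a substring happens to equal an existing column name, the result will contain duplicate column labels, so renaming the new columns (for example with a suffix) may be needed on real data.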

Regarding python - Add multiple bool columns to a DataFrame based on parsed text, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41533822/
