
python - pandas: split a string column into multiple columns and name the columns dynamically

Repost. Author: 行者123. Last updated: 2023-12-02 01:57:29

My question is similar to this one and this one, but I could not get their solutions to work for my problem.

I have a dataframe that looks like this:

    study_id    fuzzy_market
0 study1 [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
1 study2 [Country: Germany], [Management experience: Yes]
2 study3 [Country: United Kingdom], [Language: English]
3 study4 [Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]
4 study5 [Age: 48-99]

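I would like it to look like this:

study_id  Age    Country of Birth  Country           Language  Management experience
study1    18-67  Austria, Germany  Austria, Germany  German    None
study2    None   None              Germany           None      Yes
study3    None   None              United Kingdom    English   None
study4    18-67  Austria, Germany  Austria, Germany  German    None
study5    48-99  None              None              None      None

So there is one row per study_id, the text before each colon in fuzzy_market becomes a column header, and the text after each colon becomes the data in the cell. If a column has no data for a given row, I would like to fill it with None. All columns can be strings. I don't know how many columns there will be, so it needs to be dynamic.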

Here is the setup and the data:

import pandas as pd
import numpy as np
import re

np.random.seed(12345)

df = pd.DataFrame.from_dict({
    'study_id': {0: 'study1',
                 1: 'study2',
                 2: 'study3',
                 3: 'study4',
                 4: 'study5'},
    'fuzzy_market': {0: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
                     1: '[Country: Germany], [Management experience: Yes]',
                     2: '[Country: United Kingdom], [Language: English]',
                     3: '[Age: 18-67], [Country of Birth: Austria, Germany], [Country: Austria, Germany], [Language: German]',
                     4: '[Age: 48-99]'}})

So far I have tried manipulating the strings in the fuzzy_market column, but I don't think this approach is right.

# a function to strip the square brackets, as I'm not sure this is really a list in here
def remove_square_brackets(x):
    return re.sub(r"[\[\]]", "", x)

# make a new dataframe where there are new columns for data after every comma
df2 = df.join(df['fuzzy_market'].apply(remove_square_brackets).str.split(',', expand=True))

# rename the columns arbitrarily - these will need to be the question titles eventually e.g. Age rather than A, Country of Birth rather than B etc.
df2.columns = ('study_id', 'fuzzy_market', 'A', 'B', 'C', 'D', 'E', 'F')

# try and split again
df3 = df2[['study_id','A', 'B']].join(df2['A'].str.split(":", expand=True).rename(columns={0:'A1', 1:'A2'})).join(df2['B'].str.split(":", expand=True).rename(columns={0:'B1', 1:'B2'}))

# this isn't quite there yet
df3

study_id A B A1 A2 B1 B2
0 study1 Age: 18-67 Country of Birth: Austria Age 18-67 Country of Birth Austria
1 study2 Country: Germany Management experience: Yes Country Germany Management experience Yes
2 study3 Country: United Kingdom Language: English Country United Kingdom Language English
3 study4 Age: 18-67 Country of Birth: Austria Age 18-67 Country of Birth Austria
4 study5 Age: 48-99 None Age 48-99 None None

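Thanks for any help or tips!

Best answer

We can use findall to extract all matching key-value pairs from each row, then map those pairs to dicts and build a dataframe from them: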

p = df['fuzzy_market'].str.findall(r'([^:\[]+): ([^\]]+)')
df[['study_id']].join(pd.DataFrame(map(dict, p)))

  study_id    Age  Country of Birth           Country Language Management experience
0 study1 18-67 Austria, Germany Austria, Germany German NaN
1 study2 NaN NaN Germany NaN Yes
2 study3 NaN NaN United Kingdom English NaN
3 study4 18-67 Austria, Germany Austria, Germany German NaN
4 study5 48-99 NaN NaN NaN NaN
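
To see why this works, it can help to run the pattern on a single string first. The sketch below reuses the same pattern on a cut-down two-row frame; the names PATTERN, row, record and wide are illustrative only and do not come from the original answer. The last step is one possible way to get None instead of NaN, as the question asks for. It assumes all columns hold strings and is not part of the original answer.

```python
import re

import pandas as pd

# Same pattern as above: group 1 is the text before the colon (no ':' or '['),
# group 2 is everything after ': ' up to the closing bracket.
PATTERN = r'([^:\[]+): ([^\]]+)'

row = '[Country: Germany], [Management experience: Yes]'

# findall returns a list of (key, value) tuples, one per bracketed item.
pairs = re.findall(PATTERN, row)

# dict() turns that list into a single record, keyed by the future column names.
record = dict(pairs)
print(record)

# A cut-down frame (hypothetical sample, two of the five rows from the question).
df = pd.DataFrame({
    'study_id': ['study2', 'study5'],
    'fuzzy_market': [row, '[Age: 48-99]'],
})

p = df['fuzzy_market'].str.findall(PATTERN)
wide = df[['study_id']].join(pd.DataFrame(map(dict, p)))

# Assumption: None is wanted in empty cells rather than NaN.
# Casting to object first lets the cells hold None.
wide = wide.astype(object).where(wide.notna(), None)
print(wide)
```

Because each row becomes its own dict, the set of columns is simply the union of all keys seen, so nothing has to be known about the column names in advance.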

Regarding "python - pandas: split a string column into multiple columns and name the columns dynamically", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/69453499/
