
Python: reading a csv with duplicate values from a txt file

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 20:25:30

I have a text file that contains information about departments and managers. A sample section of the txt:

department: sale
group : building a
manager::sergey

department: hr
group : building a
manager::tom
location:somewhereelse

department: health
group : building b
manager::jeniffer
manager::billy

department: security
group : building b
manager::john

There is a blank line after the manager name(s) between departments, and a space after each piece of information on a line.

I use pd.read_csv and then transform that csv to build columns for dep, group, manager and location.

There are 2 problems:

1 - Sometimes a department (like health) has 2 managers. My code only works for a single manager and fails when a department has two. How can I add an entry for the other manager?

2 - Sometimes the hr department has more lines than the others: there is extra information about the location. That makes building a regular df harder. I guess it needs a dynamic structure :(

An example of what it should look like:

dep       group        manager    Location
sale      building a   sergey     ""
hr        building a   tom        somewhereelse
health    building b   jeniffer   ""
health    building b   billy      ""

What can I do?

My code:

df = pd.read_csv('sample.txt', sep="\n")
df = df.replace({'# department: ': " ", '# group:': " ", '# manager:': " "}, regex=True)
ab = pd.DataFrame(df.values.reshape(-1, 7),
                  columns=["department", "group", "manager"])

Best answer

You can do it as follows:

import io
import pandas as pd

# read the file in the structure it is in,
# using a consecutive series of : as the
# field separator and cutting off the spaces
# before and after the colon;
# the two fields are named _key and _value;
# skip the blank lines
df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s*:+\s*',
    skip_blank_lines=True,
    names=['_key', '_value'],
    engine='python')

# now build groups to identify which
# rows belong together,
# assuming each group begins with
# a line containing department in the
# key column
group_ser = df['_key'] == 'department'
df['_group'] = group_ser.cumsum()
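As a small side illustration (toy data, not the question's file): comparing the key column against 'department' and taking the cumulative sum assigns one group number per record:

```python
import pandas as pd

# toy key column: a new record starts at each 'department' row
keys = pd.Series(['department', 'group', 'manager',
                  'department', 'manager'])

# True where a new record starts; cumsum turns the running
# count of record starts into a group id for every row
groups = (keys == 'department').cumsum()
print(groups.tolist())  # [1, 1, 1, 2, 2]
```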

# now get rid of the duplicate manager
# rows by putting the manager values
# into a list
df_agg_managers = (df[df['_key'] == 'manager']
                   .groupby(['_group', '_key'])
                   .agg({'_value': list}))
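A minimal standalone sketch of this list aggregation, on toy data (the names are made up for illustration):

```python
import pandas as pd

# toy frame: group 2 has two manager rows
df = pd.DataFrame({'_group': [1, 2, 2],
                   '_key': ['manager'] * 3,
                   '_value': ['sergey', 'jeniffer', 'billy']})

# collapse the duplicate manager rows into one list per group
agg = df.groupby(['_group', '_key']).agg({'_value': list})
print(agg.loc[(2, 'manager'), '_value'])  # ['jeniffer', 'billy']
```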

# the rest actually does not need to be
# aggregated according to your description,
# so just add the index structure so it
# matches the one of df_agg_managers
df_rest = df[df['_key'] != 'manager'].set_index(['_group', '_key'])
df_all = pd.concat([df_agg_managers, df_rest], axis='index')

# now use the content of the _key column
# to build columns (like in a pivot operation)
df_unstacked = df_all.unstack()
# get rid of the first level in the column names
df_unstacked.columns = df_unstacked.columns.get_level_values(1)
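To illustrate what unstack does at this step, here is a minimal standalone example with an assumed two-row index:

```python
import pandas as pd

# a two-level index (_group, _key) with one value column,
# mimicking the shape of df_all above
df = pd.DataFrame(
    {'_value': ['sale', 'building a']},
    index=pd.MultiIndex.from_tuples(
        [(1, 'department'), (1, 'group')], names=['_group', '_key']))

# unstack moves the _key level into the columns
wide = df.unstack()
# columns are now ('_value', 'department') and ('_value', 'group');
# keep only the second level as the column names
wide.columns = wide.columns.get_level_values(1)
print(wide.loc[1, 'department'])  # 'sale'
```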

# count the number of entries per manager column
len_ser = df_unstacked['manager'].map(len)
# below we loop only over the rows that contain
# lists with at least one element, so we need an
# extra treatment for the rows with empty lists or NaN
df_result = df_unstacked.loc[(len_ser < 1) | df_unstacked['manager'].isna()].reset_index()
# now create one row per entry in the manager-list
for i in range(len_ser.max()):
    df_work = df_unstacked.loc[len_ser > i].copy()
    df_work['manager'] = df_work['manager'].map(lambda lst: lst[i])
    df_result = pd.concat([df_result, df_work.reset_index()],
                          axis='index', ignore_index=True)

# apply some final cosmetics:
# sort the rows so they appear groupwise,
# then remove the _group column
df_result = df_result.sort_values('_group').drop(['_group'], axis='columns')
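As a side note: with pandas 0.25 or newer, the loop over the manager-list entries can usually be replaced by DataFrame.explode, which emits one row per list element. A minimal sketch on toy data (not the question's full frame):

```python
import pandas as pd

# toy wide frame with a list-valued manager column
df = pd.DataFrame({'department': ['health', 'security'],
                   'manager': [['jeniffer', 'billy'], ['john']]})

# one row per manager; the other columns are repeated
exploded = df.explode('manager').reset_index(drop=True)
print(exploded['manager'].tolist())  # ['jeniffer', 'billy', 'john']
```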

The result is:

_key   department   group        location        manager
0      sale         building a   NaN             sergey
1      hr           building a   somewhereelse   tom
2      health       building b   NaN             jeniffer
4      health       building b   NaN             billy
3      security     building b   NaN             john

if run on the following test data:

raw="""department: sale
group : building a
manager::sergey

department: hr
group : building a
manager::tom
location:somewhereelse

department: health
group : building b
manager::jeniffer
manager::billy

department: security
group : building b
manager::john"""

Note: if the assumption that every group starts with a department line is wrong, you can also use the blank lines to group the content. Replace the first few lines with:

df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s*:+\s*',
    skip_blank_lines=False,
    names=['_key', '_value'],
    engine='python')

group_ser = df['_key'].isna()
df['_group'] = group_ser.cumsum() + 1
df.drop(group_ser[group_ser].index, axis='index', inplace=True)

Regarding "Python: reading a csv with duplicate values from a txt file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57852652/
