
python - Creating a DataFrame from inconsistently named columns

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:11:00

I have a pandas.DataFrame with redundant column names, caused by inconsistent naming across the source files (.csv). As a result, the columns are mostly NaN:

Bike #  Bikenumber  Bike#   SubscriberType  SubscriptionType
NaN NaN W20848 NaN Subscriber
NaN NaN W20231 NaN Subscriber
NaN NaN W00785 NaN Subscriber
NaN NaN W00126 NaN Subscriber
NaN NaN W20929 NaN Casual

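Is there a way to create a new column and fill it from whichever of several columns has a value? And if more than one column is not NaN, can I choose which column the value is taken from?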

 Bike#   Bikenumber   Bike #   Selected_Num
number1 number2 NaN number2

I can get this when I fill from a single column:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'])
print(sample)

Bike # Bikenumber Bike# SubscriberType SubscriptionType Bike_Num
NaN NaN W20848 NaN Subscriber W20848
NaN NaN W20231 NaN Subscriber W20231
NaN NaN W00785 NaN Subscriber W00785
NaN NaN W00126 NaN Subscriber W00126
NaN NaN W20929 NaN Casual W20929

But this fails:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
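The call above fails because fillna takes a single fill value, so the second Series is not treated as another fallback. As a minimal sketch of filling from several columns in a chosen priority order, the columns can be listed in that order and back-filled across each row; the sample values and the `priority` list below are made up for illustration and are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: each row has a value in a different column,
# and the second row has values in two columns at once.
sample = pd.DataFrame({
    "Bike #": [np.nan, "W00001", np.nan],
    "Bikenumber": [np.nan, "W99999", "W00002"],
    "Bike#": ["W20848", np.nan, np.nan],
})

# Columns listed from highest to lowest priority (an assumption for this sketch).
priority = ["Bike #", "Bikenumber", "Bike#"]

# Back-fill along each row, then keep the first column: this picks the
# first non-NaN value in priority order.
sample["Bike_Num"] = sample[priority].bfill(axis=1).iloc[:, 0]

# Equivalent spelling with chained fillna calls, one fallback at a time.
chained = (
    sample["Bike #"]
    .fillna(sample["Bikenumber"])
    .fillna(sample["Bike#"])
)

print(sample)
```

Reordering `priority` changes which column wins when more than one holds a value.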

Best answer

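I suggest fixing this while reading the CSVs rather than trying to repair it afterwards. One way is to run each CSV file through a small parser before handing it to pandas.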

The parser takes an open file handle for the CSV and a dict that maps each desired column name to its possible synonyms.

Code:

import csv

def read_my_csv(file_handle, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and map columns to desired names
    header = next(reader)
    header = [synoms.get(c, c) for c in header]

    # yield the header
    yield header

    # yield the remaining rows
    for row in reader:
        yield row

Test code:

import pandas as pd
import csv

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# newline='' is what the csv module recommends; the old 'rU' mode
# is no longer accepted by recent Python versions
with open("sample.csv", newline='') as f:
    generator = read_my_csv(f, column_map)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

sample.csv:

Bike #,SubscriptionType
W20848,Subscriber
W20231,Subscriber
W00785,Subscriber
W00126,Subscriber
W20929,Casual

Result:

  Bike_Num     Sub_Num
0 W20848 Subscriber
1 W20231 Subscriber
2 W00785 Subscriber
3 W00126 Subscriber
4 W20929 Casual
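To try the parser without a file on disk, here is a minimal sketch that feeds it an in-memory CSV through io.StringIO. The CSV contents, including the extra `Duration` column, are made up for illustration; the dict comprehension is just another way to build the same reversed synonym lookup.

```python
import csv
import io

import pandas as pd


def read_my_csv(file_handle, column_map):
    # Reverse the mapping: each synonym points to its desired column name.
    synonyms = {syn: name for name, syns in column_map.items() for syn in syns}
    reader = csv.reader(file_handle)
    # Rename the header; names without a synonym entry pass through unchanged.
    header = next(reader)
    yield [synonyms.get(c, c) for c in header]
    # Yield the data rows as-is.
    yield from reader


column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# Hypothetical file contents using a different set of header spellings.
handle = io.StringIO("Bikenumber,SubscriberType,Duration\nW00785,Casual,12\n")

rows = read_my_csv(handle, column_map)
columns = next(rows)
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Because csv.reader yields strings, every column (including `Duration` here) arrives as text and would need converting afterwards if numeric types matter.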

Solution #2

A cleaner, but less interesting, solution is to simply rename the columns before doing the concat:

Code:

def fix_column_names(df, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synoms = dict(sum([
        [(syn, k) for syn in v] for k, v in column_map.items()], []))

    # rename columns (modifies df in place)
    df.columns = [synoms.get(c, c) for c in df.columns]

Test code:

import pandas as pd
import csv

column_map = {
'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

df = pd.read_csv('sample.csv', header=0)
fix_column_names(df, column_map)
print(df)
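The rename-then-concat idea can be sketched end to end as follows. The two small frames stand in for separately loaded CSV files and are made up for illustration; DataFrame.rename is used here instead of assigning to df.columns, which is a swapped-in variation rather than the code above.

```python
import pandas as pd

column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# Reverse the mapping: each synonym points to its desired column name.
synonyms = {syn: name for name, syns in column_map.items() for syn in syns}

# Hypothetical frames, as if read from two files with different headers.
df_a = pd.DataFrame({'Bike #': ['W20848'], 'SubscriptionType': ['Subscriber']})
df_b = pd.DataFrame({'Bike#': ['W20929'], 'SubscriberType': ['Casual']})

# Rename first, then concatenate, so matching columns line up
# instead of producing mostly-NaN duplicates.
combined = pd.concat(
    [df.rename(columns=synonyms) for df in (df_a, df_b)],
    ignore_index=True,
)
print(combined)
```

rename returns a new frame and ignores mapping keys that are not present, so the same `synonyms` dict can be applied to every file.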

Regarding "python - Creating a DataFrame from inconsistently named columns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44211340/
