python - Replace missing values with the median/mean for continuous variables and the mode for categorical variables in a pandas DataFrame, after grouping the data by a column


Reposted by: 太空宇宙 | Updated: 2023-11-04 03:35:10

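I have a pandas dataframe in which every missing value is np.nan, and I am trying to replace those missing values. The last column of my data is "class". I need to group the data by class, compute the mean/median/mode of each column within each group (depending on whether the column is categorical or continuous, and normally distributed or not), and replace that group's missing values with the corresponding mean/median/mode.

Here is the code I came up with. I know it is overkill. If I could:

group the columns of the dataframe by class
get the median/mode/mean of each column group
replace the missing values in those groups
recombine them into the original df

that would be great.

For now, though, I have ended up computing the replacement values (mean/median/mode) per group and storing them in a dictionary, then separating the tuples that contain nan from those that do not, replacing the missing values in the nan tuples, and trying to join them back into the dataframe (which I don't know how to do yet).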

import collections

import pandas as pd
from scipy import stats


def fillMissing(df, dataType):
    '''
    Args:
        df ( 2d array/ Dict):
                             eg : {'attribute1': [12, 24, 25] , 'attribute2': ['good', 'bad']}
        dataType (dict): Dictionary of attribute names of df as keys and values 0/1
                            indicating categorical/continuous variable eg:  {'attribute1': 1, 'attribute2': 0}

    Returns:
        dataframe with missing values filled
        writes a file with missing values replaced.
    '''
    dataLabels = list(df.columns.values)

    # the dictionary to hold the values to put in place of nan
    replaceValues = {}

    for eachlabel in dataLabels:

        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:                        # if it's a continuous variable
            # by default normaltest returns nan when the input contains nan, so drop them first
            _, pval = stats.normaltest(thisSer.dropna())
            groupedd = thisSer.groupby(df['class'])

            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:                              # threshold as posted; 0.05 is the more usual level
                    groupMiddle = group.median()            # get the median of the group
                else:
                    groupMiddle = group.mean()              # get mean (if the data looks normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict

        else:                                               # if the series is categorical
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group.dropna())  # ignore nan when counting
                mostFreq = freqC.most_common(1)             # get the most frequent value of the attribute (grouped by class)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')

    # split the rows into those that contain nulls and those that do not
    mask = df.isnull().any(axis=1)
    dfnulls = df[mask]
    dfnotNulls = df[~mask]

    # DataFrame.append returned a new frame (and was removed in pandas 2.0),
    # so the filled rows are collected in a list instead
    filledRows = []
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        filledRows.append(row)
        newfile.write("\n")
    newfile.close()

    # here add filledRows to dfnotNulls to get finaldf
    # (not implemented: this is the step being asked about, so finaldf is still undefined)

    return finaldf

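Best answer

If I understand you correctly, this is mostly covered in the documentation, but it is probably not where you would think to look when asking this question. See the note about mode at the bottom, since it is slightly trickier than mean and median.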

import numpy as np
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]}, index=[1, 1, 1, 1, 2, 2, 2, 2])

# fill each group's NaNs with that group's own mean / median / mode
df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
df['v_med'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))

df
    v    v_mean  v_med  v_mode
1   1  1.000000      1       1
1   2  2.000000      2       2
1   2  2.000000      2       2
1 NaN  1.666667      2       2
2   3  3.000000      3       3
2   4  4.000000      4       4
2   4  4.000000      4       4
2 NaN  3.666667      4       4
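
The same groupby/transform pattern should carry over to the question's setup, where the grouping key is a 'class' column rather than the index and the columns are a mix of continuous and categorical. The following is only a minimal sketch under assumptions: the sample data, the column names 'age' and 'workclass', and the helper fill_by_group are all hypothetical, and the continuous column is always filled with the group median (the normality check from the question is left out). Because transform returns a result aligned to the original index, no splitting and rejoining of rows is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 50.0, np.nan, 70.0],
    "workclass": ["private", "private", np.nan, "gov", "gov", np.nan],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# 1 = continuous, 0 = categorical, mirroring the question's dataType dict.
data_type = {"age": 1, "workclass": 0}


def fill_by_group(frame, types, by="class"):
    """Hypothetical helper: return a copy with NaNs filled per group."""
    out = frame.copy()
    for col, is_continuous in types.items():
        grouped = frame.groupby(by)[col]
        if is_continuous:
            # continuous column: use the group's median
            out[col] = grouped.transform(lambda s: s.fillna(s.median()))
        else:
            # categorical column: use the group's (first) mode
            out[col] = grouped.transform(lambda s: s.fillna(s.mode()[0]))
    return out


filled = fill_by_group(df, data_type)
print(filled)
```

Reading from frame and writing to out keeps the group statistics independent of the order in which columns are filled.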

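Note that mode() may not be unique, unlike mean and median, which is why pandas returns it as a Series. To get around that, I simply took the easiest route and added [0] to extract the first member of the Series.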
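
To illustrate the point about mode(), here is a small sketch of the two cases where [0] deserves care: ties, where mode() returns several values in sorted order, and groups that contain only NaN, where mode() returns an empty Series and [0] would raise. The helper first_mode is hypothetical and is only one possible way to handle the empty case; it leaves such groups unfilled.

```python
import numpy as np
import pandas as pd

# Two values tie for most frequent, so mode() returns both of them, sorted.
tied = pd.Series([2, 1, 2, 1, 3])
print(tied.mode().tolist())

# With only NaNs there is no mode: the result is empty, so [0] would fail.
empty_mode = pd.Series([np.nan, np.nan]).mode()
print(len(empty_mode))


def first_mode(s):
    """Hypothetical helper: first mode of s, or NaN if s has no non-null values."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Group 1 has a mode (2); group 2 is all NaN and is left as it is.
df = pd.DataFrame({"v": [1, 2, 2, np.nan, np.nan, np.nan]},
                  index=[1, 1, 1, 1, 2, 2])
df["v_mode"] = df.groupby(level=0)["v"].transform(
    lambda x: x.fillna(first_mode(x))
)
print(df)
```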

Regarding this topic, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29376392/
