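I have a pandas DataFrame in which all missing values are np.nan, and I am trying to replace them. The last column of my data is "class". I need to group the data by class, compute the mean/median/mode of each column within each group (depending on whether the data is categorical or continuous, normally distributed or not), and replace that group's missing values with the corresponding mean/median/mode.
This is the code I came up with. I know it is overkill. If I could: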
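- group the columns of the dataframe by class
- get the median/mode/mean of each column group
- replace the missing values in those groups
- recombine them into the original df
that would be great.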
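But for now I have ended up finding the replacement values (mean/median/mode) class-wise and storing them in a dictionary, then separating the tuples that contain nan from those that do not, replacing the missing values in the nan tuples, and trying to join them back into the dataframe (which I do not know how to do yet).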
import collections

import numpy as np
import pandas as pd
from scipy import stats


def fillMissing(df, dataType):
    '''
    Args:
        df (DataFrame): data with a 'class' column, e.g. built from
            {'attribute1': [12, 24, 25], 'attribute2': ['good', 'bad', 'good']}
        dataType (dict): attribute names of df as keys and values 0/1
            indicating categorical/continuous variable,
            e.g. {'attribute1': 1, 'attribute2': 0}
    Returns:
        dataframe with missing values filled
        Also writes a file with the missing values replaced.
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        groupedd = thisSer.groupby(df['class'])
        innerDict = {}
        if dataType[eachlabel] == 1:  # if it's a continuous variable
            # normality test on the whole column; NaNs are dropped first
            _, pval = stats.normaltest(thisSer.dropna())
            for name, group in groupedd:
                if pval < 0.05:  # normality rejected at the 5% level
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get mean (if data looks normal)
                innerDict[name.strip()] = groupMiddle
        else:  # if the series is categorical
            for name, group in groupedd:
                # count only non-missing values so NaN can never be the mode
                freqC = collections.Counter(group.dropna())
                # most frequent value of the attribute (grouped by class)
                mostFreq = freqC.most_common(1)
                innerDict[name.strip()] = mostFreq[0][0].strip()
        replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    mask = df.isnull().any(axis=1)
    # split into the tuples that contain nulls and those that do not
    dfnulls = df[mask].copy()
    dfnotNulls = df[~mask]
    with open('missingReplaced.csv', 'w') as newfile:
        for idx, row in dfnulls.iterrows():
            for colname in dataLabels:
                if pd.isnull(row[colname]):
                    # look up the replacement for this row's class
                    # ('>50K' or '<=50K' in my data)
                    dfnulls.at[idx, colname] = replaceValues[colname][row['class'].strip()]
                newfile.write(str(dfnulls.at[idx, colname]) + ",")
            newfile.write("\n")
    # here add the filled rows to dfnotNulls to get finaldf
    finaldf = pd.concat([dfnotNulls, dfnulls]).sort_index()
    return finaldf
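If I understand correctly, this is mostly covered in the documentation, though probably not in a place you would think to look if you are asking this question. See the note about mode at the bottom, since it is slightly more complicated than mean and median.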
import numpy as np
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]}, index=[1, 1, 1, 1, 2, 2, 2, 2])
# group on the index, then fill each group's NaNs with that group's statistic
df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
df['v_med'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))
df
     v    v_mean  v_med  v_mode
1    1  1.000000      1       1
1    2  2.000000      2       2
1    2  2.000000      2       2
1  NaN  1.666667      2       2
2    3  3.000000      3       3
2    4  4.000000      4       4
2    4  4.000000      4       4
2  NaN  3.666667      4       4
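The same pattern can be adapted to the layout in the question, where the group key is a column ("class") rather than the index. The following is only a minimal sketch under stated assumptions: the helper name fill_by_class, the sample frame, and the is_continuous mapping are hypothetical, and it simply uses the mean for continuous columns and the first mode for categorical ones, without the normality test from the question.

```python
import numpy as np
import pandas as pd


def fill_by_class(df, is_continuous, key="class"):
    """Fill NaNs per group of `key`: mean for continuous columns, first mode otherwise.

    `is_continuous` is a hypothetical dict mapping column name -> bool.
    """
    out = df.copy()
    for col, continuous in is_continuous.items():
        grouped = out.groupby(key)[col]
        if continuous:
            out[col] = grouped.transform(lambda s: s.fillna(s.mean()))
        else:
            # mode() ignores NaN by default; [0] picks the first of possibly several modes
            out[col] = grouped.transform(lambda s: s.fillna(s.mode()[0]))
    return out


# Hypothetical sample data, loosely modelled on the question's class labels
sample = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 20.0, np.nan, 40.0],
    "job": ["a", "a", np.nan, "b", "b", np.nan],
    "class": [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})
filled = fill_by_class(sample, {"age": True, "job": False})
```

Because transform returns a result aligned with the original rows, there is no separate step for splitting out the rows with nulls and joining them back.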
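Note that mode() may not be unique, unlike mean and median, which is why pandas returns it as a Series. To get around that, I just took the simplest route and added [0] to extract the first member of the Series.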
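To make the point about mode() concrete, here is a small sketch showing that a tie yields several values and that a group with no valid values yields an empty Series, in which case [0] would raise an error. The helper fill_with_first_mode is hypothetical, and leaving such a group unfilled is just one possible choice.

```python
import numpy as np
import pandas as pd

tied = pd.Series([1, 1, 2, 2, np.nan])
modes = tied.mode()  # 1.0 and 2.0 are both modes; pandas returns them sorted
empty = pd.Series([np.nan, np.nan]).mode()  # no non-NaN values -> empty Series


def fill_with_first_mode(s):
    """Hypothetical helper: fill NaNs with the first mode, or leave `s` as is
    when it has no non-NaN values (so there is no mode to take)."""
    m = s.mode()
    return s if m.empty else s.fillna(m.iloc[0])


df = pd.DataFrame({"v": [1, 1, 2, 2, np.nan, np.nan, np.nan]},
                  index=[1, 1, 1, 1, 1, 2, 2])
# group 1 has a tie (filled with the first mode); group 2 is all NaN (left unfilled)
df["v_mode"] = df.groupby(level=0)["v"].transform(fill_with_first_mode)
```

Taking the first mode is an arbitrary tie-break; if the choice matters for your data, you would need a rule of your own.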