python - Finding outliers in data

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:48:40

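I am trying to find outliers in a seconds value using the standard deviation. I have two data frames, shown below. The outliers I am looking for are values more than 1.5 standard deviations away from the mean for each day of the week. The current code is below the data frames.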

df1:

name    dateTime              Seconds
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
john 2015-01-02 13:13:13 12345.0101
joe 2015-02-04 12:12:12 54321.0202
joe 2015-01-02 13:13:13 12345.0101

Current output: df2

name   day   standardDev        mean           count
Joe mon 22326.502700 40900.730647 1886
tue 9687.486726 51166.213836 159
john mon 10072.707891 41380.035108 883
tue 5499.475345 26985.938776 196

Expected output:

df2

name   day   standardDev        mean           count     events
Joe mon 22326.502700 40900.730647 1886 [2015-02-04 12:12:12, 2015-02-04 12:12:13]
tue 9687.486726 51166.213836 159 [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john mon 10072.707891 41380.035108 883 [2015-01-02 13:13:13, 2015-01-02 13:13:15]
tue 5499.475345 26985.938776 196 [2015-01-02 13:13:13, 2015-01-02 13:13:18]

Code:

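import glob

import numpy as np
import pandas as pd

# folderPath is assumed to be defined elsewhere as the directory holding the CSV files.
allFiles = glob.glob(folderPath + "/*.csv")
columns = ['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"]
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=columns)
    df = df.iloc[1:]  # drop the first row of each file (df.ix was removed in pandas 1.0)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# seconds elapsed since midnight
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# named aggregation replaces the nested-dict form of agg(), which was removed in pandas 1.0
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(mean='mean', std=lambda x: np.std(x), count='count'))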

Best answer

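This is a slight adaptation of the pandas docs. I didn't create columns for the mean and standard deviation, but you could easily add them if you want to see them.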

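import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

# group by name and day of week (0 = Monday), then standardize within each group
grp = df.groupby(['name', df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# rows more than 1.5 standard deviations from their group mean
df[df['zscore'].abs() > 1.5]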
Out[79]:
Seconds dateTime name zscore
1 4998.927011 2015-01-02 john -1.522488
42 5001.275866 2015-02-12 joe 1.636829
58 4999.124550 2015-02-28 joe -1.624945
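
The group mean and standard deviation mentioned above can be kept as their own columns. The following is a minimal sketch, assuming the same synthetic data as the answer; the column names `mean` and `std` are illustrative choices, not part of the original answer. Note that `transform('std')` and `x.std()` use the sample standard deviation (ddof=1), whereas `np.std(x)` in the question's code uses ddof=0, so the two will differ slightly for small groups.

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('2015-01-01', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})

# Group by name and day of week (0 = Monday).
grp = df.groupby(['name', df['dateTime'].dt.dayofweek])['Seconds']

# Broadcast each group's statistics back onto its rows.
df['mean'] = grp.transform('mean')
df['std'] = grp.transform('std')  # sample std (ddof=1), same as x.std()

# z-score computed from the stored columns.
df['zscore'] = (df['Seconds'] - df['mean']) / df['std']

# Rows more than 1.5 standard deviations from their group mean.
outliers = df[df['zscore'].abs() > 1.5]
print(outliers)
```

Keeping the statistics as columns makes it easy to check by eye why a given row was flagged.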

df.head(10)
Out[80]:
Seconds dateTime name zscore
0 4998.699990 2015-01-01 joe -0.959960
1 4998.927011 2015-01-02 john -1.522488
2 5000.790199 2015-01-03 joe 0.263690
3 4999.121735 2015-01-04 john -1.005137
4 5001.501822 2015-01-05 joe 1.132407
5 4999.976071 2015-01-06 john 0.678951
6 5000.275949 2015-01-07 joe 0.650297
7 4999.033607 2015-01-08 john -0.964222
8 4998.419685 2015-01-09 joe -1.328744
9 4999.796325 2015-01-10 john 1.224198
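
To get closer to the expected output in the question (one row per name and day, with an `events` list of outlier timestamps), the z-scores can be combined with a grouped summary. This is only a sketch under assumptions: it reuses the answer's synthetic data, the `day` column and the `keys`, `summary`, and `events` names are introduced here for illustration, and groups without outliers are given an empty list.

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('2015-01-01', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})
df['day'] = df['dateTime'].dt.strftime('%a')  # abbreviated weekday name

keys = ['name', 'day']
grp = df.groupby(keys)['Seconds']

# z-score of each row within its (name, day) group.
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# Per-group statistics, named after the columns in the question.
summary = grp.agg(standardDev='std', mean='mean', count='count')

# Collect the timestamps of the outlier rows into one list per group.
events = (
    df[df['zscore'].abs() > 1.5]
    .groupby(keys)['dateTime']
    .apply(list)
    .rename('events')
)

# Left join on the (name, day) index; groups without outliers get NaN,
# which is replaced by an empty list here.
df2 = summary.join(events)
df2['events'] = df2['events'].apply(lambda v: v if isinstance(v, list) else [])
print(df2)
```

With real data, `Seconds` and the grouping columns would come from the concatenated CSV frame instead of the random sample.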

Regarding python - finding outliers in data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41527280/
