
python - Why do pandas 2min buckets print NaN even though all my row values are numbers (not NaN)?

Reposted. Author: 行者123. Updated: 2023-12-01 09:18:13

I know the response_bytes column in my data has no NaN values, because when I run data[data.response_bytes.isna()].count() the result is 0.

When I compute the mean over 2-minute buckets and then call head, I get NaN:

print(data.reset_index().set_index('time').resample('2min').mean().head())

                     index  identity  user  http_code  response_bytes  unknown
time
2018-01-31 09:26:00    0.5       NaN   NaN      200.0           264.0      NaN
2018-01-31 09:28:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:30:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:32:00    NaN       NaN   NaN        NaN             NaN      NaN
2018-01-31 09:34:00    NaN       NaN   NaN        NaN             NaN      NaN

Why do the time-bucketed means of response_bytes contain NaN values?

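I wanted to experiment with time buckets and understand how they work in pandas. So I used the log file http://www.cs.tufts.edu/comp/116/access.log as input data, loaded it into a pandas DataFrame, applied 2-minute time buckets (for the first time in my life) and ran mean(). I did not expect to see any NaN in the response_bytes column, because none of its values are NaN.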

Here is my full code:

import urllib.request
import pandas as pd
import re
from datetime import datetime
import pytz

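# Use the fully qualified option name; the bare 'max_columns' is ambiguous in recent pandas
pd.set_option('display.max_columns', 10)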

def parse_str(x):
    """
    Returns the string delimited by two characters.

    Example:
        `>>> parse_str('[my string]')`
        `'my string'`
    """
    return x[1:-1]

def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
    # Strip the leading '[' and the trailing ' +0000]' before parsing
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Convert the '+HHMM' zone suffix into an offset in minutes
    dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))

# data = pd.read_csv(StringIO(accesslog))
url = "http://www.cs.tufts.edu/comp/116/access.log"
accesslog = urllib.request.urlopen(url).read().decode('utf-8')
fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto',
'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']

data = pd.read_csv(url, sep=' ', header=None, names=fields, na_values=['-'])

# With sep=' ', pandas splits the bracketed date into two columns, so we must concatenate them
time = data.time_part1 + data.time_part2
time_trimmed = time.map(lambda s: re.split('[-+]', s.strip('[]'))[0]) # Drop the timezone for simplicity
data['time'] = pd.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')

data.head()

print(data.reset_index().set_index('time').resample('2min').mean().head())

I expected the time-bucketed means of the response_bytes column not to be NaN.

Best answer

This is expected behavior: resampling converts the data to regular time intervals, so any interval that contains no samples yields NaN.
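To see the effect on a small scale, here is a minimal sketch using made-up data. The `toy` DataFrame and its three timestamps are assumptions for illustration only and are not taken from the access log. No value in the column is NaN, yet the buckets that contain no rows still appear in the result.

```python
import pandas as pd

# Hypothetical toy data: two rows in the 09:26 bucket, then a gap until 09:33.
toy = pd.DataFrame(
    {"response_bytes": [200, 328, 500]},
    index=pd.to_datetime(
        ["2018-01-31 09:26:10", "2018-01-31 09:27:30", "2018-01-31 09:33:05"]
    ),
)

# resample builds a regular 2-minute grid from the first to the last timestamp;
# buckets with no rows have nothing to average, so their mean is NaN.
means = toy["response_bytes"].resample("2min").mean()
print(means)
```

The input has no missing values, but the 09:28 and 09:30 buckets are empty, so their means come out as NaN.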

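In this data there are no rows with datetimes in some of the 2-minute intervals, for example between 2018-01-31 09:28:00 and 2018-01-31 09:30:00, so mean has nothing to average there and returns NaN.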

print (data[data['time'].between('2018-01-31 09:28:00','2018-01-31 09:30:00')])
Empty DataFrame
Columns: [host, identity, user, time_part1, time_part2, cmd_path_proto,
http_code, response_bytes, referer, user_agent, unknown, time]
Index: []

[0 rows x 12 columns]
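If the empty buckets are not wanted, one option is to count the rows per bucket alongside the mean and keep only the buckets that contain data. The following is a sketch under assumptions: the series `s` and its timestamps are invented for illustration, and filtering on the row count is just one possible approach, not something the original answer prescribes.

```python
import pandas as pd

# Hypothetical toy series with a gap between 09:27 and 09:33.
s = pd.Series(
    [200, 328, 500],
    index=pd.to_datetime(
        ["2018-01-31 09:26:10", "2018-01-31 09:27:30", "2018-01-31 09:33:05"]
    ),
    name="response_bytes",
)

buckets = s.resample("2min")

# size() reports how many rows fell into each bucket (0 for empty buckets),
# which makes it easy to tell an empty bucket from real missing data.
summary = pd.DataFrame({"rows": buckets.size(), "mean": buckets.mean()})

# Keep only the buckets that actually contain rows.
non_empty = summary[summary["rows"] > 0]
print(non_empty)
```

Filtering on the row count, rather than calling dropna() on the means, keeps buckets whose rows exist but hold NaN values, so the two cases stay distinguishable.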

Regarding "python - Why do pandas 2min buckets print NaN even though all my row values are numbers (not NaN)?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51037433/
