python - Resampling a DataFrame into a new DataFrame while performing some extra operations

Reposted by: 太空狗. Updated: 2023-10-30 00:18:22

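I am working with a DataFrame in which every entry (row) has a start time, a duration, and other attributes. From it I want to build a new DataFrame in which each original entry is split into 15-minute intervals, with all other attributes kept the same. The number of rows in the new DataFrame per original entry depends on that entry's actual duration.

At first I tried pandas' resample, but it did not do exactly what I expected. I then built a function with itertuples() that works well, but it took about half an hour for a DataFrame of roughly 3000 rows. Now I want to do the same for 2 million rows, so I am looking for other options.

Suppose I have the following DataFrame: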

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])  # parse the strings into datetimes
print(testdf)

>>>testdf
                 start  duration Attribute_A  id
0  2018-01-05 11:48:00        22         abc   1
1  2018-05-04 09:05:00         8         def   2
2  2018-08-09 07:15:00        35         hij   3
3  2018-09-27 15:00:00         2         klm   4

I would like my result to look like this:

>>>resultdf
                start  duration Attribute_A  id
0 2018-01-05 11:45:00        12         abc   1
1 2018-01-05 12:00:00        10         abc   1
2 2018-05-04 09:00:00         8         def   2
3 2018-08-09 07:15:00        15         hij   3
4 2018-08-09 07:30:00        15         hij   3
5 2018-08-09 07:45:00         5         hij   3
6 2018-09-27 15:00:00         2         klm   4

Here is the function I built with itertuples, which produces the desired result (the one shown above):

# Note: DataFrame.append was removed in pandas 2.0, so this runs as written only on older versions.
def min15_divider(df, newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15 # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15: # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else: # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)

                else: # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,'duration': final_dur, 'id':row.id}
                    newdf = newdf.append(to_append, ignore_index=True)

        else: # When it is not an exact multiple of 15 min
            new_min = orig_min - remains # convert to multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains # remaining minutes in the initial interval
            while cumu_dur < row.duration: # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15

                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)

            else: # when we reach the last interval or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration # original duration less than remaining minutes in first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15) # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf

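Is there any other way to do this without itertuples that would save me some time?

Thanks in advance.

P.S. I apologize for anything in my post that may look a bit odd; this is the first time I have asked a question on Stack Overflow myself.

Edit

Many entries can share the same start time, so a .groupby on 'start' may be problematic. However, every entry has a column with a unique value, simply called "id".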

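Best answer

Using resample is a good idea, but since each row only has a start time, you first need to build an end row before it can be used.

The code below assumes that every start time in the 'start' column is unique, so that groupby can be used in a somewhat unusual way: each group then contains exactly one row.
I use groupby because it automatically recombines the DataFrames produced by the custom function passed to apply.
Also note that the 'duration' column is converted to a timedelta in minutes, which makes the arithmetic later on easier.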

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
# 'min' is used instead of the 'T' alias, which recent pandas versions deprecate
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'min')
print(testdf)

def calcduration(df, starttime):
    # Overwrite 'duration' in place so each 15-minute row holds only its own share.
    # Positional .iloc[row, col] is used to avoid chained assignment.
    col = df.columns.get_loc('duration')
    if len(df) == 1:
        return
    elif len(df) == 2:
        df.iloc[0, col] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1, col] = df.iloc[1, col] - df.iloc[0, col]
    elif len(df) > 2:
        df.iloc[0, col] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1:-1, col] = pd.Timedelta(15, 'min')
        df.iloc[-1, col] = df.iloc[-1, col] - df.iloc[:-1, col].sum()

def expandtime(x):
    # x is a one-row DataFrame; add a second row located at the end time
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    # one row per 15-minute bin between start and end
    resdf = gdf.resample('15min').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf

findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)

This code produces:

                      duration Attribute_A
  start                                   
0 2018-01-05 11:45:00 00:12:00         abc
  2018-01-05 12:00:00 00:10:00         abc
1 2018-05-04 09:00:00 00:08:00         def
2 2018-08-09 07:15:00 00:15:00         hij
  2018-08-09 07:30:00 00:15:00         hij
  2018-08-09 07:45:00 00:05:00         hij
3 2018-09-27 15:00:00 00:02:00         klm
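
The output above has a two-level index and timedelta durations, whereas the question asked for a flat frame with durations in minutes. A minimal sketch of that conversion follows. The small `findf` built here is only a hand-made stand-in for the first three rows of the real result (its first index level is named 'group' for convenience, while the real one is unnamed), and `flat` is an illustrative name.

```python
import pandas as pd

# Stand-in for the first rows of the grouped result shown above.
findf = pd.DataFrame({
    'group': [0, 0, 1],
    'start': pd.to_datetime(['2018-01-05 11:45:00',
                             '2018-01-05 12:00:00',
                             '2018-05-04 09:00:00']),
    'duration': pd.to_timedelta([12, 10, 8], unit='min'),
    'Attribute_A': ['abc', 'abc', 'def'],
}).set_index(['group', 'start'])

# Move 'start' back into a column, then drop the remaining group level.
flat = findf.reset_index(level='start').reset_index(drop=True)
# Convert the timedeltas to whole minutes.
flat['duration'] = flat['duration'] // pd.Timedelta(minutes=1)
print(flat)
```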

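Some explanation

expandtime is the first custom function. It takes a one-row DataFrame (because we assume the 'start' values are unique), builds a second row whose 'start' equals the first row's 'start' plus its duration, and then uses resample to sample the pair at 15-minute intervals. The values of all other columns are repeated.

calcduration does some arithmetic on the 'duration' column to compute the correct duration for each row.

Regarding python - Resampling a DataFrame into a new DataFrame while performing some extra operations, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56650656/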
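
To make this step concrete, here is a small sketch of what happens to the first sample entry before any duration arithmetic is applied. The names `row`, `end_row`, `pair` and `bins` are illustrative and do not come from the answer.

```python
import pandas as pd

# One entry: starts at 11:48 and lasts 22 minutes.
row = pd.DataFrame({
    'start': pd.to_datetime(['2018-01-05 11:48:00']),
    'duration': pd.to_timedelta([22], unit='min'),
    'Attribute_A': ['abc'],
})

# Second row placed at the end time (start + duration = 12:10).
end_row = row.copy()
end_row['start'] = end_row['start'] + end_row['duration']

# With both rows indexed by time, resample has a span to fill with 15-minute bins.
pair = pd.concat([row, end_row]).set_index('start')
bins = pair.resample('15min').nearest()
print(bins)
```

Both resulting rows (11:45 and 12:00) still carry the full 22-minute duration at this point, which is why a second function is needed to split it.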
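
The same arithmetic can also be written without groupby or resample. The sketch below is an alternative, not the method from the answer: it repeats each row once per 15-minute bin with NumPy and computes how much of the entry overlaps each bin. It assumes 'duration' holds whole minutes as integers (as in the question), and because rows are repeated by position it does not depend on 'start' being unique. The helper name `split_into_bins` and its `width` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def split_into_bins(df, width=15):
    """Hypothetical helper: split each row into `width`-minute bins.

    Assumes 'start' is datetime-like and 'duration' holds whole minutes.
    """
    start = pd.to_datetime(df['start'])
    first_bin = start.dt.floor(f'{width}min')
    # Whole minutes between the bin boundary and the actual start.
    offset = ((start - first_bin) // pd.Timedelta(minutes=1)).to_numpy()
    duration = df['duration'].to_numpy()
    # Bins touched by each entry: ceil((offset + duration) / width), at least 1.
    n_bins = np.maximum(1, -(-(offset + duration) // width))

    # Repeat every original row once per bin, by position (not by 'start').
    out = df.iloc[np.repeat(np.arange(len(df)), n_bins)].reset_index(drop=True)
    # k = 0, 1, 2, ... counts the bins within each original entry.
    k = np.arange(len(out)) - np.repeat(np.cumsum(n_bins) - n_bins, n_bins)
    off = np.repeat(offset, n_bins)
    dur = np.repeat(duration, n_bins)

    out['start'] = np.repeat(first_bin.to_numpy(), n_bins) + k * np.timedelta64(width, 'm')
    # Overlap of [off, off + dur) with bin k, i.e. [k * width, (k + 1) * width).
    out['duration'] = np.minimum((k + 1) * width, off + dur) - np.maximum(k * width, off)
    return out

testdf = pd.DataFrame({
    'start': ['2018-01-05 11:48:00', '2018-05-04 09:05:00',
              '2018-08-09 07:15:00', '2018-09-27 15:00:00'],
    'duration': [22, 8, 35, 2],
    'Attribute_A': ['abc', 'def', 'hij', 'klm'],
    'id': [1, 2, 3, 4],
})
testdf['start'] = pd.to_datetime(testdf['start'])
result = split_into_bins(testdf)
print(result)
```

For the first entry (offset 3, duration 22) this gives 15 - 3 = 12 minutes in the 11:45 bin and 22 - 12 = 10 minutes in the 12:00 bin, the same values calcduration produces.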
