Python DataFrame: iterate over rows (comparing values between them) and prepare a grouped output

Reposted. Author: 行者123. Updated: 2023-12-01 06:40:56

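I have a dataframe like the one below. I want to group the records by url and status, and split them into separate date ranges. Is there a more efficient way to do this?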

def transform_to_unique(df):
    test = []
    counter = 0

    # first row
    if df.loc[0, 'status'] != df.loc[1, 'status']:
        counter = counter + 1
    test.append(counter)

    for i in range(1, len(df)):

        # a new url resets the counter
        if df.loc[i-1, 'url'] != df.loc[i, 'url']:
            counter = 0

        # a status change starts a new group
        if df.loc[i-1, 'status'] != df.loc[i, 'status']:
            counter = counter + 1
        test.append(counter)

    df['test'] = pd.Series(test)

    return df

df = transform_to_unique(frame)

df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['min', 'max'])

[Image: output from script]

Here is the dataframe:

1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active

import pandas as pd
frame = pd.read_clipboard(sep=",", header=None)
frame.columns = ['url', 'date_scraped', 'status']
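
For readers who would rather not rely on the clipboard, the same frame can be built from an inline string. This is only a sketch: the variable name `raw` is an assumption introduced here, and the rows are the sample records from the question, one per line.

```python
import io

import pandas as pd

# Sample records from the question: url, date_scraped, status
raw = """\
1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active
"""

# header=None because the data has no header row; names= assigns the columns
frame = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["url", "date_scraped", "status"],
)
```

Reading from a string keeps the example reproducible, since `read_clipboard` depends on whatever happens to be in the clipboard at the time.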

Best answer

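I'm not sure I understood the purpose of the test column correctly, but is this what you are trying to achieve (based on the sample data you posted)?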

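import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# Next scrape date for the same url; 0 marks the last record of each url
# (casting NaN to int32 is undefined, so fill with 0 before the cast)
df["date_scraped_till"] = np.where(df["url"] == df["url"].shift(-1),
                                   df["date_scraped"].shift(-1), 0).astype(np.int32)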

Output:

     url  date_scraped    status  date_scraped_till
15   351      20190615  inactive           20190831
14   351      20190831    active           20190907
13   351      20190907    active           20190914
12   351      20190914  inactive           20190921
11   351      20190921  inactive           20190928
10   351      20190928  inactive           20191005
9    351      20191005    active           20191012
8    351      20191012    active           20191019
7    351      20191019    active           20191026
6    351      20191026    active           20191102
5    351      20191102    active           20191109
4    351      20191109    active                  0
1   1000      20191108  inactive           20191109
0   1000      20191109    active                  0
3   2000      20191101  inactive           20191109
2   2000      20191109    active                  0
16  3000      20200101    active                  0
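
As a possible variation on the same idea (not part of the original answer), the "next date within the same url" lookup can be written with `groupby(...).shift(-1)`, and pandas' nullable `Int64` dtype can keep the missing end dates as `<NA>` instead of 0. The sketch below uses a handful of rows from the question's sample data.

```python
import pandas as pd

# A few rows taken from the question's sample data
df = pd.DataFrame({
    "url": [1000, 1000, 351, 351, 351],
    "date_scraped": [20191109, 20191108, 20190907, 20190831, 20190615],
    "status": ["active", "inactive", "active", "active", "inactive"],
})

df = df.sort_values(["url", "date_scraped"]).reset_index(drop=True)

# Shift within each url, so the last row of a url never picks up
# a date that belongs to the next url; it stays missing instead.
df["date_scraped_till"] = (
    df.groupby("url")["date_scraped"].shift(-1).astype("Int64")
)

print(df)
```

Whether 0 or `<NA>` is the better marker for "no later record" depends on how the column is used downstream; `<NA>` avoids mistaking the placeholder for a real date.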

Edit

If you meant "collapse" rather than "split", then this should do the trick (it is basically a more efficient way of computing the test column):

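import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# 1 marks the first row of each run of identical (url, status); 0 continues the run
df["test"] = np.where((df["url"] == df["url"].shift(1)) & (df["status"] == df["status"].shift(1)), 0, 1)

# Number the run starts per (url, status), then carry each number forward over its run
# (replace(..., method='ffill') is deprecated in recent pandas, so mask and ffill instead)
df["test"] = df.groupby(["url", "status", "test"])["test"].cumsum()
df["test"] = df["test"].mask(df["test"] == 0).ffill().astype(int)

df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['max', 'min'])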

Output:

                          max       min
url  status   test
351  active   1      20190907  20190831
              2      20191109  20191005
     inactive 1      20190615  20190615
              2      20190928  20190914
1000 active   1      20191109  20191109
     inactive 1      20191108  20191108
2000 active   1      20191109  20191109
     inactive 1      20191101  20191101
3000 active   1      20200101  20200101
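
The run numbering can also be sketched without the replace-and-fill step: flag the rows where a new run starts, then take a cumulative sum of those flags per (url, status). This is an alternative formulation, not the answer's code; the names `new_run`, `first_seen` and `last_seen` are assumptions made for illustration, and the data is the 351 and 1000 rows from the question.

```python
import pandas as pd

# The url 351 and url 1000 rows from the question's sample data
df = pd.DataFrame({
    "url": [1000, 1000] + [351] * 12,
    "date_scraped": [
        20191109, 20191108,
        20191109, 20191102, 20191026, 20191019, 20191012, 20191005,
        20190928, 20190921, 20190914, 20190907, 20190831, 20190615,
    ],
    "status": ["active", "inactive"]
    + ["active"] * 6 + ["inactive"] * 3 + ["active"] * 2 + ["inactive"],
})

df = df.sort_values(["url", "date_scraped"]).reset_index(drop=True)

# True on the first row of every run: the url or the status differs
# from the previous row (the very first row compares against NaN).
new_run = (df["url"] != df["url"].shift()) | (df["status"] != df["status"].shift())

# Counting run starts within each (url, status) gives every row the
# number of the run it belongs to: 1 for the first run, 2 for the second, ...
df["test"] = new_run.astype(int).groupby([df["url"], df["status"]]).cumsum()

# Named aggregation keeps the column order explicit
df_g = df.groupby(["url", "status", "test"])["date_scraped"].agg(
    first_seen="min", last_seen="max"
)

print(df_g)
```

Because rows that continue a run contribute 0 to the cumulative sum, they inherit the number of the most recent run start for their (url, status) pair, so no forward fill is needed.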

This question, "Python DataFrame: iterate over rows (comparing values between them) and prepare a grouped output", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/59460167/
