pyspark: Interpolation of missing values in pyspark dataframe observed


Reposted by: 行者123 · Updated: 2023-12-02 11:07:58


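I am trying to clean up a time-series dataset with Spark. The dataset is not fully populated and is fairly large.

What I want to do is take the following dataset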

Group | TS          |  Value
____________________________
A     | 01-01-2018  |  1
A     | 01-02-2018  |  2
A     | 01-03-2018  |  
A     | 01-04-2018  |  
A     | 01-05-2018  |  5
A     | 01-06-2018  |  
A     | 01-07-2018  |  10
A     | 01-08-2018  |  11

and transform it into the following, with the missing values filled by linear interpolation

Group | TS          |  Value
____________________________
A     | 01-01-2018  |  1
A     | 01-02-2018  |  2
A     | 01-03-2018  |  3
A     | 01-04-2018  |  4
A     | 01-05-2018  |  5
A     | 01-06-2018  |  7.5
A     | 01-07-2018  |  10
A     | 01-08-2018  |  11

Any help would be appreciated.

Best answer

After chatting with @ndricca, I updated the code with @leo's suggestions.

First, create the DataFrame:

from pyspark.sql import functions as F
from pyspark.sql import Window

data = [
    ("A","01-01-2018",1),
    ("A","01-02-2018",2),
    ("A","01-03-2018",None),
    ("A","01-04-2018",None),
    ("A","01-05-2018",5),
    ("A","01-06-2018",None),
    ("A","01-07-2018",10),
    ("A","01-08-2018",11)
]
df = spark.createDataFrame(data,['Group','TS','Value'])
df = df.withColumn('TS',F.unix_timestamp('TS','MM-dd-yyyy').cast('timestamp'))

Next, the updated function:

def fill_linear_interpolation(df,id_cols,order_col,value_col):
    """
    Apply linear interpolation to dataframe to fill gaps.

    :param df: spark dataframe
    :param id_cols: string or list of column names to partition by the window function
    :param order_col: column to use to order by the window function
    :param value_col: column to be filled

    :returns: spark dataframe updated with interpolated values
    """
    # normalize id_cols to a list before it is used to build the windows
    if not isinstance(id_cols, list):
        id_cols = [id_cols]

    # create row number over window and a column with row number only for non missing values
    w = Window.partitionBy(id_cols).orderBy(order_col)
    new_df = df.withColumn('rn',F.row_number().over(w))
    new_df = new_df.withColumn('rn_not_null',F.when(F.col(value_col).isNotNull(),F.col('rn')))

    # create relative references to the start value (last value not missing)
    w_start = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(Window.unboundedPreceding,-1)
    new_df = new_df.withColumn('start_val',F.last(value_col,ignorenulls=True).over(w_start))
    new_df = new_df.withColumn('start_rn',F.last('rn_not_null',ignorenulls=True).over(w_start))

    # create relative references to the end value (first value not missing)
    w_end = Window.partitionBy(id_cols).orderBy(order_col).rowsBetween(0,Window.unboundedFollowing)
    new_df = new_df.withColumn('end_val',F.first(value_col,ignorenulls=True).over(w_end))
    new_df = new_df.withColumn('end_rn',F.first('rn_not_null',ignorenulls=True).over(w_end))

    # create references to gap length and current gap position
    new_df = new_df.withColumn('diff_rn',F.col('end_rn')-F.col('start_rn'))
    new_df = new_df.withColumn('curr_rn',F.col('diff_rn')-(F.col('end_rn')-F.col('rn')))

    # calculate linear interpolation value
    lin_interp_func = (F.col('start_val')+(F.col('end_val')-F.col('start_val'))/F.col('diff_rn')*F.col('curr_rn'))
    new_df = new_df.withColumn(value_col,F.when(F.col(value_col).isNull(),lin_interp_func).otherwise(F.col(value_col)))

    new_df = new_df.drop('rn', 'rn_not_null', 'start_val', 'end_val', 'start_rn', 'end_rn', 'diff_rn', 'curr_rn')
    return new_df
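
To sanity-check the logic on a small sample without a Spark session, the same row-position formula can be sketched in plain Python. This is only an illustrative sketch, not part of the original answer: the helper name `interpolate_by_position` is hypothetical, and it assumes the values are already sorted by the ordering column within a single group.

```python
def interpolate_by_position(values):
    """Fill None gaps linearly by row position (hypothetical helper).

    Mirrors the window logic of the Spark function: for each missing value,
    take the last non-missing value before it and the first non-missing
    value after it, then interpolate by row index.
    """
    n = len(values)

    # index of the last non-missing value strictly before each position
    prev_idx = [None] * n
    last = None
    for i, v in enumerate(values):
        prev_idx[i] = last
        if v is not None:
            last = i

    # index of the first non-missing value at or after each position
    next_idx = [None] * n
    nxt = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            nxt = i
        next_idx[i] = nxt

    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
        elif prev_idx[i] is None or next_idx[i] is None:
            # no anchor on one side: the gap cannot be interpolated
            filled.append(None)
        else:
            s, e = prev_idx[i], next_idx[i]
            filled.append(values[s] + (values[e] - values[s]) / (e - s) * (i - s))
    return filled


sample = [1, 2, None, None, 5, None, 10, 11]
print(interpolate_by_position(sample))
```

For the sample from the question, the gap between 5 and 10 is filled with 5 + (10 - 5) / 2 * 1 = 7.5. Gaps at the very start or end of a group have no anchor on one side and stay empty, which is also how the window expressions behave, because `start_val` or `end_val` is null there. Note that the interpolation works on row positions, so it treats the rows as evenly spaced in time.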

Then run the function on our DataFrame:

new_df = fill_linear_interpolation(df=df,id_cols='Group',order_col='TS',value_col='Value')

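I also checked it on the DataFrame from my own post; there you have to create the additional group column first.

Regarding "pyspark: Interpolation of missing values in pyspark dataframe observed", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53077639/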
