
scala - How to check for empty values on a Spark DataFrame using a user-defined function

Reposted. Author: 行者123. Updated: 2023-12-02 09:07:19

Folks, I have this user-defined function to check whether a text field is empty:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "", "Mongo"),
  (1, "World", "sql"),
  (2, "", "")
).toDF("id", "text", "Source")

// Define a "regular" Scala function
val checkEmpty: String => Boolean = x => {
  var test = false
  if (x.isEmpty) {
    test = true
  }
  test
}

// Wrap it as a Spark UDF and apply it to the "text" column
val upper = udf(checkEmpty)
df.withColumn("isEmpty", upper('text)).show
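As an aside, the mutable `var` in `checkEmpty` is unnecessary: `String.isEmpty` already returns a `Boolean`, so the predicate collapses to a one-liner (plain Scala, verifiable without Spark):

```scala
// Equivalent, idiomatic form of the checkEmpty predicate above
val checkEmpty: String => Boolean = _.isEmpty

// Wrapping it as a UDF works exactly as before: udf(checkEmpty)
```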

This gives me the following DataFrame:

+---+-----+------+-------+
| id| text|Source|isEmpty|
+---+-----+------+-------+
| 0| | Mongo| true|
| 1|World| sql| false|
| 2| | | true|
+---+-----+------+-------+

How can I check every row for empty values and return a message like this:

id 0 has the text column with empty values
id 2 has the text,source column with empty values

Best Answer

A UDF that receives the columns to check as a Row can be used to collect the names of the empty columns. Rows that contain at least one empty column can then be selected:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{format_string, size, struct, udf}

// Map a Row to the names of its empty columns
val emptyColumnList = (r: Row) => r
  .toSeq
  .zipWithIndex
  .filter(_._1.toString().isEmpty)
  .map(pair => r.schema.fields(pair._2).name)

val emptyColumnListUDF = udf(emptyColumnList)

val columnsToCheck = Seq($"text", $"Source")
val result = df
  .withColumn("EmptyColumns", emptyColumnListUDF(struct(columnsToCheck: _*)))
  .where(size($"EmptyColumns") > 0)
  .select(format_string("id %s has the %s columns with empty values",
    $"id", $"EmptyColumns").alias("description"))

result.show(false)
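The heart of `emptyColumnList` is ordinary Scala collection code. A minimal sketch of the same zip-and-filter step on plain `Seq`s (hypothetical `names`/`values` standing in for `Row.schema` and `Row.toSeq`; no Spark required):

```scala
// Column names and one row's values, as the UDF would see them for id 0
val names  = Seq("text", "Source")
val values = Seq("", "Mongo")

// Pair each value with its index, keep the empty ones, map indices back to names
val emptyCols = values.zipWithIndex
  .filter(_._1.isEmpty)
  .map { case (_, i) => names(i) }
// emptyCols == Seq("text")
```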

Result:

+----------------------------------------------------+
|description |
+----------------------------------------------------+
|id 0 has the [text] columns with empty values |
|id 2 has the [text,Source] columns with empty values|
+----------------------------------------------------+
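`format_string` follows `java.lang.String.format` semantics, with the array column rendered inside brackets. Assuming a row id and its list of empty columns, the same description line can be built in plain Scala:

```scala
// Hypothetical inputs: one row's id and its empty-column names
val id = 0
val emptyColumns = Seq("text")

// Mirror of the format_string template used in the answer
val description =
  s"id $id has the [${emptyColumns.mkString(",")}] columns with empty values"
// description == "id 0 has the [text] columns with empty values"
```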

Regarding "scala - How to check for empty values on a Spark DataFrame using a user-defined function", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56497435/
