
apache-spark - How to filter Spark SQL by a nested array field (an array within an array)?


My Spark DataFrame schema:

|-- goods: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- brand_id: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- product_id: string (nullable = true)

I can filter the DataFrame by product_id:

select * from goodsInfo where array_contains(goods.product_id, 'f31ee3f8-9ba2-49cb-86e2-ceb44e34efd9')

But I cannot filter by brand_id, which is an array within an array.

I get an error when I try:

select * from goodsInfo where array_contains(goods.brand_id, '45c060b9-3645-49ad-86eb-65f3cd4e9081')

The error:

function array_contains should have been array followed by a value with same element type, but it's [array<array<string>>, string].; line 1 pos 45;
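
For context, goods.brand_id extracts one brand_id array per goods element, so the value being compared is array<array<string>> rather than array<string>. A minimal way to confirm this, assuming the goodsInfo view queried above:

spark.sql("select goods.brand_id from goodsInfo").printSchema()
// the printed schema shows brand_id as an array of arrays of strings,
// which matches the type reported in the error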

Can anyone help?

Thanks in advance.

Best Answer

An alternative approach, not covered in the other answers.

The code is self-explanatory:

import spark.implicits._ // needed for Seq(data).toDS()

val data =
  """
    |{
    |  "goods": [{
    |    "brand_id": ["brand1", "brand2", "brand3"],
    |    "product_id": "product1"
    |  }]
    |}
  """.stripMargin

val df = spark.read.json(Seq(data).toDS())
df.show(false)
df.printSchema()
df.createOrReplaceTempView("goodsInfo")

/**
* +--------------------------------------+
* |goods |
* +--------------------------------------+
* |[[[brand1, brand2, brand3], product1]]|
* +--------------------------------------+
*
* root
* |-- goods: array (nullable = true)
* | |-- element: struct (containsNull = true)
* | | |-- brand_id: array (nullable = true)
* | | | |-- element: string (containsNull = true)
* | | |-- product_id: string (nullable = true)
*/

// filter the DataFrame by product_id (goods.product_id yields array<string>, so array_contains applies directly)
spark.sql("select * from goodsInfo where array_contains(goods.product_id, 'product1')").show(false)

/**
* +--------------------------------------+
* |goods |
* +--------------------------------------+
* |[[[brand1, brand2, brand3], product1]]|
* +--------------------------------------+
*/
// filter the DataFrame by brand_id, which is an array within an array
// positive case
spark.sql("select * from goodsInfo where array_contains(flatten(goods.brand_id), 'brand3')")
.show(false)

/**
* +--------------------------------------+
* |goods |
* +--------------------------------------+
* |[[[brand1, brand2, brand3], product1]]|
* +--------------------------------------+
*/
// negative case
spark.sql("select * from goodsInfo where array_contains(flatten(goods.brand_id), 'brand4')")
.show(false)

/**
* +-----+
* |goods|
* +-----+
* +-----+
*/
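
For completeness, the same filter can also be written with the DataFrame API instead of SQL. A minimal sketch, assuming the df, spark, and goodsInfo view created above (flatten and the exists() higher-order function both require Spark 2.4+):

import org.apache.spark.sql.functions.{array_contains, col, flatten}

// DataFrame API equivalent of the flatten-based filter above
df.filter(array_contains(flatten(col("goods.brand_id")), "brand3"))
  .show(false)

// an alternative that avoids flatten: exists() tests each goods element's
// brand_id array directly
spark.sql(
  """select * from goodsInfo
    |where exists(goods, g -> array_contains(g.brand_id, 'brand3'))
    |""".stripMargin).show(false)

Both should return the same row as the positive case above.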

Regarding apache-spark - How to filter Spark SQL by a nested array field (an array within an array)?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62108794/
