nested - PySpark - Keeping null values when using collect_list


Reposted by: 行者123  Updated: 2023-12-04 20:29:46


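According to the accepted answer to "pyspark collect_set or collect_list with groupby", when you apply collect_list to a column, the null values in that column are dropped. I have checked, and this is true.

But in my case I need to keep the null values. How can I achieve this?

I did not find any information on such a variant of the collect_list function.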

Background context explaining why I want the nulls:

I have a DataFrame df as follows:

cId   |  eId  |  amount  |  city
1     |  2    |   20.0   |  Paris
1     |  2    |   30.0   |  Seoul
1     |  3    |   10.0   |  Phoenix
1     |  3    |   5.0    |  null

I want to write it to an Elasticsearch index with the following mapping:

"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested", 
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
 }

To conform to the nested mapping above, I transformed df so that for each combination of eId and cId I have an array of transactions, like this:

df_nested = df.groupBy('eId','cId').agg(collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)

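Saving df_nested as a JSON file gives these JSON records:

{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}

As you can see, for cId=1 and eId=3, one of my array elements, the one with amount=5.0, has no city attribute, because it was null in my original data (df). The nulls are removed when I use the collect_list function.

However, when I try to write df_nested to Elasticsearch with the above index, I get an error because of the schema mismatch. This is basically why I want to keep my nulls after applying the collect_list function.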

Best answer

    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f
    from pyspark.sql.functions import create_map, collect_list, lit, col

    app_name = "CollList"
    conf = SparkConf().setAppName(app_name)
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Sample data from the question; the last row has a null city
    df = spark.createDataFrame([[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"],
        [1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]],
        ["cId", "eId", "amount", "city"])
    print("Actual data")
    df.show(10, False)
```
Actual data
+---+---+------+-------+
|cId|eId|amount|city   |
+---+---+------+-------+
|1  |2  |20.0  |Paris  |
|1  |2  |30.0  |Seoul  |
|1  |3  |10.0  |Phoenix|
|1  |3  |5.0   |null   |
+---+---+------+-------+
```
    # Structs serialized with to_json lose their null fields
    value_cols = [c for c in df.columns if c not in ('cId', 'eId')]
    df1 = df.groupBy(f.col('city'))\
            .agg(f.collect_list(f.to_json(f.struct(*value_cols))).alias('newcol'))
    print("Collect List Data - Missing Null Columns in the list")
    df1.show(10, False)
```
Collect List Data - Missing Null Columns in the list
+-------+----------------------------------+
|city   |newcol                            |
+-------+----------------------------------+
|Phoenix|[{"amount":10.0,"city":"Phoenix"}]|
|null   |[{"amount":5.0}]                  |
|Paris  |[{"amount":20.0,"city":"Paris"}]  |
|Seoul  |[{"amount":30.0,"city":"Seoul"}]  |
+-------+----------------------------------+
```
    # Build alternating key/value arguments for create_map: lit(name), col(name), ...
    my_list = []
    for x in (c for c in df.columns if c not in ('cId', 'eId')):
        my_list.append(lit(x))
        my_list.append(col(x))

    # Serializing a map (instead of a struct) keeps entries whose value is null
    grp_by = ["eId", "cId"]
    df_nested = df.withColumn("transactions", create_map(my_list))\
                  .groupBy(grp_by)\
                  .agg(collect_list(f.to_json("transactions")).alias("transactions"))

    print("collect list after create_map")
    df_nested.show(10, False)
```
collect list after create_map
+---+---+--------------------------------------------------------------------+
|eId|cId|transactions                                                        |
+---+---+--------------------------------------------------------------------+
|2  |1  |[{"amount":"20.0","city":"Paris"}, {"amount":"30.0","city":"Seoul"}]|
|3  |1  |[{"amount":"10.0","city":"Phoenix"}, {"amount":"5.0","city":null}]  |
+---+---+--------------------------------------------------------------------+
```

Regarding nested - PySpark - keeping null values when using collect_list, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49395458/
