gpt4 book ai didi

python - Spark - 创建嵌套数据框

转载 作者:太空狗 更新时间:2023-10-29 17:44:43 25 4
gpt4 key购买 nike

我刚开始使用 PySpark,但在使用嵌套对象创建 DataFrame 时遇到了麻烦。

这是我的例子。

我有用户。

$ cat user.json
{"id":1,"name":"UserA"}
{"id":2,"name":"UserB"}

用户有订单。

$ cat order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}

我喜欢加入它以获得这样一个结构,其中订单数组嵌套在用户中。

$ cat join.json
{"id":1, "name":"UserA", "orders":[{"id":1,"price":202.30,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}

我该怎么做?是否有任何类型的嵌套连接或类似的东西?

>>> user = sqlContext.read.json("user.json")
>>> user.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)

>>> order = sqlContext.read.json("order.json")
>>> order.printSchema();
root
|-- id: long (nullable = true)
|-- price: double (nullable = true)
|-- userid: long (nullable = true)

>>> joined = sqlContext.read.json("join.json")
>>> joined.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)

编辑:我知道可以使用 join 和 foldByKey 来做到这一点,但是有没有更简单的方法?

编辑2:我正在使用@zero323 的解决方案

def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn")

我添加第二个嵌套结构“行”

>>> lines =  sqlContext.read.json(path + "lines.json")
>>> lines.printSchema();
root
|-- id: long (nullable = true)
|-- orderid: long (nullable = true)
|-- product: string (nullable = true)

orders = joinTable(order, lines, "id", "orderid", "lines")
joined = joinTable(user, orders, "id", "userid", "orders")
joined.printSchema()

root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
| | |-- lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _1: long (nullable = true)
| | | | |-- _2: long (nullable = true)
| | | | |-- _3: string (nullable = true)

在此之后,行中的列名将丢失。有什么想法吗?

编辑 3:我尝试手动指定架构。

from pyspark.sql.types import *
fields = []
fields.append(StructField("_1", LongType(), True))
inner = ArrayType(lines.schema)
fields.append(StructField("_2", inner))
new_schema = StructType(fields)
print new_schema

grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], list(x[1])))
g = sqlCtx.createDataFrame(grouped, new_schema)

错误:

TypeError: StructType(List(StructField(id,LongType,true),StructField(orderid,LongType,true),StructField(product,StringType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>

最佳答案

这仅适用于 Spark 2.0 或更高版本

首先我们需要一些导入:

from pyspark.sql.functions import struct, collect_list

剩下的就是一个简单的聚合和连接:

orders = spark.read.json("/path/to/order.json")
users = spark.read.json("/path/to/user.json")

combined = users.join(
orders
.groupBy("userId")
.agg(collect_list(struct(*orders.columns)).alias("orders"))
.withColumnRenamed("userId", "id"), ["id"])

对于示例数据,结果是:

combined.show(2, False)
+---+-----+---------------------------+
|id |name |orders |
+---+-----+---------------------------+
|1 |UserA|[[1,202.3,1], [2,343.99,1]]|
|2 |UserB|[[3,399.99,2]] |
+---+-----+---------------------------+

有模式:

combined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)

和 JSON 表示:

for x in combined.toJSON().collect():
print(x)
{"id":1,"name":"UserA","orders":[{"id":1,"price":202.3,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}

关于python - Spark - 创建嵌套数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31919371/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com