java - Converting CSV to JSON for a pair RDD in Scala Spark


I have CSV data. I first want to convert it to JSON, and then convert that to a pair RDD.

I was able to do both, but I am not sure whether this approach is efficient, and the key is not in the format I expect.


val df = ??? // somehow read the CSV data into a DataFrame
val dataset = df.toJSON // this gives the expected JSON
val pairRDD = dataset.rdd.map(record =>
  (JSON.parseFull(record).get.asInstanceOf[Map[String, String]].get("hashKey"), record))
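
For reference, a minimal sketch of the CSV-reading step that the snippet above elides; the file path "data.csv", the header/inferSchema options, and the SparkSession setup are assumptions for illustration, not part of the original question (and scala.util.parsing.json has been deprecated since Scala 2.11, so this assumes an older Scala version where it is still available):

import org.apache.spark.sql.SparkSession
import scala.util.parsing.json.JSON

// Build a session and read the CSV into a DataFrame; "data.csv" is a
// hypothetical path, and inferSchema makes score a numeric column.
val spark = SparkSession.builder().appName("CsvToPairRdd").master("local[*]").getOrCreate()
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")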

Assume my schema is:


root
|-- hashKey: string (nullable = true)
|-- sortKey: string (nullable = true)
|-- score: number (nullable = true)
|-- payload: string (nullable = true)
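
If schema inference is not wanted, the same schema can be declared explicitly before reading. A sketch using Spark's StructType API; mapping "number" to DoubleType is an assumption based on the 1.0 sample values below:

import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// Explicit schema mirroring the printSchema output above.
val schema = StructType(Seq(
  StructField("hashKey", StringType, nullable = true),
  StructField("sortKey", StringType, nullable = true),
  StructField("score", DoubleType, nullable = true),
  StructField("payload", StringType, nullable = true)
))
val dfWithSchema = spark.read.option("header", "true").schema(schema).csv("data.csv")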


In JSON:

{
  "hashKey" : "h1",
  "sortKey" : "s1",
  "score" : 1.0,
  "payload" : "data"
}
{
  "hashKey" : "h2",
  "sortKey" : "s2",
  "score" : 1.0,
  "payload" : "data"
}

The EXPECTED result should be:

[h1, {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[h2, {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]


The ACTUAL result I am getting:

[Some(h1), {"hashKey" : "h1", "sortKey" : "s1", "score" : 1.0, "payload" : "data"}]
[Some(h2), {"hashKey" : "h2", "sortKey" : "s2", "score" : 1.0, "payload" : "data"}]

How can I fix this?

Best Answer

This happens because of get("hashKey"): Scala's Map#get returns an Option, which is why each key comes back wrapped in Some(...). Change it to getOrElse("hashKey", "{defaultKey}"), where the default key can be "" or a constant you declared earlier.
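
A quick illustration of the difference, using a plain Map in place of a parsed record:

// Map#get returns an Option, which is what produced the Some(...) wrappers:
val json = Map("hashKey" -> "h1", "sortKey" -> "s1")
json.get("hashKey")           // Some(h1) -- wrapped in an Option
json.getOrElse("hashKey", "") // h1       -- the unwrapped value
json.getOrElse("missing", "") // ""       -- the default when the key is absent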

Update, a safer and more idiomatic Scala approach (instead of casting with asInstanceOf):

It is better to change the JSON parsing to:

// flatMap over the Option returned by parseFull drops records that fail
// to parse, and the filter drops records without a usable key, leaving
// a proper pair RDD of (hashKey, jsonRecord).
val pairRDD = dataset.rdd.flatMap(record => JSON.parseFull(record).map {
  case json: Map[String, String] => (json.getOrElse("hashKey", ""), record)
  case _ => ("", "")
}.filter { case (key, rec) => key != "" && rec != "" })
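
To sanity-check the result, a small usage sketch (the take/println inspection is illustrative, not part of the original answer):

// Inspect the first two (hashKey, jsonRecord) pairs:
pairRDD.take(2).foreach { case (key, json) => println(s"[$key, $json]") }
// [h1, {"hashKey":"h1","sortKey":"s1","score":1.0,"payload":"data"}]
// [h2, {"hashKey":"h2","sortKey":"s2","score":1.0,"payload":"data"}]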

Regarding java - converting CSV to JSON for a pair RDD in Scala Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56221504/
