
scala - Find the difference of two columns in a Spark DataFrame and append it as a new column

Reposted · Author: 行者123 · Updated: 2023-12-04 13:23:49

Below is my code, which loads CSV data into a DataFrame, computes the difference between two columns, and appends the result as a new column using withColumn. The two columns whose difference I am computing are of type Double. Please help me figure out the following exception:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
* Created by Guest1 on 5/10/2017.
*/
object arith extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)

  val spark = SparkSession.builder().appName("Arithmetics")
    .config("spark.master", "local").getOrCreate()
  val df = spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("./Input/Arith.csv").persist()

  // df.printSchema()
  val sim = df("Average Total Payments") - df("Average Medicare Payments").show(5)
}

I am getting the following exception:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description, Total Discharges , Average Covered Charges , Average Total Payments , Average Medicare Payments);
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at arith$.delayedEndpoint$arith$1(arith.scala:19)
at arith$delayedInit$body.apply(arith.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at arith$.main(arith.scala:7)
at arith.main(arith.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Best Answer

There are several problems here.

First, if you look at the exception, it is essentially telling you that the DataFrame has no column named "Average Total Payments" (helpfully, it also lists the columns it does see). The column names read from the CSV appear to have an extra space at the end.
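One way to sidestep the mismatch is to trim the whitespace from every header before referencing any column. A minimal sketch, where the raw names are taken from the column list in the exception and the Spark call is shown only as a comment (it assumes a DataFrame named df):

```scala
object TrimHeaders extends App {
  // Raw header names as listed in the AnalysisException; note the trailing spaces.
  val rawColumns = Array(
    "Total Discharges ",
    "Average Covered Charges ",
    "Average Total Payments ",
    "Average Medicare Payments")

  // Strip leading/trailing whitespace from every name.
  val cleaned = rawColumns.map(_.trim)

  // With a real DataFrame the same idea is a single call:
  //   val tidy = df.toDF(df.columns.map(_.trim): _*)
  println(cleaned.mkString(","))
}
```

After this, df("Average Total Payments") resolves because the stored name no longer carries the trailing space.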

Second, df("Average Total Payments") and df("Average Medicare Payments") are Columns.

You are calling show on df("Average Medicare Payments"). show is not a member of Column (and on a DataFrame it returns Unit, so you could not do df("Average Total Payments") - df("Average Medicare Payments").show(5) anyway, because that would be Column - Unit).

What you want is to define a new column that is the difference of the two and add it to the DataFrame. Then select just that column and show it. For example:

val sim = df.withColumn("diff", df("Average Total Payments") - df("Average Medicare Payments"))
sim.select("diff").show(5)
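Alternatively, since the root cause is the trailing space in the CSV header, the original expression also resolves if the column name is written with the space included, exactly as it appears in the exception's column list (a sketch, not verified against the actual file):

```scala
// Note the trailing space inside "Average Total Payments " — it must match
// the header byte-for-byte, per the names shown in the AnalysisException.
val sim = df.withColumn("diff",
  df("Average Total Payments ") - df("Average Medicare Payments"))
sim.select("diff").show(5)
```

Trimming the headers up front (e.g. with toDF) is usually the more robust choice, since whitespace-laden names are easy to mistype elsewhere.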

On scala - finding the difference of two columns in a Spark DataFrame and appending it as a new column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43917836/
