I am trying to append the Chicago Crime data in HDFS [/tmp/chicago_test_load/chicago_crimes_01_present.csv] to a Hive table on the Hortonworks Sandbox, using Spark on an Oracle VM.
Last login: Mon Feb 18 06:27:47 2019 from 172.18.0.3
[maria_dev@sandbox-hdp ~]$ spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://sandbox-hdp.hortonworks.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1550484326924).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
val spark = SparkSession.builder().appName("StatsAnalyzer").enableHiveSupport().config("hive.exec.dynamic.partition","true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
val sqlContext = new HiveContext(sc)
var dr = sqlContext.read.format("com.databricks.spark.csv").option("delimeter", ",")
var df = dr.load("/tmp/chicago_test_load/chicago_crimes_01_present.csv")
var header = df.first()
df = df.filter(row => row != header)
df.show()
When df.show() is called to inspect the DataFrame in detail, the error below appears.
scala> df.show()
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:843)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:843)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:337)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 51 elided
Caused by: java.io.NotSerializableException: org.apache.spark.sql.DataFrameReader
Serialization stack:
- object not serializable (class: org.apache.spark.sql.DataFrameReader, value: org.apache.spark.sql.DataFrameReader@8dedec8)
- field (class: $iw, name: dr, type: class org.apache.spark.sql.DataFrameReader)
- object (class $iw, $iw@485c8d5e)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@8dc73c7)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@363502a4)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6112390a)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@2f541378)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@7d6e3e42)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@5fde8cf4)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@44556fbd)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@4d36c8cb)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@f03dc9b)
- field (class: $line19.$read, name: $iw, type: class $iw)
- object (class $line19.$read, $line19.$read@772ccecf)
- field (class: $iw, name: $line19$read, type: class $line19.$read)
- object (class $iw, $iw@41c02b8c)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@520dc647)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@6e52877d)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@4273c8be)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@62c63900)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@17a364)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@19b0f61c)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@b230f87)
- field (class: $iw, name: $iw, type: class $iw)
- object (class $iw, $iw@300788d6)
- field (class: $line20.$read, name: $iw, type: class $iw)
- object (class $line20.$read, $line20.$read@1eb9ab8f)
- field (class: $iw, name: $line20$read, type: class $line20.$read)
- object (class $iw, $iw@4ce1292a)
- field (class: $iw, name: $outer, type: class $iw)
- object (class $iw, $iw@1177049c)
- field (class: $anonfun$1, name: $outer, type: class $iw)
- object (class $anonfun$1, <function1>)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 4)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
... 82 more
Best Answer
You cannot do it that way: referencing header inside the filter closure drags the surrounding REPL objects, including the non-serializable DataFrameReader held in dr, out to the workers.
There is also a caveat with that approach: what happens if there are multiple files in the path?
Try this approach instead, which I think is far easier, and use val, not var:
// Let Spark strip the header itself; rename columns as needed
val dfother = spark.read.option("header", "true").csv("file path").toDF("col1")
dfother.show(false)
// Or read as plain text and drop the first line of each partition
var ds = spark.read.text("/FileStore/tables/fff02.txt", "/FileStore/tables/fff01.txt").as[String].mapPartitions(_.drop(1))//.toDF
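For the original goal of appending the CSV into a Hive table with dynamic partitioning, a minimal sketch along the same lines could look like the code below; the table name default.chicago_crimes is an assumed placeholder, not something given in the question:

// Read the CSV and let Spark consume the header line itself
val crimes = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/chicago_test_load/chicago_crimes_01_present.csv")
// Append into an existing Hive table (columns are matched by position);
// this relies on enableHiveSupport() and the dynamic-partition settings shown in the question
crimes.write.mode("append").insertInto("default.chicago_crimes")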
Regarding java - Spark-Shell: org.apache.spark.SparkException: Task not serializable, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54755842/