java - Using the SparkContext Hadoop configuration within RDD methods/closures, like foreachPartition

Reposted by: 可可西里. Updated: 2023-11-01 14:17:38

I am using Spark to read a bunch of files, process them, and then save all of them as sequence files. I want one sequence file per partition, so I did this:

SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
        .setMaster("local[2]")
        .set("spark.streaming.stopGracefullyOnShutdown", "true");
final JavaSparkContext jsc = new JavaSparkContext(sparkConf);
jsc.hadoopConfiguration().addResource(hdfsConfPath + "hdfs-site.xml");
jsc.hadoopConfiguration().addResource(hdfsConfPath + "core-site.xml");
//JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));

JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(sourcePath);
if (!imageByteRDD.isEmpty())
    imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String, PortableDataStream>>>() {

        @Override
        public void call(Iterator<Tuple2<String, PortableDataStream>> arg0)
                throws Exception {
            // [°°°SOME STUFF°°°]
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    jsc.hadoopConfiguration(),
                    // here lies the problem: how to pass the hadoopConfiguration I have put inside the Spark Context?

Previously, I created a new Configuration for each partition, and that works, but I'm sure there is a much more "sparky" way.

Does anyone know how to use the Hadoop Configuration object inside RDD closures?

Best answer

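The problem here is that Hadoop Configurations are not tagged as Serializable, so Spark cannot ship them inside RDD closures. They are marked as Writable, so Hadoop's own serialization mechanism can marshal and unmarshal them, but Spark does not work with that mechanism directly.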
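To make the failure concrete, here is a minimal sketch using only the JDK. `PlainConf` and `CapturingTask` are hypothetical stand-ins (not Hadoop or Spark classes): the first plays the role of a Configuration that is not `Serializable`, the second plays the role of a closure that captures it. It assumes plain Java serialization is what gets applied to the closure.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Hadoop Configuration:
// it holds key/value pairs but does NOT implement java.io.Serializable.
class PlainConf {
    final Map<String, String> props = new HashMap<>();
}

// Hypothetical stand-in for a closure: serializable itself,
// but it captures a PlainConf in a non-transient field.
class CapturingTask implements Serializable {
    private static final long serialVersionUID = 1L;
    final PlainConf conf;

    CapturingTask(PlainConf conf) {
        this.conf = conf;
    }
}

class CaptureDemo {
    // Returns true if Java serialization rejects the object graph.
    static boolean failsToSerialize(Object o) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
        try {
            out.writeObject(o);
            return false;
        } catch (NotSerializableException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // The captured PlainConf makes the whole task unserializable.
        System.out.println(failsToSerialize(new CapturingTask(new PlainConf())));
    }
}
```

The task class itself is fine; it is the captured field that breaks serialization, which is the same shape as capturing `jsc.hadoopConfiguration()` inside `foreachPartition`.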

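The long-term fix options would be:

Add support for serializing Writables in Spark. Maybe SPARK-2421?
Make Hadoop Configuration Serializable.
Add explicit support for serializing Hadoop Configurations.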

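You are not going to hit any major objections to making Hadoop Configuration serializable, provided you implement custom ser/deser methods that delegate to the Writable IO calls (and that just iterate through all key/value pairs). I say that as a Hadoop committer.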
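The delegation idea can be sketched with the JDK alone. `MiniConf` below is a hypothetical stand-in for a Writable configuration (it is not Hadoop's class): it is not `Serializable`, but it can write its key/value pairs to a `DataOutput` and read them back from a `DataInput`. The wrapper's `writeObject`/`readObject` hooks simply forward to those calls, which works because `ObjectOutputStream` and `ObjectInputStream` implement `DataOutput` and `DataInput`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable configuration (not Serializable).
class MiniConf {
    private final Map<String, String> props = new LinkedHashMap<>();

    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }

    // Writable-style output: entry count, then each key/value pair.
    void write(DataOutput out) throws IOException {
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Writable-style input: mirror image of write().
    void readFields(DataInput in) throws IOException {
        props.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            props.put(key, value);
        }
    }
}

// Serializable wrapper whose hooks delegate to the Writable-style calls.
class SerializableMiniConf implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient MiniConf conf; // transient: never default-serialized

    SerializableMiniConf(MiniConf conf) { this.conf = conf; }
    MiniConf get() { return conf; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        conf.write(out);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        conf = new MiniConf();
        conf.readFields(in);
    }

    // Helper for the sketch: serialize to bytes and read back a copy.
    static SerializableMiniConf roundTrip(SerializableMiniConf s)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(s);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializableMiniConf) in.readObject();
        }
    }
}
```

A round trip through `roundTrip` yields a distinct `MiniConf` instance holding the same key/value pairs, which is what a task on another JVM would see after deserializing the wrapper.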

Update: here is the code for a serializable class that marshals the contents of a Hadoop configuration. Create it with val ser = new ConfigSerDeser(hadoopConf); refer to it in your RDD as ser.get().

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.conf.Configuration

/**
 * Class to make Hadoop configurations serializable; uses the
 * `Writable` operations to do this.
 * Note: this only serializes the explicitly set values, not any set
 * in site/default or other XML resources.
 * @param conf the Hadoop configuration to wrap
 */
class ConfigSerDeser(var conf: Configuration) extends Serializable {

  // No-arg constructor: wraps a fresh default Configuration.
  def this() = this(new Configuration())

  def get(): Configuration = conf

  // Java serialization hook: delegate to Configuration.write (Writable).
  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    conf.write(out)
  }

  // Java serialization hook: rebuild the Configuration from the stream.
  private def readObject(in: java.io.ObjectInputStream): Unit = {
    conf = new Configuration()
    conf.readFields(in)
  }

  // Called when the stream carries no data for this class.
  private def readObjectNoData(): Unit = {
    conf = new Configuration()
  }
}

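Note that it would be relatively straightforward for someone to make this generic for all Writable classes; you would just need to provide a class name in the constructor and use it to instantiate the Writable during deserialization.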
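One possible shape for that generic wrapper, sketched with the JDK only. `MiniWritable` is a hypothetical stand-in for Hadoop's Writable interface and `MiniText` is a made-up payload; the sketch assumes every wrapped class has a no-arg constructor, since that is what reflection uses to instantiate it during deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Hadoop's Writable interface.
interface MiniWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Made-up payload with the no-arg constructor that reflection needs.
class MiniText implements MiniWritable {
    String value = "";

    public MiniText() {}
    MiniText(String value) { this.value = value; }

    public void write(DataOutput out) throws IOException { out.writeUTF(value); }
    public void readFields(DataInput in) throws IOException { value = in.readUTF(); }
}

// Generic serializable holder: the class name travels with the data.
class WritableHolder<T extends MiniWritable> implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String className; // written by defaultWriteObject
    private transient T value;      // written via the Writable-style calls

    WritableHolder(String className, T value) {
        this.className = className;
        this.value = value;
    }

    T get() { return value; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // stores className
        value.write(out);
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();   // restores className
        try {
            // Instantiate an empty object by name, then let it read its fields.
            value = (T) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IOException("cannot instantiate " + className, e);
        }
        value.readFields(in);
    }

    // Helper for the sketch: serialize to bytes and read back a copy.
    @SuppressWarnings("unchecked")
    static <T extends MiniWritable> WritableHolder<T> roundTrip(WritableHolder<T> h)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(h);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (WritableHolder<T>) in.readObject();
        }
    }
}
```

Passing `MiniText.class.getName()` as the class name keeps the holder independent of any particular payload type; a real version against Hadoop would swap `MiniWritable` for the actual Writable interface.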

Source: this question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/38224132/
