java - Using Spring together with Spark


Reposted by: IT老高 | Updated: 2023-10-28 13:49:34

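I am developing a Spark application and I am used to Spring as a dependency injection framework. Now I am stuck on a problem: the processing part uses Spring's @Autowired feature, but the object is serialized and deserialized by Spark.

So the following code gets me into trouble: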

Processor processor = ...; // This is a Spring constructed object
                           // and makes all the trouble
JavaRDD<Txn> rdd = ...; // some data for Spark
rdd.foreachPartition(processor);

The processor looks like this:

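import java.io.Serializable;
import java.util.Iterator;

import org.apache.spark.api.java.function.VoidFunction;
import org.springframework.beans.factory.annotation.Autowired;

public class Processor implements VoidFunction<Iterator<Txn>>, Serializable {
    private static final long serialVersionUID = 1L;

    @Autowired // This will not work if the object is deserialized
    private transient DatabaseConnection db;

    @Override
    public void call(Iterator<Txn> txns) {
        ... // do some fancy stuff
        db.store(txns);
    }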
}
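
The failure can be reproduced without Spring or Spark. The sketch below assumes the function object is shipped with plain Java serialization and uses a hypothetical `Task` class as a stand-in for `Processor`. A transient field is skipped when the object is written, and nothing repopulates it on the receiving side, so it comes back as null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientFieldDemo {

    // Hypothetical stand-in for the Processor above.
    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;

        // Stand-in for the injected DatabaseConnection; transient fields are never serialized.
        transient Object db;

        Task(Object db) {
            this.db = db;
        }
    }

    // Serializes and deserializes a value in memory, similar in spirit to shipping it to another JVM.
    static <T extends Serializable> T roundTrip(T value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        Task original = new Task(new Object());
        Task copy = roundTrip(original);
        System.out.println("original.db == null: " + (original.db == null)); // false
        System.out.println("copy.db == null: " + (copy.db == null));         // true
    }
}
```

The deserialized copy is never seen by the Spring container, so no injection happens and `db.store(...)` would throw a NullPointerException.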

So my question is: is it possible to use Spring in combination with Spark? If not, what is the most elegant way to do something like this? Any help is appreciated!

Best answer

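From the question asker, added: to hook into the deserialization step directly without modifying your own classes, use the spring-spark project by parapluplu. This project uses Spring to autowire your bean when it gets deserialized.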
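
The general idea of re-injecting on deserialization can be sketched with JDK serialization alone. This is not the spring-spark project's code, and unlike that project it puts the hook inside the class itself. The `REGISTRY` map is a hypothetical stand-in for a Spring context available in each JVM, and `Task` is a hypothetical stand-in for `Processor`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReinjectOnDeserialize {

    // Hypothetical stand-in for a Spring ApplicationContext that exists in every JVM.
    static final Map<String, Object> REGISTRY = new ConcurrentHashMap<>();

    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;

        // Not serialized; looked up again after deserialization.
        transient Object db;

        Task() {
            inject();
        }

        // With Spring, this is where one might call
        // AutowireCapableBeanFactory.autowireBean(this) instead (assumption, not shown here).
        private void inject() {
            this.db = REGISTRY.get("db");
        }

        // Java serialization calls this hook while reading the object back.
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            inject();
        }
    }

    // Serializes and deserializes a value in memory.
    static <T extends Serializable> T roundTrip(T value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        REGISTRY.put("db", "connection");
        Task copy = roundTrip(new Task());
        System.out.println(copy.db); // prints "connection"
    }
}
```

For this to work on a cluster, the registry (or Spring context) has to be initialized in each executor JVM before the first task is deserialized there.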

EDIT:

To use Spark, you need the following setup (see also this repository):

Spring Boot + Spark:

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.2.RELEASE</version>
    <relativePath/>
    <!-- lookup parent from repository -->
</parent>

...

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <exclusions>
            <exclusion>
                <groupId>ch.qos.logback</groupId>
                <artifactId>logback-classic</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
            <exclusion>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

    <!-- fix java.lang.ClassNotFoundException: org.codehaus.commons.compiler.UncheckedCompileException -->
    <dependency>
        <groupId>org.codehaus.janino</groupId>
        <artifactId>commons-compiler</artifactId>
        <version>2.7.8</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.slf4j/log4j-over-slf4j -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>log4j-over-slf4j</artifactId>
        <version>1.7.25</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.6.4</version>
    </dependency>

</dependencies>

Then you need the application class, as usual with Spring Boot:

@SpringBootApplication
public class SparkExperimentApplication {

    public static void main(String[] args) {
        SpringApplication.run(SparkExperimentApplication.class, args);
    }
}

Then a configuration that binds it all together:

@Configuration
@PropertySource("classpath:application.properties")
public class ApplicationConfig {

    @Autowired
    private Environment env;

    @Value("${app.name:jigsaw}")
    private String appName;

    @Value("${spark.home}")
    private String sparkHome;

    @Value("${master.uri:local}")
    private String masterUri;

    @Bean
    public SparkConf sparkConf() {
        SparkConf sparkConf = new SparkConf()
                .setAppName(appName)
                .setSparkHome(sparkHome)
                .setMaster(masterUri);

        return sparkConf;
    }

    @Bean
    public JavaSparkContext javaSparkContext() {
        return new JavaSparkContext(sparkConf());
    }

    @Bean
    public SparkSession sparkSession() {
        return SparkSession
                .builder()
                .sparkContext(javaSparkContext().sc())
                .appName("Java Spark SQL basic example")
                .getOrCreate();
    }

    @Bean
    public static PropertySourcesPlaceholderConfigurer propertySourcesPlaceholderConfigurer() {
        return new PropertySourcesPlaceholderConfigurer();
    }
}
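
The configuration above reads three properties from `application.properties`. `app.name` and `master.uri` have defaults, but `spark.home` does not, so the placeholder must resolve or the context will fail to start. A minimal fragment might look like the following; the values, in particular the `/opt/spark` path, are illustrative assumptions and not taken from the original answer.

```properties
# application.properties (illustrative values, adjust to your environment)
app.name=jigsaw
# "local" is the default used by ApplicationConfig; local[*] uses all local cores
master.uri=local[*]
# Hypothetical install location; no default is defined for this key
spark.home=/opt/spark
```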

Then you can use the SparkSession class to communicate with Spark SQL:

/**
 * Created by achat1 on 9/23/15.
 * Just an example to see if it works.
 */
@Component
public class WordCount {
    @Autowired
    private SparkSession sparkSession;

    public List<Count> count() {
        String input = "hello world hello hello hello";
        String[] _words = input.split(" ");
        List<Word> words = Arrays.stream(_words).map(Word::new).collect(Collectors.toList());
        Dataset<Row> dataFrame = sparkSession.createDataFrame(words, Word.class);
        dataFrame.show();
        //StructType structType = dataFrame.schema();

        // Requires: import static org.apache.spark.sql.functions.col;
        RelationalGroupedDataset groupedDataset = dataFrame.groupBy(col("word"));
        groupedDataset.count().show();
        List<Row> rows = groupedDataset.count().collectAsList();
        // Map each (word, count) row to a Count bean.
        return rows.stream()
                .map(row -> new Count(row.getString(0), row.getLong(1)))
                .collect(Collectors.toList());
    }
}

These two classes are referenced:

public class Word {
    private String word;

    public Word() {
    }

    public Word(String word) {
        this.word = word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public String getWord() {
        return word;
    }
}

public class Count {
    private String word;
    private long count;

    public Count() {
    }

    public Count(String word, long count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }
}

Then you can run it and check that it returns the right data:

@RequestMapping("api")
@Controller
public class ApiController {
    @Autowired
    WordCount wordCount;

    @RequestMapping("wordcount")
    public ResponseEntity<List<Count>> words() {
        return new ResponseEntity<>(wordCount.count(), HttpStatus.OK);
    }
}

The response is:

[{"word":"hello","count":4},{"word":"world","count":1}]
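
To sanity-check these counts without starting a Spark session, the same grouping can be written with plain Java streams. This is not the Spark API, only an equivalent computation on the same hard-coded input; the class name `PlainWordCount` is made up for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PlainWordCount {

    // Splits on single spaces and counts occurrences of each word, sorted by word.
    static Map<String, Long> count(String input) {
        return Arrays.stream(input.split(" "))
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("hello world hello hello hello")); // prints {hello=4, world=1}
    }
}
```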

Regarding "java - Using Spring together with Spark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30053449/
