java - Spark 将每个 Action 执行两次-6ren

java - Spark 将每个 Action 执行两次

转载作者：搜寻专家更新时间：2023-11-01 03:01:15

25

4

我创建了一个简单的 Java 应用程序，它使用 Apache Spark 从 Cassandra 检索数据，对其进行一些转换并将其保存在另一个 Cassandra 表中。

我使用的是 Apache Spark 1.4.1，配置为独立集群模式，在我的机器上有一个主机和一个从机。

DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM customer " +
    "WHERE CAST(store_id as string) = '" + storeId + "'");

DataFrame customersWhoOrderedTheProduct = sqlContext.cassandraSql("SELECT email FROM customer_bought_product " +
    "WHERE CAST(store_id as string) = '" + storeId + "' AND product_id = " + productId + "");

// We need only the customers who did not order the product
// We cache the DataFrame because we use it twice.
DataFrame customersWhoHaventOrderedTheProduct = customers
    .join(customersWhoOrderedTheProduct
    .select(customersWhoOrderedTheProduct.col("email")), customers.col("email").equalTo(customersWhoOrderedTheProduct.col("email")), "leftouter")
    .where(customersWhoOrderedTheProduct.col("email").isNull())
    .drop(customersWhoOrderedTheProduct.col("email"))
    .cache();

int numberOfCustomers = (int) customersWhoHaventOrderedTheProduct.count();

Date reportTime = new Date();

// Prepare the Broadcast values. They are used in the map below.
Broadcast<String> bStoreId = sparkContext.broadcast(storeId, classTag(String.class));
Broadcast<String> bReportName = sparkContext.broadcast(MessageBrokerQueue.report_did_not_buy_product.toString(), classTag(String.class));
Broadcast<java.sql.Timestamp> bReportTime = sparkContext.broadcast(new java.sql.Timestamp(reportTime.getTime()), classTag(java.sql.Timestamp.class));
Broadcast<Integer> bNumberOfCustomers = sparkContext.broadcast(numberOfCustomers, classTag(Integer.class));

// Map the customers to a custom class, thus adding new properties.
DataFrame storeCustomerReport = sqlContext.createDataFrame(customersWhoHaventOrderedTheProduct.toJavaRDD()
    .map(row -> new StoreCustomerReport(bStoreId.value(), bReportName.getValue(), bReportTime.getValue(), bNumberOfCustomers.getValue(), row.getString(0), row.getString(1), row.getString(2))), StoreCustomerReport.class);


// Save the DataFrame to cassandra
storeCustomerReport.write().mode(SaveMode.Append)
    .option("keyspace", "my_keyspace")
    .option("table", "my_report")
    .format("org.apache.spark.sql.cassandra")
    .save();

如您所见，我缓存 customersWhoHaventOrderedTheProduct DataFrame，之后我执行一个count并调用toJavaRDD .

根据我的计算，这些操作应该只执行一次。但是当我进入当前作业的 Spark UI 时，我看到以下阶段:

如您所见，每个操作都执行了两次。

我做错了什么吗？有没有我错过的设置？

非常感谢任何想法。

编辑:

在我调用 System.out.println(storeCustomerReport.toJavaRDD().toDebugString()); 之后

这是调试字符串:

(200) MapPartitionsRDD[43] at toJavaRDD at DidNotBuyProductReport.java:93 []
  |   MapPartitionsRDD[42] at createDataFrame at DidNotBuyProductReport.java:89 []
  |   MapPartitionsRDD[41] at map at DidNotBuyProductReport.java:90 []
  |   MapPartitionsRDD[40] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   MapPartitionsRDD[39] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   MapPartitionsRDD[38] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   ZippedPartitionsRDD2[37] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   MapPartitionsRDD[31] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   ShuffledRDD[30] at toJavaRDD at DidNotBuyProductReport.java:89 []
  +-(2) MapPartitionsRDD[29] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[28] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[27] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[3] at cache at DidNotBuyProductReport.java:76 []
     |  CassandraTableScanRDD[2] at RDD at CassandraRDD.scala:15 []
  |   MapPartitionsRDD[36] at toJavaRDD at DidNotBuyProductReport.java:89 []
  |   ShuffledRDD[35] at toJavaRDD at DidNotBuyProductReport.java:89 []
  +-(2) MapPartitionsRDD[34] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[33] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[32] at toJavaRDD at DidNotBuyProductReport.java:89 []
     |  MapPartitionsRDD[5] at cache at DidNotBuyProductReport.java:76 []
     |  CassandraTableScanRDD[4] at RDD at CassandraRDD.scala:15 []

编辑 2:

因此，经过一些研究并结合试验和错误，我设法优化了这项工作。

我从 customersWhoHaventOrderedTheProduct 创建了一个 RDD，并在调用 count() 操作之前将其缓存。 (我将缓存从 DataFrame 移到了 RDD)。

之后，我使用此 RDD 创建了 storeCustomerReport DataFrame。

JavaRDD<Row> customersWhoHaventOrderedTheProductRdd = customersWhoHaventOrderedTheProduct.javaRDD().cache();

现在的阶段是这样的:

如您所见，两个 count 和 cache 现在都没有了，但是仍然有两个“javaRDD”操作。我不知道它们来自哪里，因为我在我的代码中只调用了一次 toJavaRDD。

最佳答案

看起来您在下面的代码段中应用了两个操作

// Map the customers to a custom class, thus adding new properties.
DataFrame storeCustomerReport = sqlContext.createDataFrame(customersWhoHaventOrderedTheProduct.toJavaRDD()
    .map(row -> new StoreCustomerReport(bStoreId.value(), bReportName.getValue(), bReportTime.getValue(), bNumberOfCustomers.getValue(), row.getString(0), row.getString(1), row.getString(2))), StoreCustomerReport.class);


// Save the DataFrame to cassandra
storeCustomerReport.write().mode(SaveMode.Append)
    .option("keyspace", "my_keyspace")

一个在 sqlContext.createDataFrame()，另一个在 storeCustomerReport.write()，这两个都需要 customersWhoHaventOrderedTheProduct.toJavaRDD() .

持久化由产生的 RDD 应该可以解决这个问题。

JavaRDD cachedRdd = customersWhoHaventOrderedTheProduct.toJavaRDD().persist(StorageLevel.DISK_AND_MEMORY) //Or any other storage level

关于java - Spark 将每个 Action 执行两次，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/33760644/

25

4

0

文章推荐： ios - 如何将推送主题设置为推送通知的标题？

文章推荐： ios - 计算动态 UILabel 的行数 (iOS7)

文章推荐： iphone - 滑动删除 UITableView

文章推荐： iphone - [NSNull isEqualToString :]

javascript - EmberJS Action - 当包装在 `actions` 中时从另一个 Action 调用一个 Action
当包裹在 EmberJS Controller 的 actions 中时，如何从另一个 Action 调用一个 Action ？使用现已弃用的方式定义操作的原始代码: //app.js App.In
Github Action -完成一个 Action 后触发另一个 Action
我有一个 Action (一个yaml文件)，用于将docker镜像部署到Google Cloud Run。我希望收到通知构建和推送结果的Slack或电子邮件。构建操作完成后，如何触发消息操作？
java - Action 类中的tick(Action action)是什么？
Selenium 的 actions 类中存在的 tick(Action action) 和 tick(Interaction...actions) 方法的用途是什么？是否与点击任何 webElem
actions-on-google - 对话 Action 2023 年日落 : Migrating from conversational actions to Smart Home Actions
简短的背景故事我们目前为数百名用户提供对话操作。我们在过去三年中为我们的一位客户开发了这个 Action 作为“工作”。正如我们最近发现的那样，我们会受到对话行为的影响。当然，我们现在正在研究如何
uml - 在事件图中，由于一个 Action 包含在另一个 Action 中，是否可以 fork 成两个 Action 但在加入时只有一个 Action ？
考虑系统用户可以并发方式执行两个操作，第一个操作 (A1) 仅对用户的订单执行，第二个操作 (A2) 包括在执行时执行 (A1)，如下面的使用所述-案例图..((考虑A1完全执行U1，A2完全执行U2
android - Action 项目系统地堆叠在 Action 溢出中，在 Action 栏中
我正在为 android 中的 ActionBar 而苦苦挣扎。这是我的问题:我的操作项没有显示在操作栏中，而是堆叠在操作溢出中，无论我做什么.. 我花了一天的时间寻找解决方案，但我似乎找不到缺少的
github-actions - 如何将 Action 的输出用作 Github Action 工作流程的 if 条件中的表达式？
我正在构建一个工作流，其中一个操作为工作流中的一个步骤提供条件。我该如何使用这个值？该操作的值为空，因此计算结果为 false，并且从未部署过任何内容... jobs: build: s
redux - 像显示/隐藏加载屏幕这样的 Action 应该由相关 Action 的reducer处理还是由 Action 创建者自己生成？
鉴于您有一些全局 View (例如，显示加载屏幕)，您可能希望在许多情况下发生这种情况，为该行为创建一个 Action 创建者/ Action 对还是为相关 Action 创建 reducer 更合适
actions-on-google - Actions on Google 启动自定义操作(不是主要的 actions.intent.MAIN)
我有一个使用 DialogFlow 构建的 Actions on Google 代理，其中包含多个操作(例如 actions.intent.MAIN 和 get_day_of_week)。当我在 3
github-actions - 如何从 GitHub Action 的 action.yml 文件中引用其他操作？
是否可以从我的 action.yml 文件中引用另一个 GitHub 操作？请注意，我在这里谈论的是操作，而不是工作流程。我知道这可以通过工作流来完成，但是操作可以引用其他操作吗？最佳答案答案似
javascript - 如何从一个 Action 派发另一个 Action 并在 Vuex 中派发另一个 Action
在 Vuex 操作中，我们有以下实现。 async actionA({ commit, dispatch }) { const data = this.$axios.$get(`/apiUrl`)
java - 正在调用 struts.xml 中定义的 Action ，但未调用 Action 包中存在的 Action
我正在将我的应用程序服务器从 Jboss 4.2 迁移到 7.1。我在 Struts 配置中收到以下错误。 struts.xml 中定义的 Action 被调用，而 Action 包中的操作未被调用。
java - 将 Action 重定向(使用拦截器)到其他 Action 时无法执行 Struts2 Action
我向 ActLand 发送请求，然后 intercept()，如果没有登录则重定向到 Login.jsp。 struts.xml:
javascript - Action 创建者是否有必要返回 Action ？
我有一个 Action 创建器，它接受一个 id 和一个回调函数。它向服务器发送请求以执行某些操作并返回一个虚拟操作。我在这里想做的就是调用回调函数并退出，因为该虚拟操作对我来说没有用处，例如喜欢帖子
c# - Action 链接到子 Action
我已经使用 Html.Action 方法调用了另一个 View 。当用户单击操作链接时，我想在 subview 内使用参数调用相同的操作。当我写这段代码时，我得到了这个错误信息: Html.Acti
c# - Action<> 与事件 Action
是 public event Action delt = () => { Console.WriteLine("Information"); }; 的重载版本 Action delg = (a, b)
java从另一个 Action 调用 Action
countresultsfrom.addActionListener(new ActionListener() { public void actionPerforme
c# - Action 是什么意思？
我刚刚看到一个 brand-new video在 Rx 框架上，一个特别的签名引起了我的注意: Scheduler.schedule(this IScheduler, Action) 在 23:55，
actions-on-google - Google Action 和 DialogFlow 错误 "Sorry, this action is not available for your app"
我创建了一个在我的开发者帐户中完美运行的 DialogFlow 应用程序。但我需要以另一个用户的身份对其进行测试，因此在我的 Google Action 模拟器中，我添加了另一个测试帐户作为项目的所
java - 如何在 Action 链调用上的另一个 Action 类之后访问 Jsp 中的一个 Action 类 ActionMessages
我正在尝试实现消息存储拦截器以在我的 JSp 上显示 ActionMessage，但无法访问 ActionMessage。有人可以提供一个链接如何实现消息存储拦截器吗？最佳答案这是我的一个应用程序

首页

博学

6Ren·AI

商城

java - Spark 将每个 Action 执行两次