apache-spark - FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro

I am getting the following error when saving a DataFrame in Avro for the second time. If I delete sub_folder/part-00000-XXX-c000.avro after saving and then try to save the same dataset, I get:

FileNotFoundException: File /.../main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
  • If I delete not only from sub_folder but from main_folder as well, the problem does not happen, but I can't afford that.
  • The problem does not actually happen when trying to save the dataset in any other format.
  • Saving an empty dataset does not cause the error.

The message suggests that a table needs to be refreshed, but as the output of sparkSession.catalog.listTables().show() shows, there are no tables to refresh:

+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+

The DataFrame saved earlier looks like this. The application is supposed to update it:

+--------------------+--------------------+
|                Col1|                Col2|
+--------------------+--------------------+
|[123456, , ABC, [...|[[v1CK, RAWNAME1_,..|
|[123456, , ABC, [...|[[BG8M, RAWNAME2_...|
+--------------------+--------------------+

To me this is clearly a cache problem. However, all attempts at clearing the cache have failed:

dataset.write
  .format("avro")
  .option("path", path)
  .mode(SaveMode.Overwrite) // Any save mode gives the same error.
  .save()

// Moving this either before or after saving doesn't help.
sparkSession.catalog.clearCache()

// This will not un-persist any cached data that is built upon this Dataset.
dataset.cache().unpersist()
dataset.unpersist()

This is how I read the dataset:

private def doReadFromPath[T <: SpecificRecord with Product with Serializable: TypeTag: ClassTag](path: String): Dataset[T] = {
  val df = sparkSession.read
    .format("avro")
    .load(path)
    .select("*")

  df.as[T]
}
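For context, a hypothetical call site (MyRecord is a placeholder for the actual SpecificRecord class generated from the Avro schema; the path is taken from the stack trace) showing the read-then-overwrite-the-same-path pattern that triggers the error:

// Hypothetical usage; MyRecord stands in for the Avro-generated class.
val ds: Dataset[MyRecord] = doReadFromPath[MyRecord]("/DATA/XXX/main_folder/sub_folder")

// ... transformations ...

// Writing back to the very path the Dataset was read from is what fails on
// the second run: the lazy plan still references the original part files.
ds.write
  .format("avro")
  .mode(SaveMode.Overwrite)
  .save("/DATA/XXX/main_folder/sub_folder")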

Finally, here is the stack trace. Any help is greatly appreciated:

ERROR [task-result-getter-3] (Logging.scala:70) - Task 0 in stage 9.0 failed 1 times; aborting job
ERROR [main] (Logging.scala:91) - Aborting job 150de02a-ac6a-4d42-824d-5db44a98c19a.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 11, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/DATA/XXX/main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more

Best Answer

Reading from a location and writing to that same location will cause this issue. It was also discussed in this forum, along with my answer there.

The following message in the error is misleading; the actual problem is reading from and writing to the same location:

You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL
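For what it's worth, with a path-based read like yours there is no registered table to refresh; the closest equivalent is the catalog's path-based invalidation, shown below, but it does not fix the root cause here:

// Invalidates cached data/metadata for anything read from this path.
// It does not help here, because the overwrite deletes the very files
// the same job is still trying to read.
sparkSession.catalog.refreshByPath("/DATA/XXX/main_folder/sub_folder")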

I am giving an example different from yours (parquet is used here; in your case it is avro).

I have two options for you.

Option 1 (cache and show will work as shown below...):

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDS/toDF on a Seq

val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")

df.show(false)

df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back again

val df2 = df1.withColumn("cleanup", lit("Rod want to cleanup")) // like you said, the cleanup you want

// THE 2 STEPS BELOW ARE IMPORTANT: `cache` plus a light action such as
// `show(2)` (or `count`), without which the FileNotFoundException will come.

df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // any action works

df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("Rod saved in the same directory he read from; the final records after cleanup are:")
df2.show(false)
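This works because cache followed by an action materializes df2 before the overwrite begins: once the rows are held in memory, the final write no longer needs to re-scan the input files that mode("overwrite") is about to delete. Without that action the plan stays lazy, so the write would try to read the very files it is replacing, which is exactly the FileNotFoundException above.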

Option 2:

1) Save the DataFrame to a different avro folder.

2) Delete the old avro folder.

3) Finally, rename the newly created avro folder to the old name. This will work; a minimal sketch follows below.
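A minimal sketch of option 2 using the Hadoop FileSystem API (the folder names and the dataset/sparkSession handles are assumptions carried over from the question):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Hypothetical layout, mirroring the paths in the question.
val oldDir = new Path("/DATA/XXX/main_folder/sub_folder")
val newDir = new Path("/DATA/XXX/main_folder/sub_folder_new")

// 1) Save to a different folder, so the write never overlaps the read.
dataset.write.format("avro").mode(SaveMode.Overwrite).save(newDir.toString)

// 2) Delete the old folder (recursively), 3) rename the new one to the old name.
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
fs.delete(oldDir, true)
fs.rename(newDir, oldDir)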

Regarding apache-spark - FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61725822/
