
apache-spark - Spark JobEnd listener moving source files from an HDFS path causes FileNotFoundException


Spark version: 2.3

  • The Spark Structured Streaming application streams files from an HDFS path:
      Dataset<Row> lines = spark
          .readStream()
          .format("text")
          .load("path");
  • After some transformations on the Dataset, the job for a single file should end in a succeeded state.
  • A Spark listener was added for the job-end event, and the source file is moved when it fires (a minimal registration sketch follows the stack trace below):
    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        // Move the source file, whose processing has finished, to another location.
    }
  • The file is moved to the other location successfully, but at that very same moment (the timestamps match exactly) Spark throws a file-not-found exception. This happens randomly and cannot be reproduced on demand, but it happens often.
  • Even though that particular job has ended, Spark somehow still references the file.
  • How can I make sure Spark no longer references the file once the job ends, so that the file-not-found problem is avoided?
  • I found the following here:

  • SparkListenerJobEnd

    DAGScheduler does cleanUpAfterSchedulerStop, handleTaskCompletion, failJobAndIndependentStages, and markMapStageJobAsFinished.

  • Same question with different approach

  • The exception:
        java.io.FileNotFoundException: File does not exist: <filename>
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteExc
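
For reference, here is a minimal sketch, in Java, of how such a job-end listener could be wired up. It is only an illustration under assumptions the question does not state: the class name MoveSourceFileListener, the sourceFile and archiveDir parameters, and the way the file path reaches the listener are hypothetical, and the move itself uses the standard Hadoop FileSystem.rename() call.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerJobEnd;
    import org.apache.spark.sql.SparkSession;

    public class MoveSourceFileListener extends SparkListener {

        private final Configuration hadoopConf;
        private final String sourceFile;  // hypothetical: HDFS path of the file being processed
        private final String archiveDir;  // hypothetical: directory that receives processed files

        public MoveSourceFileListener(Configuration hadoopConf, String sourceFile, String archiveDir) {
            this.hadoopConf = hadoopConf;
            this.sourceFile = sourceFile;
            this.archiveDir = archiveDir;
        }

        @Override
        public void onJobEnd(SparkListenerJobEnd jobEnd) {
            // Act only on successful jobs; JobSucceeded is the Scala case object behind jobResult().
            if (!"JobSucceeded".equals(String.valueOf(jobEnd.jobResult()))) {
                return;
            }
            try {
                FileSystem fs = FileSystem.get(hadoopConf);
                Path src = new Path(sourceFile);
                if (fs.exists(src)) {
                    // rename() is a metadata-only move when both paths are on the same HDFS cluster.
                    fs.rename(src, new Path(archiveDir, src.getName()));
                }
            } catch (IOException e) {
                // Listener callbacks must not throw; in real code, log and decide how to recover.
            }
        }

        // Register the listener on the driver before starting the streaming query.
        public static void register(SparkSession spark, String sourceFile, String archiveDir) {
            spark.sparkContext().addSparkListener(new MoveSourceFileListener(
                spark.sparkContext().hadoopConfiguration(), sourceFile, archiveDir));
        }
    }

Even a correct listener like this can race with the query, because SparkListener events are delivered asynchronously on the listener bus while the file-stream source may still list or read the same file for a later job of the micro-batch, which would match the intermittent FileNotFoundException described above.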

    Best Answer

    Regarding "apache-spark - Spark JobEnd listener moving source files from an HDFS path causes FileNotFoundException", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58075935/
