
scala - Can SparkContext.setCheckpointDir(hdfsPath) set the same hdfsPath across different Spark applications?


As per the documentation:

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#setCheckpointDir-java.lang.String-

SparkContext:

setCheckpointDir
public void setCheckpointDir(String directory)
Set the directory under which RDDs are going to be checkpointed.
Parameters:
directory - path to the directory where checkpoint files will be stored (must be HDFS path if running in cluster)
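
For orientation, a minimal usage sketch (the app name, path, and data are illustrative, not from the question): the directory is set once per SparkContext, and checkpoint files are written when a checkpointed RDD is first materialized.

package examples

import org.apache.spark.sql.SparkSession

object CheckpointUsage extends App {
  val spark = SparkSession.builder().appName("CheckpointUsage").master("local").getOrCreate()

  // Must be an HDFS path when running on a cluster, per the docs above;
  // a local path is fine for a local master.
  spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

  val rdd = spark.sparkContext.parallelize(1 to 100)
  rdd.checkpoint()            // marks the RDD for checkpointing
  rdd.count()                 // the first action materializes the RDD and writes the checkpoint files
  println(rdd.isCheckpointed) // true once the files have been written

  spark.stop()
}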


Questions:

1) If different Spark applications call SparkContext.setCheckpointDir(hdfsPath) with the same hdfsPath, is there a conflict?

2) If there is no conflict, will the hdfsPath used as the CheckpointDir be cleaned up automatically?

Best Answer

Questions:

1) If different Spark applications call SparkContext.setCheckpointDir(hdfsPath) with the same hdfsPath, is there a conflict?

Answer: There is no conflict, as the example below shows. Multiple applications can use the same checkpoint directory: a subfolder named with a unique hash is created under it for each application, so they never collide (see the sketch right after this answer).
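
One way to observe this (a minimal sketch of mine, assuming a local master; it is not from the original answer): simulate two "applications" in one JVM by stopping and recreating the SparkSession against the same directory, then compare the values of getCheckpointDir. The UUID subdirectories differ, so the applications never write into each other's folders.

package examples

import org.apache.spark.sql.SparkSession

object SameCheckpointDirTest extends App {
  val sharedDir = "/tmp/checkpoints" // the shared path; illustrative

  // First "application".
  val spark1 = SparkSession.builder().appName("app1").master("local").getOrCreate()
  spark1.sparkContext.setCheckpointDir(sharedDir)
  println(spark1.sparkContext.getCheckpointDir) // e.g. file:/tmp/checkpoints/<uuid-1>
  spark1.stop()

  // Second "application" pointing at the same directory.
  val spark2 = SparkSession.builder().appName("app2").master("local").getOrCreate()
  spark2.sparkContext.setCheckpointDir(sharedDir)
  println(spark2.sparkContext.getCheckpointDir) // e.g. file:/tmp/checkpoints/<uuid-2>, a different UUID
  spark2.stop()
}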

2) If there is no conflict, will the hdfsPath used as the CheckpointDir be cleaned up automatically?

Answer: Yes, it is. In the example below I used local for demonstration... but whether it is local or hdfs does not matter; the behavior is the same.

Let's walk through an example (run several times with the same checkpoint directory):

package examples

import java.io.File

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckPointTest extends App {

  val spark = SparkSession.builder().appName("CheckPointTest").master("local").getOrCreate()
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.WARN)

  import spark.implicits._

  // Every run of this application points at the same shared checkpoint directory.
  spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

  // Build a tiny CSV dataset in memory.
  val csvData1: Dataset[String] = spark.sparkContext.parallelize(
    """
      |id
      | a
      | b
      | c
    """.stripMargin.lines.toList).toDS()

  val frame1 = spark.read.option("header", true).option("inferSchema", true).csv(csvData1)
  frame1.show()

  // Each SparkContext gets its own unique (UUID-named) subdirectory.
  val checkpointDir = spark.sparkContext.getCheckpointDir.get
  println(checkpointDir)

  println("Number of Files in Check Point Directory " + getListOfFiles(checkpointDir).length)

  // Lists plain files (not subdirectories) under the given directory.
  // getCheckpointDir returns a URI such as file:/tmp/checkpoints/<uuid>,
  // so drop the scheme before handing the path to java.io.File.
  def getListOfFiles(dir: String): List[File] = {
    val d = new File(dir.stripPrefix("file:"))
    if (d.exists && d.isDirectory) {
      d.listFiles.filter(_.isFile).toList
    } else {
      List[File]()
    }
  }
}


Result:
+---+
| id|
+---+
| a|
| b|
| c|
+---+

file:/tmp/checkpoints/30e6f882-b49a-42cc-9e60-59adecf13166
Number of Files in Check Point Directory 0 // indicates that once the application finished, all the RDD/DS information was removed


If you look inside the checkpoint folder, it will look something like this...

user@f0189843ecbe [~/Downloads]$ ll /tmp/checkpoints/
total 0
drwxr-xr-x 2 user wheel 64 Mar 27 14:08 a2396c08-14b6-418a-b183-a90a4ca7dba3
drwxr-xr-x 2 user wheel 64 Mar 27 14:09 65c8ef5a-0e64-4e79-a050-7d1ee1d0e03d
drwxr-xr-x 2 user wheel 64 Mar 27 14:09 5667758c-180f-4c0b-8b3c-912afca59f55
drwxr-xr-x 2 user wheel 64 Mar 27 14:10 30e6f882-b49a-42cc-9e60-59adecf13166
drwxr-xr-x 6 user wheel 192 Mar 27 14:10 .
drwxrwxrwt 5 root wheel 160 Mar 27 14:10 ..
user@f0189843ecbe [~/Downloads]$ du -h /tmp/checkpoints/
0B /tmp/checkpoints//a2396c08-14b6-418a-b183-a90a4ca7dba3
0B /tmp/checkpoints//5667758c-180f-4c0b-8b3c-912afca59f55
0B /tmp/checkpoints//65c8ef5a-0e64-4e79-a050-7d1ee1d0e03d
0B /tmp/checkpoints//30e6f882-b49a-42cc-9e60-59adecf13166
0B /tmp/checkpoints/

Conclusion:

1) Even when multiple applications run in parallel, each gets a unique hash subdirectory under the checkpoint directory, and all of its RDD/DS information is stored there.

2) After each Spark application finishes successfully, the context cleaner removes its contents, as the practical example above shows.
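
One related note beyond the answer itself: removal of checkpoint files for RDDs that go out of scope while the application is still running is governed by the Spark configuration key spark.cleaner.referenceTracking.cleanCheckpoints, which defaults to false. A minimal sketch of enabling it (the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Ask the ContextCleaner to also remove checkpoint files of RDDs that are
// garbage-collected during the application's lifetime (off by default).
val spark = SparkSession.builder()
  .appName("CheckpointCleanup")
  .master("local")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()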

Regarding scala - Can SparkContext.setCheckpointDir(hdfsPath) set the same hdfsPath across different Spark applications?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60890410/
