
hadoop - Where is my sparkDF.persist(DISK_ONLY) data stored?

Reposted · Author: 可可西里 · Updated: 2023-11-01 14:18:31

I would like to better understand Spark's persistence strategy when running on Hadoop.

When I persist a DataFrame with the DISK_ONLY strategy, where is my data stored (path/folder...)? And where do I specify this location?

Best Answer

For the short answer, we can look at the documentation on spark.local.dir:

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
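As a configuration sketch, this setting can be supplied when building the session (the path /mnt/fast-disk/spark-scratch is a hypothetical example; as the note above says, on YARN, standalone, or Mesos the cluster manager's environment variables override it):

```scala
import org.apache.spark.sql.SparkSession

// Config sketch: point Spark's scratch space at a fast local disk.
// "/mnt/fast-disk/spark-scratch" is a hypothetical path; a comma-separated
// list can be used to spread scratch I/O across several disks. On YARN or
// standalone this value is overridden by LOCAL_DIRS / SPARK_LOCAL_DIRS.
val spark = SparkSession.builder()
  .appName("disk-only-demo")
  .config("spark.local.dir", "/mnt/fast-disk/spark-scratch")
  .getOrCreate()
```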

For a deeper understanding, we can look at the code. A DataFrame (which is just a Dataset[Row]) is based on an RDD and leverages the same persistence mechanism. The RDD delegates this to the SparkContext, which marks it for persistence. The task is then actually handled by several classes in the org.apache.spark.storage package: first, the BlockManager just manages the blocks of data to be persisted and the policy for doing so, delegating the actual persistence to the DiskStore (when writing on disk, of course), which represents a high-level interface for writing and which in turn has a DiskBlockManager for the lower-level operations.
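To make the chain concrete, here is a minimal sketch of the user code that triggers this machinery (it assumes a running SparkSession named spark; note that persist is lazy, so nothing is written to disk until the first action):

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: persist a DataFrame with DISK_ONLY. The blocks are written by
// each executor's DiskStore under that executor's local directories.
val df = spark.range(0, 1000000).toDF("id")
df.persist(StorageLevel.DISK_ONLY)
df.count() // first action materializes the persisted blocks on disk
```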

Hopefully you now know where to look, so we can move on and see where the data is actually saved and how we can configure it: the DiskBlockManager calls the helper Utils.getConfiguredLocalDirs, which for practical purposes I'll copy here (taken from the linked 2.2.1 version, the latest at the time of writing):

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_DIRECTORY"))
  } else {
    if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
      logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
        "spark.shuffle.service.enabled is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}
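The fallback chain above can be condensed into a small stand-alone sketch (this is not Spark's actual code: it models the environment with a plain Map and deliberately ignores the YARN and shuffle-service branches):

```scala
// Simplified model of the precedence in getConfiguredLocalDirs:
// SPARK_EXECUTOR_DIRS > SPARK_LOCAL_DIRS > MESOS_DIRECTORY >
// spark.local.dir > java.io.tmpdir. Not Spark's actual implementation.
def resolveLocalDirs(env: Map[String, String],
                     sparkLocalDir: Option[String]): Array[String] = {
  env.get("SPARK_EXECUTOR_DIRS").map(_.split(java.io.File.pathSeparator))
    .orElse(env.get("SPARK_LOCAL_DIRS").map(_.split(",")))
    .orElse(env.get("MESOS_DIRECTORY").map(d => Array(d)))
    .getOrElse(sparkLocalDir
      .getOrElse(System.getProperty("java.io.tmpdir"))
      .split(","))
}
```

For example, with no relevant environment variables and no spark.local.dir set, this resolves to java.io.tmpdir alone, matching the last branch of the real code.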

The code, I believe, is quite self-explanatory and well commented (and matches the documentation exactly): when running on Yarn there is a specific policy that relies on the storage of the Yarn containers; on Mesos it uses the Mesos sandbox (unless the shuffle service is enabled); and in all other cases it goes to the location set under spark.local.dir or, failing that, java.io.tmpdir (which is likely /tmp/).

So, if you are just playing around, the data is most likely stored under /tmp/; otherwise it depends heavily on your environment and configuration.
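If you want to verify this on your own machine, Spark's scratch directories are usually recognizable by name: executor scratch directories are typically prefixed spark- and the block manager's subdirectories blockmgr- (prefixes observed in practice, not a guaranteed contract). A small hedged helper to list them:

```scala
import java.io.File

// List subdirectories that look like Spark scratch space (names starting
// with "spark-" or "blockmgr-") under a given root, e.g. java.io.tmpdir.
// The prefixes are conventions observed in practice, not a stable API.
def findSparkScratchDirs(root: File): Seq[String] =
  Option(root.listFiles).toSeq.flatten
    .filter(f => f.isDirectory &&
      (f.getName.startsWith("spark-") || f.getName.startsWith("blockmgr-")))
    .map(_.getName)

// Usage: findSparkScratchDirs(new File(System.getProperty("java.io.tmpdir")))
```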

Regarding hadoop - where is my sparkDF.persist(DISK_ONLY) data stored?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48430366/
