python - How to resolve: Very large size tasks in Spark

Reposted by: 太空狗. Updated: 2023-10-30 01:27:40

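I'm pasting the Python code I run on Spark to do some analysis on my data. The program below runs fine on a small dataset, but on a large dataset it reports: "Stage 1 contains a task of very large size (17693 KB). The maximum recommended task size is 100 KB."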

import os
import sys
import unicodedata
from operator import add

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)

def tokenize(text):
    resultDict = {}
    text = unicodedata.normalize('NFKD', text).encode('ascii','ignore')

    str1= text[1]
    str2= text[0]

    arrText= text.split(str1)

    ss1 = arrText[0].split("/")

    docID = ss1[0].strip()

    docName = ss1[1].strip()  # was ss[1], an undefined name

    resultDict[docID+"_"+docName] = 1

    return resultDict.iteritems()

sc=SparkContext('local')
textfile = sc.textFile("path to my data")
fileContent = textfile.flatMap(tokenize)
rdd = sc.parallelize(fileContent.collect())
rdd= rdd.map(lambda x: (x[0], x[1])).reduceByKey(add)
print rdd.collect()
#reduceByKey(lambda a,b: a+b)
rdd.coalesce(1).saveAsTextFile("path to result")

Here are more of the warnings. After this point the job stops making progress. Can anyone help me solve this?

16/06/10 19:19:58 WARN TaskSetManager: Stage 1 contains a task of very large size (17693 KB). The maximum recommended task size is 100 KB.
16/06/10 19:19:58 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 5314, localhost, partition 0,PROCESS_LOCAL, 18118332 bytes)
16/06/10 19:19:58 INFO Executor: Running task 0.0 in stage 1.0 (TID 5314)
16/06/10 19:43:00 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:43480 in memory (size: 3.9 KB, free: 511.1 MB)
16/06/10 19:43:00 INFO ContextCleaner: Cleaned accumulator 2

Best answer

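When Spark serializes a task, it recursively serializes the full closure context. In this case the culprit seems to be the unicodedata module you use in tokenize. I could be wrong, but I don't see any other heavy data structure in the code. (Note: I normally use Spark with Scala, and my Python is rusty.) I wonder whether the library is backed by a large data structure that is not available on the executor nodes.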
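To get a feel for how a captured object inflates a serialized closure, the following is a rough diagnostic sketch. It is not how Spark computes task size: it uses the standard-library pickle only on the objects a function closes over, ignores module-level globals, and the helper names captured_size_bytes and make_filter are made up for illustration.

```python
import pickle


def captured_size_bytes(func):
    """Rough pickled size, in bytes, of the objects a function closes over.

    Only looks at closure cells; module-level globals are ignored, and a
    captured object that cannot be pickled will raise an error.
    """
    cells = func.__closure__ or ()
    return sum(len(pickle.dumps(cell.cell_contents)) for cell in cells)


def make_filter(allowed):
    # `allowed` becomes part of the closure of `keep`, so it would travel
    # with every task that runs `keep`.
    def keep(x):
        return x in allowed
    return keep
```

Calling captured_size_bytes(make_filter(list(range(100000)))) gives a figure well above the 100 KB mentioned in the warning, while a function that captures nothing reports 0.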

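The typical patterns for dealing with this kind of problem are:

Make sure all libraries are available on the executor nodes.
Use broadcast variables to distribute heavy data structures to the executors.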
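The broadcast-variable pattern can be sketched as follows. This is a minimal illustration under assumptions: doc_names is a hypothetical large dictionary, and label_with_broadcast and label_documents are made-up helper names. The only Spark calls used are SparkContext.broadcast, the .value attribute of the returned handle, and RDD.map.

```python
def label_with_broadcast(doc_names_bc):
    """Build a map function that looks names up through a broadcast handle.

    The closure captures only the small handle; the dictionary itself is
    shipped to each executor once instead of with every task.
    """
    def label(doc_id):
        # .value resolves to the executor-local copy of the dictionary
        return (doc_id, doc_names_bc.value.get(doc_id, "unknown"))
    return label


def label_documents(sc, ids_rdd, doc_names):
    """Hypothetical driver-side helper: broadcast doc_names, then map."""
    doc_names_bc = sc.broadcast(doc_names)
    return ids_rdd.map(label_with_broadcast(doc_names_bc))
```

The point of the design is that the heavy object is referenced only through the handle, so it never becomes part of the serialized closure.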

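Unrelated: unless you are using it as a debugging tool, calling collect pulls all the data back to the driver, which is unnecessary here. The transformations can simply be chained: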

sc.textFile(...).flatMap(...).map(...).reduceByKey(add).coalesce(1).saveAsTextFile(...)
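Spelled out, the chained version might look like the sketch below. It is written for Python 3 and assumes, purely for illustration, that each input line looks like "docID/docName<TAB>rest"; the question does not state the real format, so the simplified tokenize and the helper name count_documents are assumptions, not the asker's code.

```python
import unicodedata
from operator import add


def tokenize(line, sep="\t"):
    """Return [(docID_docName, 1)] for one input line.

    Assumes (hypothetically) lines shaped like "docID/docName<sep>rest".
    """
    # Strip accents the same way the question does, but keep a str
    text = unicodedata.normalize("NFKD", line)
    text = text.encode("ascii", "ignore").decode("ascii")
    parts = text.split(sep)[0].split("/")
    if len(parts) < 2:
        return []  # skip lines that do not match the assumed shape
    return [(parts[0].strip() + "_" + parts[1].strip(), 1)]


def count_documents(sc, input_path, output_path):
    """Run the job as one chain; nothing is collected to the driver."""
    (sc.textFile(input_path)
       .flatMap(tokenize)
       .reduceByKey(add)
       .coalesce(1)
       .saveAsTextFile(output_path))
```

With a SparkContext in hand this would be called as count_documents(sc, "path to my data", "path to result"), with no collect or parallelize step in between.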

Regarding "python - How to resolve: Very large size tasks in Spark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37759385/
