
python - How to properly set the configuration of a SparkContext in a Jupyter Notebook?


I am new to Spark and tried to configure a SparkContext, but unfortunately I get an error message.

I wrote this code:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SQLContext
import sys
import requests

# create spark configuration

conf = SparkConf()
conf.setAppName("TwitterStreamApp")

# create spark context with the above configuration
sc = SparkContext(conf=conf)

And I got this error:

Py4JError                                 Traceback (most recent call last)
<ipython-input-97-b0f526d72e5a> in <module>
1 # create spark context with the above configuration
----> 2 sc = SparkContext(conf=conf)

~\anaconda3\lib\site-packages\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
133 # If an error occurs, clean up in order to allow future SparkContext creation:
134 self.stop()
--> 135 raise
136
137 def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~\anaconda3\lib\site-packages\pyspark\context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
211 self.pythonVer = "%d.%d" % sys.version_info[:2]
212
--> 213 if sys.version_info < (3, 6):
214 with warnings.catch_warnings():
215 warnings.simplefilter("once")

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __getattr__(self, name)
1528 answer, self._gateway_client, self._fqn, name)
1529 else:
-> 1530 raise Py4JError(
1531 "{0}.{1} does not exist in the JVM".format(self._fqn, name))
1532

Py4JError: org.apache.spark.api.python.PythonUtils.isEncryptionEnabled does not exist in the JVM

Also, I added JAVA_HOME and SPARK_HOME to the system environment variables, but it does not work.

Best answer

I think the way you are setting things up ends up running multiple SparkContexts at the same time.
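If you do want to stay with the plain SparkContext approach, a minimal sketch (assuming the duplicate-context diagnosis above is right) is to reuse any context that already exists instead of constructing a second one:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("TwitterStreamApp")

# getOrCreate returns the context that is already running, if any,
# instead of raising when a second SparkContext is constructed
sc = SparkContext.getOrCreate(conf=conf)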

Try this simpler setup:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TwitterStreamApp').getOrCreate()
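A SparkSession built this way wraps a SparkContext, so if some API still needs the context itself you can reach it through the session (standard PySpark attributes):

# the underlying SparkContext is exposed as an attribute of the session
sc = spark.sparkContext
print(sc.appName)  # TwitterStreamApp
print(sc.master)   # e.g. local[*] when no master was configured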

If you are not streaming, say you are reading a CSV file with a header:

staticDF = spark.read.csv('source/file/path/here', header = True, inferSchema = True)

If you are streaming, again with CSV as the format:

streamingDF = (spark.readStream
    .schema(schema)                 # provide/build your schema here
    .option('...', '...')           # whatever your options are
    .csv('source/file/path/here'))

You may want to get used to the idea of providing or building a schema when reading data; it helps processing speed compared with letting Spark try to infer the schema.
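For example, here is a minimal sketch of supplying a schema up front; the column names and types are made up purely for illustration:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# hypothetical Twitter-ish columns, purely for illustration
tweetSchema = StructType([
    StructField("user", StringType(), True),
    StructField("text", StringType(), True),
    StructField("retweet_count", LongType(), True),
])

# no inferSchema pass over the data is needed: Spark uses the schema you supply
staticDF = spark.read.csv('source/file/path/here', schema=tweetSchema, header=True)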

Regarding "python - How to properly set the configuration of a SparkContext in a Jupyter Notebook?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65400008/
