I launched pyspark from cmd and ran the steps below to sharpen my skills.
C:\Users\Administrator>SUCCESS: The process with PID 5328 (child process of PID 4476) has been terminated.
SUCCESS: The process with PID 4476 (child process of PID 1092) has been terminated.
SUCCESS: The process with PID 1092 (child process of PID 3952) has been terminated.
pyspark
Python 3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/08 20:07:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
Using Python version 3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022 19:58:39)
Spark context Web UI available at http://Mohit:4040
Spark context available as 'sc' (master = local[*], app id = local-1673188677388).
SparkSession available as 'spark'.
>>> 23/01/08 20:08:10 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
a = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
When I execute a.take(1), I get a "_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range" error, and I am unable to find out why. When the same code is run on Google Colab, it does not throw any error. Below is what I get in the console.
>>> a.take(1)
Traceback (most recent call last):
  File "C:\Spark\python\pyspark\serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range
Traceback (most recent call last):
  File "C:\Spark\python\pyspark\serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Spark\python\pyspark\rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
                ^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3505, in _jrdd
    wrapped_func = _wrap_function(
                   ^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3362, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3345, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
                      ^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\serializers.py", line 468, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range
It should return [1], but instead it throws this error. Is it because of an incorrect installation?
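For reference, here is a minimal, self-contained sketch of the behaviour I expect (the app name is arbitrary); on a compatible setup it prints [1]:

    from pyspark import SparkContext

    # Build a local context, parallelize the same list as above and take one element.
    sc = SparkContext("local[*]", "take-repro")
    a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    print(a.take(1))  # expected output: [1]
    sc.stop()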
Packages used: spark-3.3.1-bin-hadoop3.tgz, Java(TM) SE Runtime Environment (build 1.8.0_351-b10), Python 3.11.1.
Can anyone help in troubleshooting this? Many thanks in advance.
It might be a Python version incompatibility issue; can you recheck with version 3.8?
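To see which interpreter PySpark is actually picking up, a quick sanity check inside the pyspark shell (where sc already exists) is:

    import sys

    print(sys.version)   # Python running the driver
    print(sc.pythonVer)  # Python version PySpark reports for itself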
I tried with Python 3.8.5 and now it shows a different error, a Java IOException, even though I pip-installed py4j and the JDK was already installed.
I fixed it by downgrading to Python 3.9. I installed pip for that version with python3.9 -m ensurepip, and then installed PySpark with python3.9 -m pip install pyspark. After that you will get an error saying you are running PySpark with Python 3.11; it is an environment variable problem, and you have to change two variables:
I use JupyterLab in VS Code, so to get the right variables in the VS Code JupyterLab you have to open the Jupyter extension's settings.json and add "jupyter.runStartupCommands": [ "import os\nos.environ['PYSPARK_PYTHON']='/bin/python3.9'\nos.environ['PYSPARK_DRIVER_PYTHON']='/bin/python3.9'\n" ]
If instead you want to use PySpark with Python 3.9 system-wide, you can add export PYSPARK_PYTHON='/bin/python3.9' and export PYSPARK_DRIVER_PYTHON='/bin/python3.9' to your .bashrc, as in the sketch below.
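Equivalently, as a rough sketch, you can point PySpark at the interpreter from Python itself before any SparkSession is created (the /bin/python3.9 path is only the example used above; adjust it for your machine):

    import os

    # Must be set before the SparkContext/SparkSession is created.
    os.environ["PYSPARK_PYTHON"] = "/bin/python3.9"         # interpreter for the workers
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/bin/python3.9"  # interpreter for the driver

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
    print(spark.sparkContext.parallelize([1, 2, 3]).take(1))  # should print [1]
    spark.stop()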
According to https://github.com/apache/spark/pull/38987, you will need Spark 3.4.0 to use Python 3.11; at the time of writing it has not yet been released at https://spark.apache.org/downloads.html. Python 3.10 should work.
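If you want the mismatch to fail loudly instead of surfacing as a PicklingError, a small guard at the top of your script (my own addition, not part of PySpark) can check the interpreter first:

    import sys

    # Spark 3.3.x bundles a cloudpickle that cannot handle Python 3.11 bytecode.
    if sys.version_info >= (3, 11):
        raise RuntimeError("Spark 3.3.x needs Python 3.10 or older; upgrade to Spark 3.4.0+ for Python 3.11")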
As of 3/2/23, I had the identical problem; as indicated above, I uninstalled Python 3.11, installed version 3.10.9, and it's solved!
I even tried with Python 3.8.5, but the same error persists. I ran standalone Spark in cmd and it works without a flaw and gives the correct output. I am running two versions of Python, 3.8.5 and 3.11.1, with 3.8.5 set as the default, and it still throws the same error. Any corrections to follow?
@MohitAswani To be sure you are absolutely not using Python 3.11, I would uninstall it completely and then see what happens, as you could have other environment variables or Spark configuration that still point to 3.11 instead of 3.8.
I have just installed Python 3.10.11 alongside 3.11.x and used 3.10 as the base interpreter for a venv with PySpark 3.4 and the Spark 3.4.0 server. It works seamlessly too.
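A quick sanity check inside such a venv (version numbers follow the comment above):

    import sys
    import pyspark

    print(sys.version_info[:2])  # expect (3, 10)
    print(pyspark.__version__)   # expect something like 3.4.0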