python-3.x - Error when parsing text with spaCy using PySpark and Jupyter

I am trying to parse some text with spaCy to get word dependencies. I am running PySpark from Anaconda in a Jupyter notebook.

  • Python version: 3.7.5
  • PySpark version: 2.4.4
  • spaCy version: 2.2.5
  • Conda version: 4.7.12
  • Jupyter Notebook version: 6.0.2
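
    For reference, the versions above can be confirmed from inside the notebook with a quick check like the one below (a minimal sketch; it assumes pyspark and spacy are importable in the same kernel that drives PySpark):

    import sys
    import pyspark
    import spacy

    # Print the interpreter and package versions seen by the notebook kernel.
    print('Python:', sys.version.split()[0])
    print('PySpark:', pyspark.__version__)
    print('spaCy:', spacy.__version__)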

    Here is an MVCE that reproduces the error:

    import spacy
    import en_core_web_sm
    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    def get_token_dep(text):
        # Load the spaCy model inside the UDF and return
        # (text, tag, head, dependency) tuples for each token.
        if text:
            nlp = en_core_web_sm.load()
            return [(token.text, token.tag_, token.head.text, token.dep_) for token in nlp(text)]
        else:
            return [['N/A']]

    get_token_dep_udf = udf(get_token_dep, ArrayType(ArrayType(StringType())))

    text_list = ['Chocolate is a food made from cacao beans.', 'Dessert is a course that concludes a meal.']
    text_df = spark.createDataFrame(text_list, StringType())

    text_df = text_df.withColumnRenamed(
        'value', 'text'
    ).withColumn(
        'parsed_text', get_token_dep_udf('text')
    )

    display(text_df.toPandas())

    However, I receive an error like the following:

    ---------------------------------------------------------------------------
    Py4JJavaError Traceback (most recent call last)
    <ipython-input-14-bc4e37a4051a> in <module>
    ----> 1 display(text_df.toPandas())

    ~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\pyspark\sql\dataframe.py in toPandas(self)
    2141
    2142 # Below is toPandas without Arrow optimization.
    -> 2143 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
    2144
    2145 dtype = {}

    ~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\pyspark\sql\dataframe.py in collect(self)
    532 """
    533 with SCCallSiteSync(self._sc) as css:
    --> 534 sock_info = self._jdf.collectToPython()
    535 return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
    536

    ~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
    1255 answer = self.gateway_client.send_command(command)
    1256 return_value = get_return_value(
    -> 1257 answer, self.gateway_client, self.target_id, self.name)
    1258
    1259 for temp_arg in temp_args:

    ~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    61 def deco(*a, **kw):
    62 try:
    ---> 63 return f(*a, **kw)
    64 except py4j.protocol.Py4JJavaError as e:
    65 s = e.java_exception.toString()

    ~\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326 raise Py4JJavaError(
    327 "An error occurred while calling {0}{1}{2}.\n".
    --> 328 format(target_id, ".", name), value)
    329 else:
    330 raise Py4JError(

    Py4JJavaError: An error occurred while calling o147.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\catalogue.py", line 8, in <module>
    import importlib.metadata as importlib_metadata
    ModuleNotFoundError: No module named 'importlib.metadata'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 366, in main
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 241, in read_udfs
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 168, in read_single_udf
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
    return self.loads(obj)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 580, in loads
    return pickle.loads(obj, encoding=encoding)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\cloudpickle.py", line 875, in subimport
    __import__(name)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\en_core_web_sm\__init__.py", line 5, in <module>
    from spacy.util import load_model_from_init_py, get_model_meta
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\__init__.py", line 12, in <module>
    from . import pipeline
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\pipeline\__init__.py", line 4, in <module>
    from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
    File "pipes.pyx", line 1, in init spacy.pipeline.pipes
    File "strings.pxd", line 23, in init spacy.syntax.nn_parser
    File "strings.pyx", line 17, in init spacy.strings
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\util.py", line 16, in <module>
    import catalogue
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\catalogue.py", line 10, in <module>
    import importlib_metadata
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 547, in <module>
    __version__ = version(__name__)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 509, in version
    return distribution(distribution_name).version
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 482, in distribution
    return Distribution.from_name(distribution_name)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 183, in from_name
    dist = next(dists, None)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 425, in <genexpr>
    for path in map(cls._switch_path, paths)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 449, in _search_path
    if not root.is_dir():
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\pathlib.py", line 1358, in is_dir
    return S_ISDIR(self.stat().st_mode)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\pathlib.py", line 1168, in stat
    return self._accessor.stat(self)
    OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\C:\\Users\\user1\\AppData\\Local\\Continuum\\anaconda3\\envs\\py37\\Lib\\site-packages\\pyspark\\jars\\spark-core_2.11-2.4.4.jar'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3263)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3260)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3260)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\catalogue.py", line 8, in <module>
    import importlib.metadata as importlib_metadata
    ModuleNotFoundError: No module named 'importlib.metadata'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 366, in main
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 241, in read_udfs
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 168, in read_single_udf
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 69, in read_command
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 172, in _read_with_length
    return self.loads(obj)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 580, in loads
    return pickle.loads(obj, encoding=encoding)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\cloudpickle.py", line 875, in subimport
    __import__(name)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\en_core_web_sm\__init__.py", line 5, in <module>
    from spacy.util import load_model_from_init_py, get_model_meta
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\__init__.py", line 12, in <module>
    from . import pipeline
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\pipeline\__init__.py", line 4, in <module>
    from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
    File "pipes.pyx", line 1, in init spacy.pipeline.pipes
    File "strings.pxd", line 23, in init spacy.syntax.nn_parser
    File "strings.pyx", line 17, in init spacy.strings
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\spacy\util.py", line 16, in <module>
    import catalogue
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\catalogue.py", line 10, in <module>
    import importlib_metadata
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 547, in <module>
    __version__ = version(__name__)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 509, in version
    return distribution(distribution_name).version
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 482, in distribution
    return Distribution.from_name(distribution_name)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 183, in from_name
    dist = next(dists, None)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 425, in <genexpr>
    for path in map(cls._switch_path, paths)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\site-packages\importlib_metadata\__init__.py", line 449, in _search_path
    if not root.is_dir():
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\pathlib.py", line 1358, in is_dir
    return S_ISDIR(self.stat().st_mode)
    File "C:\Users\user1\AppData\Local\Continuum\anaconda3\envs\py37\lib\pathlib.py", line 1168, in stat
    return self._accessor.stat(self)
    OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\C:\\Users\\user1\\AppData\\Local\\Continuum\\anaconda3\\envs\\py37\\Lib\\site-packages\\pyspark\\jars\\spark-core_2.11-2.4.4.jar'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

    I have tried upgrading Python to 3.8, but Jupyter notebooks do not yet support the newer Python version. Has anyone been able to use spaCy with PySpark in a Jupyter notebook?

  • Best Answer

    Part of the error points to https://github.com/explosion/catalogue/blob/master/catalogue.py#L7, where the import of importlib.metadata appears to go wrong, but not with the expected error type ImportError. I'll put up a PR that also catches ModuleNotFoundError; hopefully that resolves this!
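
    For context, the import fallback in catalogue.py that the traceback walks through looks roughly like this (a simplified sketch reconstructed from the frames above, not the exact upstream source):

    # Prefer the stdlib importlib.metadata (added in Python 3.8);
    # on Python 3.7 this raises ModuleNotFoundError, so fall back to
    # the importlib_metadata backport, which is where the OSError surfaces.
    try:
        import importlib.metadata as importlib_metadata
    except ImportError:
        import importlib_metadata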

    [EDIT:] Hmm, ModuleNotFoundError is a subclass of ImportError, so I don't understand why it isn't caught correctly in the except block :|
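
    A quick check in any Python 3 interpreter confirms the subclass relationship referred to here:

    >>> issubclass(ModuleNotFoundError, ImportError)
    True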

    [EDIT 2:] Logged an issue at https://github.com/explosion/catalogue/issues/4 in case this is indeed related to catalogue.py.

    Regarding python-3.x - Error when parsing text with spaCy using PySpark and Jupyter, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59386397/
