azure - Pyspark Azure Blob Storage - Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found


I am trying to read a CSV file on Azure Blob Storage with pyspark from a Jupyter Notebook, but I am running into the following error:

Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:747)
    at scala.collection.immutable.List.map(List.scala:293)
    at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:745)
    at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:577)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
    ... 29 more

Here are the steps I followed. I have a working Kubernetes cluster.

I installed the JupyterHub Helm chart, which seems to work fine, and I installed PySpark there.

I installed the Bitnami Helm chart to set up a Spark cluster.

I am able to connect to my Spark cluster via pyspark from a Jupyter notebook:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://spark-master-svc:7077").getOrCreate()
spark.sparkContext

I can run some commands against the remote Spark cluster without any problem.
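For example, a trivial job that touches no external storage completes fine (a minimal sanity check of my own, not part of the original setup):

# Hypothetical sanity check: run a small distributed job on the remote cluster.
spark.range(1000).selectExpr("sum(id) as total").show()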

I tried to read a CSV file located on the Blob Storage, but I got the error message pasted above:

SECRET_ACCESS_KEY = "***"
STORAGE_NAME = "***"
file_path = "wasb://***@***.blob.core.windows.net/***.csv"

fs_acc_key = "fs.azure.account.key." + STORAGE_NAME + ".blob.core.windows.net"
spark.conf.set(fs_acc_key, SECRET_ACCESS_KEY)

df_csv = spark.read.csv(
    path=file_path,
    sep='|',
    inferSchema=True,
    header=True
)

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found

After some research, I found out that it is necessary to install several jars (at least hadoop-azure and azure-storage), so I did that in a Dockerfile, as described in the Bitnami documentation:

# https://github.com/bitnami/bitnami-docker-spark/blob/master/3/debian-10/Dockerfile
FROM bitnami/spark:3.2.0-debian-10-r73

USER root

### ADDITIONAL JARS
# https://github.com/bitnami/bitnami-docker-spark#installing-additional-jars
RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.1/hadoop-azure-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-azure-3.3.1.jar &&\
curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/8.6.6/azure-storage-8.6.6.jar --output /opt/bitnami/spark/jars/azure-storage-8.6.6.jar &&\
curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util/11.0.7/jetty-util-11.0.7.jar --output /opt/bitnami/spark/jars/jetty-util-11.0.7.jar &&\
curl https://repo1.maven.org/maven2/org/apache/hadoop/thirdparty/hadoop-shaded-guava/1.1.1/hadoop-shaded-guava-1.1.1.jar --output /opt/bitnami/spark/jars/hadoop-shaded-guava-1.1.1.jar &&\
curl https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar --output /opt/bitnami/spark/jars/httpclient-4.5.13.jar &&\
curl https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.13.1/jackson-databind-2.13.1.jar --output /opt/bitnami/spark/jars/jackson-databind-2.13.1.jar &&\
curl https://repo1.maven.org/maven2/com/fasterxml/jackson/core/jackson-core/2.13.1/jackson-core-2.13.1.jar --output /opt/bitnami/spark/jars/jackson-core-2.13.1.jar &&\
curl https://repo1.maven.org/maven2/org/eclipse/jetty/jetty-util-ajax/11.0.7/jetty-util-ajax-11.0.7.jar --output /opt/bitnami/spark/jars/jetty-util-ajax-11.0.7.jar &&\
curl https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/2.2.0.Final/wildfly-openssl-2.2.0.Final.jar --output /opt/bitnami/spark/jars/wildfly-openssl-2.2.0.Final.jar &&\
curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.1/hadoop-common-3.3.1.jar --output /opt/bitnami/spark/jars/hadoop-common-3.3.1.jar &&\
curl https://repo1.maven.org/maven2/com/microsoft/azure/azure-keyvault-core/1.2.6/azure-keyvault-core-1.2.6.jar --output /opt/bitnami/spark/jars/azure-keyvault-core-1.2.6.jar

USER 1001

I redeployed the Spark cluster, and the jars are in the expected folder.
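For what it's worth, a quick way to check from the notebook whether the JVM it talks to can actually see the class (a sketch relying on py4j internals, not a public API, and it only probes the driver JVM, not the executors):

# Diagnostic sketch: ask the driver JVM to load the class through py4j.
# Raises a Py4JJavaError wrapping ClassNotFoundException if the jar is
# missing from the driver classpath.
spark.sparkContext._jvm.java.lang.Class.forName(
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem")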

However, I still get the same error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found

I tried many configurations found on Stack Overflow, but I keep getting the same result.

spark = SparkSession.builder.master("spark://spark-master-svc:7077") \
.config("spark.jars.packages", "org.apache.hadoop:hadoop-azure-3.3.1,com.microsoft.azure:azure-storage:8.6.6").getOrCreate()

spark = SparkSession.builder.master("spark://spark-master-svc:7077") \
.config("spark.jars.packages", "org.apache.hadoop:hadoop-azure-3.3.1").getOrCreate()

spark.sparkContext._conf.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext._conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext._conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")

No matter which configuration I try, I get the same error message when I attempt to read the CSV file.

I really don't know what else to try; there must be something I am missing.

I hope someone here can help me.

Best Answer

Fixed.
I had the same problem.
This is what solved it for me:

Change this:

spark = SparkSession.builder.master("spark://spark-master-svc:7077") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure-3.3.1,com.microsoft.azure:azure-storage:8.6.6").getOrCreate()

to this:

spark = SparkSession.builder.master("spark://spark-master-svc:7077") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6").getOrCreate()

In the configuration, the hadoop-azure coordinate has to follow the Maven naming convention (groupId:artifactId:version), so changing the '-' before the version number to ':' fixes it.
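Putting the fix together with the rest of the question, a minimal end-to-end sketch (the storage account, container, key and file names are hypothetical placeholders):

from pyspark.sql import SparkSession

# Corrected package coordinates: groupId:artifactId:version
spark = SparkSession.builder.master("spark://spark-master-svc:7077") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:azure-storage:8.6.6") \
    .getOrCreate()

# Hypothetical placeholders for the storage account and its access key.
STORAGE_NAME = "mystorageaccount"
SECRET_ACCESS_KEY = "***"
spark.conf.set("fs.azure.account.key." + STORAGE_NAME + ".blob.core.windows.net",
               SECRET_ACCESS_KEY)

df_csv = spark.read.csv(
    path="wasb://mycontainer@mystorageaccount.blob.core.windows.net/data.csv",
    sep='|',
    inferSchema=True,
    header=True
)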

For "azure - Pyspark Azure Blob Storage - Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/71008716/
